Re: [PR] [HUDI-7032] ShowProcedures show add limit syntax to keep the same [hudi]
xuzifu666 commented on code in PR #9988:
URL: https://github.com/apache/hudi/pull/9988#discussion_r1384544659

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowSavepointsProcedure.scala:

@@ -54,7 +56,11 @@ class ShowSavepointsProcedure extends BaseProcedure with ProcedureBuilder {
     val commits: util.List[HoodieInstant] = timeline.getReverseOrderedInstants.collect(Collectors.toList[HoodieInstant])
     if (commits.isEmpty) Seq.empty[Row] else {
-      commits.toArray.map(instant => instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq
+      if (limit.isDefined) {
+        commits.stream().limit(limit.get.asInstanceOf[Int]).toArray.map(instant => instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq

Review Comment:
I tried to refactor the code to abstract the limit handling into the parent class, but it does not fit well: the parent would have to get the parameter from the subclass and cannot guarantee the position of the limit argument, so I added a unit test first. Besides, there are not many show procedure commands.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7050]Flink HoodieHiveCatalog supports hadoop parameters [hudi]
hudi-bot commented on PR #10013:
URL: https://github.com/apache/hudi/pull/10013#issuecomment-1801234649

## CI report:
* 7272943f1fe1d3c2683fb97bd13f34658e2e04df Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20732)

Bot commands — @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]
hudi-bot commented on PR #9998:
URL: https://github.com/apache/hudi/pull/9998#issuecomment-1801234529

## CI report:
* 420bf60614d20a1caf77bb6616e5fb8d7420b89e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20707)
* 9cd41c8ef03048bb724990ecf93c9db3b5883734 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20728)
* 4689bc88d7a4df6c42918dcb0fc1cc94bc7a05a6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20731)
Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]
hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1801234330

## CI report:
* 71eb41cec2aa93366754e0edf14767febca0c40d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20580)
* b351705a990c8ea6b454dade0a33af1090cdf85c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20730)
Re: [I] [BUG] Spark will read invalid timestamp(3) data when record in log is older than the same in parquet. [hudi]
seekforshell commented on issue #10012:
URL: https://github.com/apache/hudi/issues/10012#issuecomment-1801229697

> Did you check your table creation schema persisted in `hoodie.properties` about the timestamp precision represented as avro format?

Yes, here it is:

```
#Properties saved on 2023-11-06T09:16:47.982Z
#Mon Nov 06 17:16:47 CST 2023
hoodie.table.precombine.field=precombine_field
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.partition.fields=__partition_field
hoodie.table.type=MERGE_ON_READ
hoodie.archivelog.folder=archived
hoodie.compaction.payload.class=org.apache.hudi.common.model.EventTimeAvroPayload
hoodie.timeline.layout.version=1
hoodie.table.version=5
hoodie.table.recordkey.fields=source_from,id
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.name=air08_airflow_bucket_mor_t2
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexAvroKeyGenerator
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.create.schema={"type"\:"record","name"\:"record","fields"\:[{"name"\:"source_from","type"\:["null","int"],"default"\:null},{"name"\:"id","type"\:["null","long"],"default"\:null},{"name"\:"name","type"\:["null","string"],"default"\:null},{"name"\:"create_time","type"\:["null",{"type"\:"long","logicalType"\:"timestamp-millis"}],"default"\:null},{"name"\:"price","type"\:["null",{"type"\:"fixed","name"\:"fixed","namespace"\:"record.price","size"\:6,"logicalType"\:"decimal","precision"\:14,"scale"\:2}],"default"\:null},{"name"\:"extend","type"\:["null","string"],"default"\:null},{"name"\:"count","type"\:["null","long"],"default"\:null},{"name"\:"create_date","type"\:["null",{"type"\:"int","logicalType"\:"date"}],"default"\:null},{"name"\:"ext_dt","type"\:["null",{"type"\:"long","logicalType"\:"timestamp-millis"}],"default"\:null},{"name"\:"precombine_field","type"\:["null","string"],"default"\:null},{"name"\:"sync_deleted","type"\:["null","int"],"default"\:null},{"name"\:"sync_time","type"\:["null",{"type"\:"long","logicalType"\:"timestamp-millis"}],"default"\:null},{"name"\:"__binlog_file","type"\:["null","string"],"default"\:null},{"name"\:"__pos","type"\:["null","int"],"default"\:null},{"name"\:"source_sys","type"\:["null","int"],"default"\:null},{"name"\:"__partition_field","type"\:["null","int"],"default"\:null}]}
hoodie.table.checksum=3920591838
```
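The `hoodie.table.create.schema` above stores `create_time`, `ext_dt`, and `sync_time` as Avro `long` with `logicalType: timestamp-millis`, i.e. timestamp(3). As a standalone illustration (plain Java, not Hudi's actual read path), the sketch below shows the failure mode this issue is probing: a long encoded with millis semantics turns into garbage if a reader applies micros semantics to the same value.

```java
import java.time.Instant;

// Standalone illustration (not Hudi code): a long written with
// timestamp-millis semantics becomes garbage if a reader decodes it
// with timestamp-micros semantics.
public class TimestampPrecision {

    // Decode the value correctly, as milliseconds since the epoch.
    public static Instant readAsMillis(long value) {
        return Instant.ofEpochMilli(value);
    }

    // Misread the same value as microseconds since the epoch:
    // a 2023 millis timestamp collapses into January 1970.
    public static Instant misreadAsMicros(long value) {
        return Instant.ofEpochSecond(value / 1_000_000L, (value % 1_000_000L) * 1_000L);
    }

    public static void main(String[] args) {
        long millis = 1699262207982L; // 2023-11-06T09:16:47.982Z
        System.out.println(readAsMillis(millis));    // the intended 2023 instant
        System.out.println(misreadAsMicros(millis)); // lands in January 1970
    }
}
```

The same mismatch in the opposite direction (millis read as micros-scaled, or micros read as millis) shifts values by a factor of 1000, which is why the persisted Avro precision has to agree with what the log and parquet readers assume.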
Re: [PR] [HUDI-7050]Flink HoodieHiveCatalog supports hadoop parameters [hudi]
hudi-bot commented on PR #10013:
URL: https://github.com/apache/hudi/pull/10013#issuecomment-1801226730

## CI report:
* 7272943f1fe1d3c2683fb97bd13f34658e2e04df UNKNOWN
Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]
hudi-bot commented on PR #9998:
URL: https://github.com/apache/hudi/pull/9998#issuecomment-1801226585

## CI report:
* 420bf60614d20a1caf77bb6616e5fb8d7420b89e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20707)
* 9cd41c8ef03048bb724990ecf93c9db3b5883734 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20728)
* 4689bc88d7a4df6c42918dcb0fc1cc94bc7a05a6 UNKNOWN
Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]
hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1801226273

## CI report:
* 71eb41cec2aa93366754e0edf14767febca0c40d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20580)
* b351705a990c8ea6b454dade0a33af1090cdf85c UNKNOWN
[jira] [Updated] (HUDI-7050) Flink hoodiehivecatalog supports hadoop parameters
[ https://issues.apache.org/jira/browse/HUDI-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7050:
---------------------------------
    Labels: pull-request-available  (was: )

> Flink hoodiehivecatalog supports hadoop parameters
> --------------------------------------------------
>
>                 Key: HUDI-7050
>                 URL: https://issues.apache.org/jira/browse/HUDI-7050
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: flink-sql
>            Reporter: waywtdcc
>            Priority: Major
>              Labels: pull-request-available
>
> Flink hoodiehivecatalog supports hadoop parameters

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7050]Flink HoodieHiveCatalog supports hadoop parameters [hudi]
waywtdcc opened a new pull request, #10013:
URL: https://github.com/apache/hudi/pull/10013

### Change Logs
Flink HoodieHiveCatalog supports hadoop parameters

### Impact
Flink HoodieHiveCatalog supports hadoop parameters

### Risk level (write none, low medium or high below)
low

### Documentation Update
Flink HoodieHiveCatalog supports hadoop parameters

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Created] (HUDI-7050) Flink hoodiehivecatalog supports hadoop parameters
waywtdcc created HUDI-7050:
------------------------------
             Summary: Flink hoodiehivecatalog supports hadoop parameters
                 Key: HUDI-7050
                 URL: https://issues.apache.org/jira/browse/HUDI-7050
             Project: Apache Hudi
          Issue Type: Improvement
          Components: flink-sql
            Reporter: waywtdcc

Flink hoodiehivecatalog supports hadoop parameters
Re: [I] [BUG] Spark will read invalid timestamp(3) data when record in log is older than the same in parquet. [hudi]
danny0405 commented on issue #10012:
URL: https://github.com/apache/hudi/issues/10012#issuecomment-1801196648

Did you check the table creation schema persisted in `hoodie.properties` for the timestamp precision as represented in the Avro format?
Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]
danny0405 commented on code in PR #9998:
URL: https://github.com/apache/hudi/pull/9998#discussion_r1386080725

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

@@ -427,7 +427,7 @@ public class HoodieWriteConfig extends HoodieConfig {
   public static final ConfigProperty INSTANT_STATE_TIMELINE_SERVER_BASED = ConfigProperty
       .key("hoodie.instant_state.timeline_server_based.enabled")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
This is only an improvement to Flink writers currently.
Re: [PR] [HUDI-7045] fix evolution by using legacy ff for reader [hudi]
yihua commented on code in PR #10007:
URL: https://github.com/apache/hudi/pull/10007#discussion_r1386080312

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/NewHoodieParquetFileFormat.scala:

@@ -74,11 +76,25 @@ class NewHoodieParquetFileFormat(tableState: Broadcast[HoodieTableState],
   override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
     if (!supportBatchCalled) {
       supportBatchCalled = true
-      supportBatchResult = !isMOR && super.supportBatch(sparkSession, schema)
+      supportBatchResult = !isMOR && legacyFF.supportBatch(sparkSession, schema)
     }
     supportBatchResult
   }
+
+  private def wrapWithBatchConverter(reader: PartitionedFile => Iterator[InternalRow]): PartitionedFile => Iterator[InternalRow] = {

Review Comment:
Right. @jonvex I think Spark internally handles the batch processing (`InternalRow` vs `ColumnarBatch`) based on the boolean `supportBatch` returns. So we don't have to do the batch converter here?
Re: [I] [SUPPORT] Solution for synchronizing the entire database table in flink [hudi]
ad1happy2go commented on issue #9965:
URL: https://github.com/apache/hudi/issues/9965#issuecomment-1801186970

@bajiaolong
1. It's not limited to 20. I guess what @danny0405 meant is that the number of tables should be a handful, as you may need to manage that number of streams.
2. Not sure if there is a way to read only specific partitions from a Kafka topic.
[jira] [Closed] (HUDI-7030) Log reader data lost as that not consistent behavior in timeline's containsInstant
[ https://issues.apache.org/jira/browse/HUDI-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7030.
----------------------------
    Resolution: Fixed

Fixed via master branch: e731755b99057dc916378f1f7e95c73642ff96e8

> Log reader data lost as that not consistent behavior in timeline's containsInstant
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-7030
>                 URL: https://issues.apache.org/jira/browse/HUDI-7030
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: ann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.14.1
>
>         Attachments: image-2023-11-03-19-48-29-441.png, image-2023-11-03-19-49-22-894.png, image-2023-11-03-19-50-11-849.png, image-2023-11-03-19-58-39-495.png, image-2023-11-03-20-06-00-579.png, image-2023-11-03-20-06-13-905.png, image-2023-11-03-20-07-30-201.png
>
> The log reader filters out all log data blocks that come from inflight instants. !image-2023-11-03-19-49-22-894.png!
> *containsInstant* returns false when the input instant's timestamp does not equal any instant timestamp in the inflight timeline. !image-2023-11-03-20-07-30-201.png!
> But currently, when the input to the timeline's *containsInstant* is the instant's timestamp, it returns true.
> When the input is an instant padded with DEFAULT_MILLIS_EXT, the instant's timestamp is less than some instant timestamp in the timeline. !image-2023-11-03-19-50-11-849.png!
> In the end, the log reader skips the completed delta commit instant, causing data loss. !image-2023-11-03-19-58-39-495.png!
> I think the timeline's *containsInstant* should behave consistently, so containsOrBeforeTimelineStarts should be replaced with containsInstant. !image-2023-11-03-19-48-29-441.png!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7030) Log reader data lost as that not consistent behavior in timeline's containsInstant
[ https://issues.apache.org/jira/browse/HUDI-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7030:
-----------------------------
    Fix Version/s: 1.0.0
                   0.14.1
(hudi) branch master updated: [HUDI-7030] Update containsInstant without containsOrBeforeTimelineStarts to fix data lost (#9982)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new e731755b990  [HUDI-7030] Update containsInstant without containsOrBeforeTimelineStarts to fix data lost (#9982)

e731755b990 is described below

commit e731755b99057dc916378f1f7e95c73642ff96e8
Author: xoln ann
AuthorDate: Wed Nov 8 14:39:32 2023 +0800

    [HUDI-7030] Update containsInstant without containsOrBeforeTimelineStarts to fix data lost (#9982)
---
 .../hudi/client/functional/TestHoodieIndex.java    | 21 +
 .../table/timeline/HoodieDefaultTimeline.java      |  2 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java
index 4518b909813..37199c783bb 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java
@@ -553,6 +553,27 @@ public class TestHoodieIndex extends TestHoodieMetadataBase {
     assertFalse(timeline.empty());
     assertFalse(HoodieIndexUtils.checkIfValidCommit(timeline, instantTimestamp));
     assertFalse(HoodieIndexUtils.checkIfValidCommit(timeline, instantTimestampSec));
+
+    // Check the completed delta commit instant which is end with DEFAULT_MILLIS_EXT timestamp
+    // Timestamp not contain in inflight timeline, checkContainsInstant() should return false
+    // Timestamp contain in inflight timeline, checkContainsInstant() should return true
+    String checkInstantTimestampSec = instantTimestamp.substring(0, instantTimestamp.length() - HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT.length());
+    String checkInstantTimestamp = checkInstantTimestampSec + HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT;
+    Thread.sleep(2000); // sleep required so that new timestamp differs in the seconds rather than msec
+    String newTimestamp = writeClient.createNewInstantTime();
+    String newTimestampSec = newTimestamp.substring(0, newTimestamp.length() - HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT.length());
+    final HoodieInstant instant5 = new HoodieInstant(true, HoodieTimeline.DELTA_COMMIT_ACTION, newTimestamp);
+    timeline = new HoodieDefaultTimeline(Stream.of(instant5), metaClient.getActiveTimeline()::getInstantDetails);
+    assertFalse(timeline.empty());
+    assertFalse(timeline.containsInstant(checkInstantTimestamp));
+    assertFalse(timeline.containsInstant(checkInstantTimestampSec));
+
+    final HoodieInstant instant6 = new HoodieInstant(true, HoodieTimeline.DELTA_COMMIT_ACTION, newTimestampSec + HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT);
+    timeline = new HoodieDefaultTimeline(Stream.of(instant6), metaClient.getActiveTimeline()::getInstantDetails);
+    assertFalse(timeline.empty());
+    assertFalse(timeline.containsInstant(newTimestamp));
+    assertFalse(timeline.containsInstant(checkInstantTimestamp));
+    assertTrue(timeline.containsInstant(instant6.getTimestamp()));
   }

   @Test
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
index ec7c9633576..ecf7c938b01 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
@@ -440,7 +440,7 @@ public class HoodieDefaultTimeline implements HoodieTimeline {
     // Check for older timestamp which have sec granularity and an extension of DEFAULT_MILLIS_EXT may have been added via Timeline operations
     if (ts.length() == HoodieInstantTimeGenerator.MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH && ts.endsWith(HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT)) {
       final String actualOlderFormatTs = ts.substring(0, ts.length() - HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT.length());
-      return containsOrBeforeTimelineStarts(actualOlderFormatTs);
+      return containsInstant(actualOlderFormatTs);
     }
     return false;
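The one-line fix in `HoodieDefaultTimeline.containsInstant` can be sketched in isolation. The snippet below is a hypothetical stand-in, not Hudi's actual classes: the constant values and the `Set`-based "timeline" are assumptions made so the logic is self-contained, with the millis extension taken to be a three-character suffix padded onto older second-granularity timestamps.

```java
import java.util.Set;

// Hypothetical stand-in for the fixed fallback in containsInstant.
// Constants and the Set-based timeline are assumptions for illustration.
public class InstantLookup {

    // Assumed values mirroring HoodieInstantTimeGenerator's constants.
    static final String DEFAULT_MILLIS_EXT = "999";
    static final int MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH = 17; // yyyyMMddHHmmssSSS

    static boolean containsInstant(Set<String> timelineTimestamps, String ts) {
        if (timelineTimestamps.contains(ts)) {
            return true;
        }
        // Older timestamps have second granularity; timeline operations may
        // have padded them with DEFAULT_MILLIS_EXT. Strip the padding and
        // retry an exact membership check. The commit replaced a
        // containsOrBeforeTimelineStarts call at this point, which also
        // matched timestamps before the timeline start and made the log
        // reader skip completed delta commits.
        if (ts.length() == MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH && ts.endsWith(DEFAULT_MILLIS_EXT)) {
            String secGranularityTs = ts.substring(0, ts.length() - DEFAULT_MILLIS_EXT.length());
            return timelineTimestamps.contains(secGranularityTs);
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> timeline = Set.of("20231106091647");
        System.out.println(containsInstant(timeline, "20231106091647999")); // padding stripped, found
        System.out.println(containsInstant(timeline, "20231106091648999")); // different second, not found
    }
}
```

With an exact membership check, a padded timestamp only matches when the corresponding second-granularity instant is really in the timeline, which is the consistent behavior the Jira asks for.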
Re: [PR] [HUDI-7030] update containsInstant without containsOrBeforeTimelineStarts to fix data lost [hudi]
danny0405 merged PR #9982:
URL: https://github.com/apache/hudi/pull/9982
Re: [PR] [HUDI-7030] update containsInstant without containsOrBeforeTimelineStarts to fix data lost [hudi]
danny0405 commented on PR #9982:
URL: https://github.com/apache/hudi/pull/9982#issuecomment-1801183877

Tests have passed: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=20704&view=results
Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]
hudi-bot commented on PR #9998:
URL: https://github.com/apache/hudi/pull/9998#issuecomment-1801183250

## CI report:
* 420bf60614d20a1caf77bb6616e5fb8d7420b89e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20707)
* 9cd41c8ef03048bb724990ecf93c9db3b5883734 UNKNOWN
Re: [PR] [HUDI-7032] ShowProcedures show add limit syntax to keep the same [hudi]
xuzifu666 commented on code in PR #9988:
URL: https://github.com/apache/hudi/pull/9988#discussion_r1386071295

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowSavepointsProcedure.scala:

@@ -54,7 +56,11 @@ class ShowSavepointsProcedure extends BaseProcedure with ProcedureBuilder {
     val commits: util.List[HoodieInstant] = timeline.getReverseOrderedInstants.collect(Collectors.toList[HoodieInstant])
     if (commits.isEmpty) Seq.empty[Row] else {
-      commits.toArray.map(instant => instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq
+      if (limit.isDefined) {
+        commits.stream().limit(limit.get.asInstanceOf[Int]).toArray.map(instant => instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq

Review Comment:
I tried to construct a 'limit and collect' method in the parent class, but faced two problems: 1. the parameter can be a list or an RDD, so it cannot be kept uniform; 2. some lists need their own handling logic and some do not, which forces the implementation into the subclasses.
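The pattern in the hunk above — apply an optional limit to the already reverse-ordered instants before mapping them to rows — reduces to a small sketch. This is plain Java with hypothetical names, not the procedure's actual Scala:

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Minimal sketch (hypothetical names): apply an optional LIMIT to an
// already ordered list of instant timestamps before emitting rows.
public class LimitExample {

    static List<String> applyLimit(List<String> orderedTimestamps, Optional<Integer> limit) {
        if (limit.isPresent()) {
            // Stream.limit(n) keeps the first n elements in encounter order,
            // i.e. the newest instants when the input is reverse ordered.
            return orderedTimestamps.stream().limit(limit.get()).collect(Collectors.toList());
        }
        return orderedTimestamps;
    }

    public static void main(String[] args) {
        List<String> reverseOrdered = List.of("20231108", "20231107", "20231106");
        System.out.println(applyLimit(reverseOrdered, Optional.of(2)));   // two newest
        System.out.println(applyLimit(reverseOrdered, Optional.empty())); // all three
    }
}
```

Because the input is already sorted newest-first, truncating with `limit` is cheap and needs no re-sort; the ordering concern the reviewer raises is about where the limit argument sits in each procedure's parameter list, not about this stream operation.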
[I] [BUG] Spark will read invalid timestamp(3) data when record in log is older than the same in parquet. [hudi]
seekforshell opened a new issue, #10012: URL: https://github.com/apache/hudi/issues/10012 Describe the problem you faced Spark read invalid timestamp(3) data when record in log is older than the same in parquet. To Reproduce 1. create a mor table with timestamp(3) type. eg. CREATE EXTERNAL TABLE `xxx.bucket_mor_t2`( `_hoodie_commit_time` string COMMENT '', `_hoodie_commit_seqno` string COMMENT '', `_hoodie_record_key` string COMMENT '', `_hoodie_partition_path` string COMMENT '', `_hoodie_file_name` string COMMENT '', `source_from` int COMMENT '', `id` bigint COMMENT '', `name` string COMMENT '', `create_time` timestamp COMMENT '', `price` decimal(14,2) COMMENT '', `extend` string COMMENT '', `count` bigint COMMENT '', `create_date` date COMMENT '', `ext_dt` timestamp COMMENT '', `precombine_field` string COMMENT '', `sync_deleted` int COMMENT '', `sync_time` timestamp COMMENT '', `__binlog_file` string COMMENT '', `__pos` int COMMENT '', `source_sys` int COMMENT '') PARTITIONED BY ( `__partition_field` int COMMENT '') ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ( 'hoodie.query.as.ro.table'='false', 'path'='hdfs://NameNodeService1/xxx/xxx/bucket_mor_t2') STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://NameNodeService1/xxx/xxx/bucket_mor_t2' TBLPROPERTIES ( 'connector'='hudi', 'hoodie.datasource.write.recordkey.field'='source_from,id', 'last_commit_time_sync'='20231106172508127', 'path'='hdfs://NameNodeService1/xxx/xxx/bucket_mor_t2', 'spark.sql.sources.provider'='hudi', 'spark.sql.sources.schema.numPartCols'='1', 'spark.sql.sources.schema.numParts'='1', 
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"source_from","type":"integer","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"create_time","type":"timestamp","nullable":true,"metadata":{}},{"name":"price","type":"decimal(14,2)","nullable":true,"metadata":{}},{"name":"extend","type":"string","nullable":true,"metadata":{}},{"name":"count","type":"long","nullable":true,"metadata":{}},{"name":"create_date","type":"date","nullable":true,"metadata":{}},{"name":"ext_dt","ty pe":"timestamp","nullable":true,"metadata":{}},{"name":"precombine_field","type":"string","nullable":true,"metadata":{}},{"name":"sync_deleted","type":"integer","nullable":true,"metadata":{}},{"name":"sync_time","type":"timestamp","nullable":true,"metadata":{}},{"name":"__binlog_file","type":"string","nullable":true,"metadata":{}},{"name":"__pos","type":"integer","nullable":true,"metadata":{}},{"name":"source_sys","type":"integer","nullable":true,"metadata":{}},{"name":"__partition_field","type":"integer","nullable":true,"metadata":{}}]}', 'spark.sql.sources.schema.partCol.0'='__partition_field', 'table.type'='MERGE_ON_READ', 'transient_lastDdlTime'='1692251328') 2. insert new data into parquet with flink engine. eg. insert a record(id=1) with precombine value = 013088002803892750 3. mock binlog(same record in step2) with precombine value = 1 (which is smaller than before) and commit but don't do compaction finally, read record(id=1) in snapthot mode with spark sql. 
invalid data will occur: ![b6c3e286dd36ef29f47f6ec569983e82](https://github.com/apache/hudi/assets/8132965/06d3a4b5-ae06-4387-9b2a-0e6b12127e2a) Expected behavior when
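For context on why the ordering in this reproduction matters: during a MOR snapshot read, Hudi keeps whichever version of a record has the larger precombine (ordering) value. The sketch below is a minimal, hypothetical illustration (not Hudi's actual merge API) of how the comparison semantics of the precombine field, string versus numeric, change which record survives for the exact values used in the steps above:

```python
def merge_by_precombine(base_record, log_record, key="precombine_field", as_int=False):
    """Return the surviving record: the one with the larger ordering value.
    Hypothetical helper for illustration only, not Hudi's real merge API."""
    base_v, log_v = base_record[key], log_record[key]
    if as_int:
        base_v, log_v = int(base_v), int(log_v)
    return log_record if log_v >= base_v else base_record

# The values from the reproduction: the base (parquet) record carries the
# large precombine value, the mocked log record carries "1".
base = {"id": 1, "precombine_field": "013088002803892750", "src": "parquet"}
log = {"id": 1, "precombine_field": "1", "src": "log"}

# Compared as strings, "1" > "013088002803892750" lexicographically, so the
# log record wins; compared numerically, 1 < 13088002803892750, so the base
# record wins. The declared type of the precombine field thus decides the outcome.
print(merge_by_precombine(base, log)["src"])               # -> log
print(merge_by_precombine(base, log, as_int=True)["src"])  # -> parquet
```

This does not claim to be the root cause of the invalid timestamp values, only a demonstration that the reported setup (string precombine field, numerically smaller log value) is sensitive to how the ordering comparison is performed.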
Re: [PR] [HUDI-7046] Fix partial merging logic based on projected reader schema [hudi]
hudi-bot commented on PR #10011: URL: https://github.com/apache/hudi/pull/10011#issuecomment-1801175482 ## CI report: * 1e96f587385fb7969f2f24946dcec50f9533dee8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20727) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]
hudi-bot commented on PR #10010: URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801175442 ## CI report: * 8f048d83427375dcc856ef78872a3d8247c9390f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20723) * aa9fd9357a2398f3d35e9e3bb71cd9bee4be8432 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20726)
Re: [PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]
codope commented on code in PR #10009: URL: https://github.com/apache/hudi/pull/10009#discussion_r1386050749 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieIncrementalFileIndex.scala: ## @@ -36,9 +36,10 @@ class HoodieIncrementalFileIndex(override val spark: SparkSession, override val schemaSpec: Option[StructType], override val options: Map[String, String], @transient override val fileStatusCache: FileStatusCache = NoopCache, - override val includeLogFiles: Boolean) + override val includeLogFiles: Boolean, + override val shouldEmbedFileSlices: Boolean) Review Comment: do we have some bootstrap tests covering the path w/ and w/o `shouldEmbedFileSlices`? ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieIncrementalFileIndex.scala: ## @@ -47,52 +48,52 @@ class HoodieIncrementalFileIndex(override val spark: SparkSession, val fileSlices = mergeOnReadIncrementalRelation.listFileSplits(partitionFilters, dataFilters) if (fileSlices.isEmpty) { Seq.empty -} - -val prunedPartitionsAndFilteredFileSlices = fileSlices.map { - case (partitionValues, fileSlices) => -if (shouldEmbedFileSlices) { - val baseFileStatusesAndLogFileOnly: Seq[FileStatus] = fileSlices.map(slice => { -if (slice.getBaseFile.isPresent) { - slice.getBaseFile.get().getFileStatus -} else if (slice.getLogFiles.findAny().isPresent) { - slice.getLogFiles.findAny().get().getFileStatus +} else { + val prunedPartitionsAndFilteredFileSlices = fileSlices.map { +case (partitionValues, fileSlices) => + if (shouldEmbedFileSlices) { Review Comment: I see some opportunity to reuse code here and in `HoodieFileIndex.listFiles`.
Re: [I] [SUPPORT]insert_overwrite mode writing 2 times more duplicates [hudi]
ad1happy2go commented on issue #9992: URL: https://github.com/apache/hudi/issues/9992#issuecomment-1801165966 @rishabhreply Did this resolve your doubts? Let us know if you need any more help. Thanks.
Re: [PR] [HUDI-7046] Fix partial merging logic based on projected reader schema [hudi]
hudi-bot commented on PR #10011: URL: https://github.com/apache/hudi/pull/10011#issuecomment-1801164803 ## CI report: * 1e96f587385fb7969f2f24946dcec50f9533dee8 UNKNOWN
Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]
hudi-bot commented on PR #10010: URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801164759 ## CI report: * 8f048d83427375dcc856ef78872a3d8247c9390f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20723) * aa9fd9357a2398f3d35e9e3bb71cd9bee4be8432 UNKNOWN
Re: [PR] [HUDI-7045] fix evolution by using legacy ff for reader [hudi]
codope commented on code in PR #10007: URL: https://github.com/apache/hudi/pull/10007#discussion_r1385970457 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/NewHoodieParquetFileFormat.scala: ## @@ -74,11 +76,25 @@ class NewHoodieParquetFileFormat(tableState: Broadcast[HoodieTableState], override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = { if (!supportBatchCalled) { supportBatchCalled = true - supportBatchResult = !isMOR && super.supportBatch(sparkSession, schema) + supportBatchResult = !isMOR && legacyFF.supportBatch(sparkSession, schema) } supportBatchResult } + private def wrapWithBatchConverter(reader: PartitionedFile => Iterator[InternalRow]): PartitionedFile => Iterator[InternalRow] = { Review Comment: why is this needed? i think the flatmap per row could incur some significant cost for a large batch. Instead of wrapping every time, can it be guarded for some cases such as when schema on read is enabled?
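To make the reviewer's concern concrete, here is a minimal sketch (plain Python stand-ins, not Spark's actual reader API; `read` and `needs_rows` are illustrative names) of guarding the batch-to-row conversion so the flattening cost is only paid when row output is actually required, e.g. when schema on read is enabled:

```python
from itertools import chain

def wrap_with_batch_converter(read, needs_rows):
    """Only flatten columnar batches into rows when the caller needs rows;
    otherwise hand back the vectorized reader untouched. Illustrative sketch
    of the guard suggested in the review comment."""
    if not needs_rows:
        return read

    def read_as_rows(partitioned_file):
        # This per-batch flattening is the cost the comment warns about.
        return chain.from_iterable(read(partitioned_file))

    return read_as_rows

# A fake reader that yields two "columnar batches" of rows.
fake_reader = lambda f: iter([[{"id": 1}, {"id": 2}], [{"id": 3}]])

rows = list(wrap_with_batch_converter(fake_reader, needs_rows=True)("file"))
print(len(rows))  # -> 3 individual rows
batches = list(wrap_with_batch_converter(fake_reader, needs_rows=False)("file"))
print(len(batches))  # -> 2 untouched batches
```

The design point is that the unguarded version converts every batch to rows even when downstream code could consume batches directly, losing the benefit of vectorized reads.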
Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]
majian1998 commented on code in PR #10010: URL: https://github.com/apache/hudi/pull/10010#discussion_r1386042124 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsFileSystemReporter.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.metrics; + +import org.apache.hudi.common.config.SerializableConfiguration; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.util.JsonUtils; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.config.HoodieWriteConfig; + +import com.codahale.metrics.MetricRegistry; +import org.apache.hadoop.fs.FSDataOutputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.Map; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; + +public class MetricsFileSystemReporter extends MetricsReporter { + + private static final Logger LOG = LoggerFactory.getLogger(MetricsFileSystemReporter.class); + private MetricRegistry metricRegistry; + private SerializableConfiguration hadoopConf; + private String metricsPath; + private HoodieWriteConfig config; + private FileSystem fs; + private ScheduledExecutorService executor; + private static final String META_FOLDER_NAME = "/.hoodie"; + private static final String METRICS_FOLDER_NAME = "/metrics"; + private static final String METRICS_FILE_NAME = "_metrics.json"; Review Comment: Currently, the issue of storing multiple versions of results has been considered, and an overwrite parameter has been reserved to control this. The initial idea is to add a timestamp + cleanup strategy. However, this would be quite complex, so it has been temporarily placed in the TODO list, and the feasibility of the file system reporter needs to be confirmed first. Regarding the current way of overwriting files, by default, the data will be written under the ".hoodie" directory of the table and only one file will be kept, so there will be no conflicts with tables of the same name. 
Additionally, the file name prefix is designed to handle overwriting of results across different features or table scenarios. There is certainly room for optimization here.
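As a rough illustration of the overwrite behavior described above, the following sketch (a hypothetical `FileSystemMetricsReporter` using plain local files instead of Hadoop `FileSystem`, and no scheduled executor) keeps a single latest metrics snapshot per file-name prefix:

```python
import json
import os
import tempfile

class FileSystemMetricsReporter:
    """Minimal sketch of a file-system metrics reporter. Illustrative only;
    the reporter in PR #10010 uses Hadoop FileSystem, a MetricRegistry, and
    a ScheduledExecutorService."""

    def __init__(self, metrics_dir, prefix="table1", overwrite=True):
        self.path = os.path.join(metrics_dir, f"{prefix}_metrics.json")
        self.overwrite = overwrite

    def report(self, metrics):
        # Overwriting keeps one latest snapshot per prefix, mirroring the
        # PR's default of a single file under the table's metrics folder,
        # so same-named metrics files for other tables don't conflict as
        # long as their prefixes differ.
        mode = "w" if self.overwrite else "a"
        with open(self.path, mode) as f:
            json.dump(metrics, f)
            f.write("\n")
        return self.path

metrics_dir = tempfile.mkdtemp()
reporter = FileSystemMetricsReporter(metrics_dir)
reporter.report({"commit.duration_ms": 1200})
path = reporter.report({"commit.duration_ms": 900})  # overwrites the previous snapshot
with open(path) as f:
    print(f.read().strip())  # -> {"commit.duration_ms": 900}
```

A timestamped-file plus cleanup strategy, as mentioned in the TODO, would replace the `overwrite` flag with one file per report interval and a retention policy.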
(hudi) branch master updated (0cb77908357 -> b08874268fb)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 0cb77908357 [HUDI-7042] Fix new filegroup reader (#10003) add b08874268fb [MINOR] Fix tests that set precombine to nonexistent field (#10008) No new revisions were added by this update. Summary of changes: .../src/test/scala/org/apache/hudi/TestHoodieFileIndex.scala | 3 ++- .../src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-)
Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]
yihua merged PR #10008: URL: https://github.com/apache/hudi/pull/10008
[jira] [Updated] (HUDI-7046) Fix partial merging logic based on projected schema
[ https://issues.apache.org/jira/browse/HUDI-7046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7046: - Labels: pull-request-available (was: ) > Fix partial merging logic based on projected schema > --- > > Key: HUDI-7046 > URL: https://issues.apache.org/jira/browse/HUDI-7046 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > When querying the table with multiple round of partial updates generating > multiple log files, the partial merging logic may fail or give wrong results > due to schema handling and merging logic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7046] Fix partial merging logic based on projected reader schema [hudi]
yihua opened a new pull request, #10011: URL: https://github.com/apache/hudi/pull/10011 ### Change Logs This PR fixes the logic of merging partial updates with projected reader schema, i.e., the reader schema contains a subset of fields from the table schema based on the query. - When processing log records in `HoodieBaseFileGroupRecordBuffer#doProcessNextDataRecord`, the schema of the combined record is also updated in the metadata since the schema can change due to partial merging; - A bug of getting the field values from the older record in `SparkRecordMergingUtils#mergePartialRecords` is fixed. - The partial update tests in `TestPartialUpdateForMergeInto` are enhanced to cover partial merging logic. ### Impact Makes sure the partial merging logic is correct. ### Risk level low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
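A minimal sketch of the partial-merging idea this PR describes, using plain Python dicts as stand-ins for Spark rows (illustrative only; the real logic lives in `SparkRecordMergingUtils#mergePartialRecords`): fields present in the newer partial record override the older record, and the merged row can then be projected to the reader schema, which may be a subset of the table schema:

```python
def merge_partial(older, newer_partial):
    """Fields present in the newer (partial) record win; everything else is
    carried over from the older record. Hypothetical helper for illustration."""
    merged = dict(older)
    merged.update(newer_partial)
    return merged

def project(record, reader_schema):
    # The reader schema may contain only a subset of the table's fields
    # (column pruning driven by the query).
    return {f: record[f] for f in reader_schema}

older = {"id": 1, "name": "widget", "price": 10.0, "qty": 5}
partial_update = {"id": 1, "price": 12.5}  # e.g. MERGE INTO ... UPDATE SET price
merged = merge_partial(older, partial_update)
print(merged)                            # -> {'id': 1, 'name': 'widget', 'price': 12.5, 'qty': 5}
print(project(merged, ["id", "price"]))  # -> {'id': 1, 'price': 12.5}
```

The subtlety the PR addresses is that after such a merge the record's effective schema differs from both inputs, so the combined record's schema metadata must be updated before the next round of merging.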
[jira] [Updated] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6790: -- Status: Patch Available (was: In Progress) > Support incremental read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6790 > URL: https://issues.apache.org/jira/browse/HUDI-6790 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > >
[jira] [Closed] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-6790. - Resolution: Done > Support incremental read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6790 > URL: https://issues.apache.org/jira/browse/HUDI-6790 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > >
[jira] [Closed] (HUDI-7042) Fix filegroup reader
[ https://issues.apache.org/jira/browse/HUDI-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-7042. - Resolution: Fixed > Fix filegroup reader > > > Key: HUDI-7042 > URL: https://issues.apache.org/jira/browse/HUDI-7042 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Fix following issues for the new filegroup reader: > - Handle nested schema > - Append partition values correctly.
Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]
hudi-bot commented on PR #10010: URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801067235 ## CI report: * 8f048d83427375dcc856ef78872a3d8247c9390f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20723)
Re: [PR] [HUDI-7042] Fix new filegroup reader [hudi]
codope merged PR #10003: URL: https://github.com/apache/hudi/pull/10003
(hudi) branch master updated: [HUDI-7042] Fix new filegroup reader (#10003)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 0cb77908357 [HUDI-7042] Fix new filegroup reader (#10003) 0cb77908357 is described below commit 0cb7790835775c39b5cf71683b95f7618c6c95cc Author: Sagar Sumit AuthorDate: Wed Nov 8 09:58:28 2023 +0530 [HUDI-7042] Fix new filegroup reader (#10003) --- .../hudi/common/model/HoodieSparkRecord.java | 2 + .../read/HoodieBaseFileGroupRecordBuffer.java | 2 +- .../common/table/read/HoodieFileGroupReader.java | 2 +- .../table/read/HoodieFileGroupRecordBuffer.java| 2 +- ...odieFileGroupReaderBasedParquetFileFormat.scala | 69 +++--- .../hudi/functional/TestMORDataSourceStorage.scala | 16 +++-- .../functional/TestPartialUpdateAvroPayload.scala | 23 +--- style/scalastyle.xml | 2 +- 8 files changed, 93 insertions(+), 25 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java index 3d59ad27257..5cb8800411c 100644 --- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java +++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java @@ -40,6 +40,7 @@ import org.apache.spark.sql.catalyst.CatalystTypeConverters; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; import org.apache.spark.sql.catalyst.expressions.JoinedRow; +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow; import org.apache.spark.sql.catalyst.expressions.UnsafeProjection; import org.apache.spark.sql.catalyst.expressions.UnsafeRow; import org.apache.spark.sql.types.DataType; @@ -447,6 +448,7 @@ public class HoodieSparkRecord extends HoodieRecord { || 
schema != null && ( data instanceof HoodieInternalRow || data instanceof GenericInternalRow +|| data instanceof SpecificInternalRow || SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data)); ValidationUtils.checkState(isValid); diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java index 4a1bd08e4ef..90ebf71dfb1 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java @@ -80,7 +80,7 @@ public abstract class HoodieBaseFileGroupRecordBuffer implements HoodieFileGr } @Override - public void setBaseFileIteraotr(ClosableIterator baseFileIterator) { + public void setBaseFileIterator(ClosableIterator baseFileIterator) { this.baseFileIterator = baseFileIterator; } diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java index b655238412d..2850a77d709 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java @@ -154,7 +154,7 @@ public final class HoodieFileGroupReader implements Closeable { baseFilePath.get().getHadoopPath(), start, length, readerState.baseFileAvroSchema, readerState.baseFileAvroSchema, hadoopConf) : new EmptyIterator<>(); scanLogFiles(); -recordBuffer.setBaseFileIteraotr(baseFileIterator); +recordBuffer.setBaseFileIterator(baseFileIterator); } /** diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java index 680bbf9d705..0bf27cfc71e 
100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java @@ -100,7 +100,7 @@ public interface HoodieFileGroupRecordBuffer { * * @param baseFileIterator */ - void setBaseFileIteraotr(ClosableIterator baseFileIterator); + void setBaseFileIterator(ClosableIterator baseFileIterator); /** * Check if next merged record exists. diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/executi
Re: [PR] [HUDI-7042] Fix new filegroup reader [hudi]
codope commented on code in PR #10003: URL: https://github.com/apache/hudi/pull/10003#discussion_r1385964255 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -201,19 +208,65 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, length, shouldUseRecordPosition) reader.initRecordIterators() - reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala +// Append partition values to rows and project to output schema +appendPartitionAndProject( + reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala, + requiredSchemaWithMandatory, + partitionSchema, + outputSchema, + partitionValues) + } + + private def appendPartitionAndProject(iter: Iterator[InternalRow], +inputSchema: StructType, +partitionSchema: StructType, +to: StructType, +partitionValues: InternalRow): Iterator[InternalRow] = { +if (partitionSchema.isEmpty) { + projectSchema(iter, inputSchema, to) Review Comment: no not really.. `HoodieCatalystExpressionUtils.generateUnsafeProjection` checks that.
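The shape of `appendPartitionAndProject` in the diff above can be sketched as follows, with plain Python lists and dicts standing in for `InternalRow` and `StructType` (illustrative, not Spark code): each data row is joined with the constant partition values, then the combined row is projected down to the output schema.

```python
def append_partition_and_project(rows, input_fields, partition_fields,
                                 output_fields, partition_values):
    """Join each data row with the (constant) partition values, then project
    the combined row to the output schema. Sketch only; the real code builds
    a JoinedRow and an UnsafeProjection."""
    for row in rows:
        combined = dict(zip(input_fields, row))
        combined.update(dict(zip(partition_fields, partition_values)))
        # Projection selects and orders the output columns.
        yield [combined[f] for f in output_fields]

rows = iter([[1, "a"], [2, "b"]])
out = list(append_partition_and_project(
    rows,
    input_fields=["id", "name"],
    partition_fields=["dt"],
    output_fields=["id", "dt"],
    partition_values=["2023-11-08"]))
print(out)  # -> [[1, '2023-11-08'], [2, '2023-11-08']]
```

When the partition schema is empty the join step is a no-op, which is why the Scala code short-circuits to a plain projection in that branch.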
Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]
hudi-bot commented on PR #10010: URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801060357 ## CI report: * 8f048d83427375dcc856ef78872a3d8247c9390f UNKNOWN
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1801060265 ## CI report: * 4dcb8f7bea46847202c2444e1a99901484239f4f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20718) * bb60d3f2fe5737fc43a700bcc6c37806fe48868a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20722)
Re: [PR] [MINOR] Fix npe for get internal schema [hudi]
hudi-bot commented on PR #9984: URL: https://github.com/apache/hudi/pull/9984#issuecomment-1801060193 ## CI report: * 23eb3d5bd578bffbd1165f7e178f391ce0056cb9 UNKNOWN * 2fb3eb51c728c5d3a9bdd725e77006a5141cc36f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20703)
Re: [I] [SUPPORT] Can not extract Partition Path with conf populateMetaFields set false and dropPartitionColumns set true [hudi]
zyl891229 commented on issue #9991: URL: https://github.com/apache/hudi/issues/9991#issuecomment-1801056154 > @zyl891229 Yes you are right, there is an issue with the bulk_insert operation type along with the combination of these two things. Although upsert/insert is running fine, you may use that. I confirmed both cases are failing, when using one partition col or two partition cols > > JIRA to track - https://issues.apache.org/jira/browse/HUDI-7040 > > Reproducible code - > > ``` > spark = get_spark_session(spark_version="3.2", hudi_version="0.14.0") > > insert_df = get_insert_df(spark, 10) > > hudi_configs = { > "hoodie.table.name": TABLE_NAME, > "hoodie.datasource.write.recordkey.field": "UUID", > "hoodie.datasource.write.precombine.field": "Name", > "hoodie.datasource.write.partitionpath.field": "Company", > "hoodie.datasource.write.operation": "bulk_insert", > "hoodie.datasource.write.hive_style_partitioning": "true", > "hoodie.populate.meta.fields": "false", > "hoodie.datasource.write.drop.partition.columns": "true" > } > > insert_df.write.format("hudi").mode("append").options(**hudi_configs).save(PATH) > ``` Thank you for your reply. Is there any idea or workaround you can provide? I will fix this problem in our fork first. We need to drop the unused columns to minimize storage space and save cost.
Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]
hudi-bot commented on PR #10008: URL: https://github.com/apache/hudi/pull/10008#issuecomment-1801053296 ## CI report: * b781cdd3f8a6ac42ff96eacd7c1ec4c132106dd9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20715)
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1801053081 ## CI report: * d10a137bff419d0d4befb5dac8380ac0bf0f12f8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20705) * 4dcb8f7bea46847202c2444e1a99901484239f4f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20718) * bb60d3f2fe5737fc43a700bcc6c37806fe48868a UNKNOWN
Re: [PR] [MINOR] Fix npe for get internal schema [hudi]
hudi-bot commented on PR #9984: URL: https://github.com/apache/hudi/pull/9984#issuecomment-1801052829 ## CI report: * 23eb3d5bd578bffbd1165f7e178f391ce0056cb9 UNKNOWN * 2fb3eb51c728c5d3a9bdd725e77006a5141cc36f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20703)
Re: [PR] [MINOR] Fix npe for get internal schema [hudi]
watermelon12138 commented on PR #9984: URL: https://github.com/apache/hudi/pull/9984#issuecomment-1800970902 @hudi-bot run azure
Re: [PR] [HUDI-6508] Fix compile errors with JDK11 [hudi]
bvaradar commented on PR #9300: URL: https://github.com/apache/hudi/pull/9300#issuecomment-1800965782 @Zouxxyy : Can you fix the merge conflicts?
Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]
stream2000 commented on code in PR #10010: URL: https://github.com/apache/hudi/pull/10010#discussion_r1385925628 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsFileSystemReporter.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.metrics; + +import org.apache.hudi.common.config.SerializableConfiguration; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.util.JsonUtils; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.config.HoodieWriteConfig; + +import com.codahale.metrics.MetricRegistry; +import org.apache.hadoop.fs.FSDataOutputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.Map; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.stream.Collectors; + +public class MetricsFileSystemReporter extends MetricsReporter { + + private static final Logger LOG = LoggerFactory.getLogger(MetricsFileSystemReporter.class); + private MetricRegistry metricRegistry; + private SerializableConfiguration hadoopConf; + private String metricsPath; + private HoodieWriteConfig config; + private FileSystem fs; + private ScheduledExecutorService executor; + private static final String META_FOLDER_NAME = "/.hoodie"; + private static final String METRICS_FOLDER_NAME = "/metrics"; + private static final String METRICS_FILE_NAME = "_metrics.json"; Review Comment: Who is going to delete the metrics file? If we report the metrics multiple times for the same table, will they overwrite each other?
Re: [PR] [MINOR] Fix npe for get internal schema [hudi]
watermelon12138 commented on code in PR #9984: URL: https://github.com/apache/hudi/pull/9984#discussion_r1385921027 ## hudi-common/src/main/java/org/apache/hudi/common/util/InternalSchemaCache.java: ## @@ -217,7 +217,11 @@ public static InternalSchema getInternalSchemaByVersionId(long versionId, String } InternalSchema fileSchema = InternalSchemaUtils.searchSchema(versionId, SerDeHelper.parseSchemas(latestHistorySchema)); // step3: -return fileSchema.isEmptySchema() ? AvroInternalSchemaConverter.convert(HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(avroSchema))) : fileSchema; +return fileSchema.isEmptySchema() +? StringUtils.isNullOrEmpty(avroSchema) + ? InternalSchema.getEmptyInternalSchema() Review Comment: @danny0405 Yes, some users found this problem in the upgrade scenario (0.12.3 -> 0.14).
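The fallback order in the diff above can be modeled outside Hudi. Below is a minimal Python sketch of the decision only; `EMPTY`, `convert_avro`, and `resolve_schema` are stub names standing in for `InternalSchema.getEmptyInternalSchema()`, `AvroInternalSchemaConverter.convert(...)`, and `getInternalSchemaByVersionId`'s step 3, not Hudi's actual API:

```python
# Model of step 3: use the file schema when history has it; otherwise fall
# back to the commit's Avro schema, guarding against a null/empty Avro schema
# (the upgrade case where the pre-fix code hit an NPE/parse failure).
EMPTY = object()  # stand-in for InternalSchema.getEmptyInternalSchema()

def convert_avro(avro_schema):
    # Stub for AvroInternalSchemaConverter.convert(addMetadataFields(parse(...)))
    return ("converted", avro_schema)

def resolve_schema(file_schema, avro_schema):
    """Return the schema to use without ever parsing a missing Avro schema."""
    if file_schema is not EMPTY:
        return file_schema            # found in the history schema: use it
    if not avro_schema:               # None or "" -> no commit schema yet
        return EMPTY                  # guarded fallback added by the fix
    return convert_avro(avro_schema)  # normal fallback path
```

Under this model, an upgraded table whose instant carries no Avro schema now resolves to the empty internal schema instead of throwing.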
[jira] [Updated] (HUDI-7049) Implement File System-based Metrics Reporter
[ https://issues.apache.org/jira/browse/HUDI-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7049: - Labels: pull-request-available (was: ) > Implement File System-based Metrics Reporter > > > Key: HUDI-7049 > URL: https://issues.apache.org/jira/browse/HUDI-7049 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ma Jian >Priority: Major > Labels: pull-request-available > > In addition to real-time monitoring metrics, Hudi also has some result > metrics, such as IO for clustering reads and writes. These metrics are > meaningful for continuously observing the table service status. > However, the existing metrics reporter either outputs to the console or > memory without persistence, or it outputs to another metrics server, > requiring complex environment setup. We hope to provide a simple persistent > reporter where users can specify that the metrics be stored in the file > system in JSON format. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]
majian1998 opened a new pull request, #10010: URL: https://github.com/apache/hudi/pull/10010 In addition to real-time monitoring metrics, Hudi also has some result metrics, such as IO for clustering reads and writes. These metrics are meaningful for continuously observing the table service status. However, the existing metrics reporter either outputs to the console or memory without persistence, or it outputs to another metrics server, requiring complex environment setup. We hope to provide a simple persistent reporter where users can specify that the metrics be stored in the file system in JSON format. Ideally, we planned to update the latest version of metrics to the file system by calling shutdown through a shutdown hook when finishing. However, at that point, the Hudi file system has already closed the connection pool, making it impossible to write to the file. Therefore, we update the file by actively calling the shutdown function when finishing. Currently, in HoodieSparkSqlWriter.cleanup(), the shutdown function is actively called, which means metrics are reported at the end of the write process. By doing the same in the table service, we can achieve the same effect. ### Change Logs Provides a file system-based metrics reporter ### Impact Some parameters related to the reporter: For example, in hoodie.metrics.reporter.type, FILESYSTEM has been added. And FILESYSTEM specifies the address, naming, and whether to enable scheduled writing of the metrics file. ### Risk level (write none, low medium or high below) LOW ### Documentation Update Metrics report type supports FILESYSTEM Updated parameters: hoodie.metrics.reporter.type, FILESYSTEM has been added. New parameters: hoodie.metrics.filesystem.reporter.path - The path for persisting Hudi storage metrics files. hoodie.metrics.filesystem.metric.prefix - The prefix for Hudi storage metrics persistence file names. 
hoodie.metrics.filesystem.overwrite.file - Whether to overwrite the same metrics file for the same table. hoodie.metrics.filesystem.schedule.enable - Whether to enable scheduled output of metrics to the file system. Default is off; when disabled, only the final result is written to the file system. hoodie.metrics.filesystem.report.period.seconds - File system reporting period in seconds. Defaults to 60. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
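A hedged sketch of how a writer job might opt into the proposed reporter. The `hoodie.metrics.filesystem.*` keys are taken from the PR description above and may change during review; the bucket path and prefix are hypothetical:

```python
# Write options enabling the proposed FILESYSTEM metrics reporter.
# Key names mirror the PR text; values here are illustrative only.
metrics_options = {
    "hoodie.metrics.on": "true",                      # existing Hudi switch
    "hoodie.metrics.reporter.type": "FILESYSTEM",     # new type added by this PR
    # Where the JSON metrics files are persisted (hypothetical path):
    "hoodie.metrics.filesystem.reporter.path": "s3://bucket/hudi/metrics",
    "hoodie.metrics.filesystem.metric.prefix": "orders_table",
    "hoodie.metrics.filesystem.overwrite.file": "true",
    "hoodie.metrics.filesystem.schedule.enable": "false",  # final result only
    "hoodie.metrics.filesystem.report.period.seconds": "60",
}
# df.write.format("hudi").options(**hudi_configs).options(**metrics_options).save(path)
```

With `schedule.enable` off, the reporter would only flush on shutdown, which matches the PR's note that metrics are reported when `HoodieSparkSqlWriter.cleanup()` actively calls shutdown.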
Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]
hudi-bot commented on PR #9966: URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800950656 ## CI report: * fa23cc909cdb9a3381c6646b3446ad44bd7b66d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20607) * 0a938b13fb76cbba8efce7bfc8edd5927094db67 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20721)
Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]
hudi-bot commented on PR #9966: URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800945396 ## CI report: * fa23cc909cdb9a3381c6646b3446ad44bd7b66d0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20607) * 0a938b13fb76cbba8efce7bfc8edd5927094db67 UNKNOWN
Re: [PR] [HUDI-7038] RunCompactionProcedure support limit parameter [hudi]
ksmou commented on PR #: URL: https://github.com/apache/hudi/pull/#issuecomment-1800945369 > @ksmou Can you also update the website about this new param? Got it.
(hudi) branch master updated: [HUDI-7033] Fix read error for schema evolution + partition value extraction (#9994)
This is an automated email from the ASF dual-hosted git repository. vbalaji pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new a4fa3451916 [HUDI-7033] Fix read error for schema evolution + partition value extraction (#9994) a4fa3451916 is described below commit a4fa3451916de11dc082792076b62013586dadaf Author: voonhous AuthorDate: Wed Nov 8 10:49:48 2023 +0800 [HUDI-7033] Fix read error for schema evolution + partition value extraction (#9994) --- .../org/apache/hudi/HoodieDataSourceHelper.scala | 61 +- .../apache/hudi/TestHoodieDataSourceHelper.scala | 54 +++ .../org/apache/spark/sql/hudi/TestSpark3DDL.scala | 41 +++ 3 files changed, 154 insertions(+), 2 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala index eb8ddfdf870..4add21b5b8d 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala @@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.PredicateHelper import org.apache.spark.sql.execution.datasources.PartitionedFile import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat -import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.sources.{And, Filter, Or} import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch @@ -58,7 +58,7 @@ object HoodieDataSourceHelper extends PredicateHelper with SparkAdapterSupport { dataSchema = dataSchema, partitionSchema = partitionSchema, requiredSchema = requiredSchema, - filters = filters, + filters = if (appendPartitionValues) 
getNonPartitionFilters(filters, dataSchema, partitionSchema) else filters, options = options, hadoopConf = hadoopConf ) @@ -98,4 +98,61 @@ object HoodieDataSourceHelper extends PredicateHelper with SparkAdapterSupport { deserializer.deserialize(avroRecord).get.asInstanceOf[InternalRow] } } + + def getNonPartitionFilters(filters: Seq[Filter], dataSchema: StructType, partitionSchema: StructType): Seq[Filter] = { +filters.flatMap(f => { + if (f.references.intersect(partitionSchema.fields.map(_.name)).nonEmpty) { +extractPredicatesWithinOutputSet(f, dataSchema.fieldNames.toSet) + } else { +Some(f) + } +}) + } + + /** + * Heavily adapted from {@see org.apache.spark.sql.catalyst.expressions.PredicateHelper#extractPredicatesWithinOutputSet} + * Method is adapted to work with Filters instead of Expressions + * + * @return + */ + def extractPredicatesWithinOutputSet(condition: Filter, + outputSet: Set[String]): Option[Filter] = condition match { +case And(left, right) => + val leftResultOptional = extractPredicatesWithinOutputSet(left, outputSet) + val rightResultOptional = extractPredicatesWithinOutputSet(right, outputSet) + (leftResultOptional, rightResultOptional) match { +case (Some(leftResult), Some(rightResult)) => Some(And(leftResult, rightResult)) +case (Some(leftResult), None) => Some(leftResult) +case (None, Some(rightResult)) => Some(rightResult) +case _ => None + } + +// The Or predicate is convertible when both of its children can be pushed down. +// That is to say, if one/both of the children can be partially pushed down, the Or +// predicate can be partially pushed down as well. +// +// Here is an example used to explain the reason. +// Let's say we have +// condition: (a1 AND a2) OR (b1 AND b2), +// outputSet: AttributeSet(a1, b1) +// a1 and b1 is convertible, while a2 and b2 is not. +// The predicate can be converted as +// (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2) +// As per the logical in And predicate, we can push down (a1 OR b1). 
+case Or(left, right) => + for { +lhs <- extractPredicatesWithinOutputSet(left, outputSet) +rhs <- extractPredicatesWithinOutputSet(right, outputSet) + } yield Or(lhs, rhs) + +// Here we assume all the `Not` operators is already below all the `And` and `Or` operators +// after the optimization rule `BooleanSimplification`, so that we don't need to handle the +// `Not` operators here. +case other => + if (other.references.toSet.subsetOf(outputSet)) { +Some(other) + } else { +None + } + } } diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieDataSour
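The And/Or extraction rules in the commit above can be exercised without Spark. Below is a small Python model of `extractPredicatesWithinOutputSet`; the `Leaf`/`And`/`Or` classes are hypothetical stand-ins for Spark's `sources.Filter` hierarchy, not Hudi or Spark code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Leaf:
    ref: str          # column referenced by an atomic filter, e.g. EqualTo("a1", 1)

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

def extract_within(cond, output_set):
    """Keep only the part of cond referencing columns in output_set; None if
    nothing can be pushed down (mirrors extractPredicatesWithinOutputSet)."""
    if isinstance(cond, And):
        l = extract_within(cond.left, output_set)
        r = extract_within(cond.right, output_set)
        if l and r:
            return And(l, r)
        return l or r                  # And may be partially pushed down
    if isinstance(cond, Or):
        l = extract_within(cond.left, output_set)
        r = extract_within(cond.right, output_set)
        # Or is only convertible when BOTH children yield something
        return Or(l, r) if (l and r) else None
    return cond if cond.ref in output_set else None
```

The comment block in the diff works through the same case: for `(a1 AND a2) OR (b1 AND b2)` with output set `{a1, b1}`, only `(a1 OR b1)` survives.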
Re: [PR] [HUDI-7033] Fix read error for schema evolution + partition value ext… [hudi]
bvaradar merged PR #9994: URL: https://github.com/apache/hudi/pull/9994
[jira] [Created] (HUDI-7049) Implement File System-based Metrics Reporter
Ma Jian created HUDI-7049: - Summary: Implement File System-based Metrics Reporter Key: HUDI-7049 URL: https://issues.apache.org/jira/browse/HUDI-7049 Project: Apache Hudi Issue Type: New Feature Reporter: Ma Jian In addition to real-time monitoring metrics, Hudi also has some result metrics, such as IO for clustering reads and writes. These metrics are meaningful for continuously observing the table service status. However, the existing metrics reporter either outputs to the console or memory without persistence, or it outputs to another metrics server, requiring complex environment setup. We hope to provide a simple persistent reporter where users can specify that the metrics be stored in the file system in JSON format.
[jira] [Closed] (HUDI-7036) Reduce the driver memory pressure during buildProfile
[ https://issues.apache.org/jira/browse/HUDI-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu closed HUDI-7036. - Resolution: Fixed > Reduce the driver memory pressure during buildProfile > - > > Key: HUDI-7036 > URL: https://issues.apache.org/jira/browse/HUDI-7036 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > > The record distribution should be based on (partition_path, instant_time, > file_id), instead of (partition_path, instant_time, file_id, position).
Re: [PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]
hudi-bot commented on PR #10009: URL: https://github.com/apache/hudi/pull/10009#issuecomment-1800913449 ## CI report: * 8242310dbf950c9ece07c4b4fe5593e70e0bedf4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20719)
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1800913385 ## CI report: * d10a137bff419d0d4befb5dac8380ac0bf0f12f8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20705) * 4dcb8f7bea46847202c2444e1a99901484239f4f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20718)
Re: [PR] [HUDI-6993] Support Flink 1.18 [hudi]
danny0405 commented on code in PR #9949: URL: https://github.com/apache/hudi/pull/9949#discussion_r1385884858 ## packaging/bundle-validation/ci_run.sh: ## @@ -162,6 +162,8 @@ else HUDI_FLINK_BUNDLE_NAME=hudi-flink1.16-bundle elif [[ ${FLINK_PROFILE} == 'flink1.17' ]]; then HUDI_FLINK_BUNDLE_NAME=hudi-flink1.17-bundle + elif [[ ${FLINK_PROFILE} == 'flink1.18' ]]; then +HUDI_FLINK_BUNDLE_NAME=hudi-flink1.18-bundle Review Comment: The `IMAGE_TAG` should be updated as flink 1.18 after we uploaded the docker image. cc @codope for the help ~
Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]
voonhous commented on PR #9966: URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800910433 Done
Re: [PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]
hudi-bot commented on PR #10009: URL: https://github.com/apache/hudi/pull/10009#issuecomment-1800908265 ## CI report: * 8242310dbf950c9ece07c4b4fe5593e70e0bedf4 UNKNOWN
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1800908197 ## CI report: * d10a137bff419d0d4befb5dac8380ac0bf0f12f8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20705) * 4dcb8f7bea46847202c2444e1a99901484239f4f UNKNOWN
Re: [PR] [HUDI-7045] fix evolution by using legacy ff for reader [hudi]
hudi-bot commented on PR #10007: URL: https://github.com/apache/hudi/pull/10007#issuecomment-1800900579 ## CI report: * 93c52c5602738f2d39dd5942d5e5cde940f843f3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20713)
Re: [PR] [HUDI-7042] Fix new filegroup reader [hudi]
danny0405 commented on code in PR #10003: URL: https://github.com/apache/hudi/pull/10003#discussion_r1385866309 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -201,19 +208,65 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, length, shouldUseRecordPosition) reader.initRecordIterators() - reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala +// Append partition values to rows and project to output schema +appendPartitionAndProject( + reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala, + requiredSchemaWithMandatory, + partitionSchema, + outputSchema, + partitionValues) + } + + private def appendPartitionAndProject(iter: Iterator[InternalRow], +inputSchema: StructType, +partitionSchema: StructType, +to: StructType, +partitionValues: InternalRow): Iterator[InternalRow] = { +if (partitionSchema.isEmpty) { + projectSchema(iter, inputSchema, to) Review Comment: Do we still project the rows if the `iter` is already in the required output schema?
[jira] [Updated] (HUDI-7048) Fix checkpoint loss issue when changing MOR to COW in streamer
[ https://issues.apache.org/jira/browse/HUDI-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7048: - Labels: pull-request-available (was: ) > Fix checkpoint loss issue when changing MOR to COW in streamer > -- > > Key: HUDI-7048 > URL: https://issues.apache.org/jira/browse/HUDI-7048 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0
Re: [PR] [HUDI-7048] Fix checkpoint loss issue when changing MOR to COW in streamer [hudi]
danny0405 merged PR #10001: URL: https://github.com/apache/hudi/pull/10001
[jira] [Closed] (HUDI-7048) Fix checkpoint loss issue when changing MOR to COW in streamer
[ https://issues.apache.org/jira/browse/HUDI-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7048. Resolution: Fixed Fixed via master branch: eeec775f3803cf231f041196aa11ca1c83228ea8 > Fix checkpoint loss issue when changing MOR to COW in streamer > -- > > Key: HUDI-7048 > URL: https://issues.apache.org/jira/browse/HUDI-7048 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0
(hudi) branch master updated (fe554d89460 -> eeec775f380)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from fe554d89460 [HUDI-7038] RunCompactionProcedure support limit parameter (#) add eeec775f380 [HUDI-7048] Fix checkpoint loss issue when changing MOR to COW in streamer (#10001) No new revisions were added by this update. Summary of changes: .../apache/hudi/utilities/streamer/StreamSync.java | 5 +- .../deltastreamer/TestHoodieDeltaStreamer.java | 68 ++ 2 files changed, 71 insertions(+), 2 deletions(-)
[jira] [Created] (HUDI-7048) Fix checkpoint loss issue when changing MOR to COW in streamer
Danny Chen created HUDI-7048: Summary: Fix checkpoint loss issue when changing MOR to COW in streamer Key: HUDI-7048 URL: https://issues.apache.org/jira/browse/HUDI-7048 Project: Apache Hudi Issue Type: Improvement Components: deltastreamer Reporter: Danny Chen Fix For: 1.0.0
[jira] [Closed] (HUDI-7038) RunCompactionProcedure support limit parameter
[ https://issues.apache.org/jira/browse/HUDI-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7038. Resolution: Fixed Fixed via master branch: fe554d894601e20e95b10dca86bbe1ee71df4856 > RunCompactionProcedure support limit parameter > -- > > Key: HUDI-7038 > URL: https://issues.apache.org/jira/browse/HUDI-7038 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction >Reporter: kwang >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0
(hudi) branch master updated (f33459b2ea2 -> fe554d89460)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from f33459b2ea2 [HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details (#1)
     add fe554d89460 [HUDI-7038] RunCompactionProcedure support limit parameter (#)

No new revisions were added by this update.

Summary of changes:
 .../procedures/RunCompactionProcedure.scala        |  6 ++-
 .../hudi/procedure/TestCompactionProcedure.scala   | 47 ++
 2 files changed, 51 insertions(+), 2 deletions(-)
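The `limit` parameter added by HUDI-7038 follows the same pattern discussed in the related ShowSavepointsProcedure review in this thread: truncate the instant list with `Stream.limit` only when a limit was actually supplied. A minimal, self-contained Java sketch of that pattern (class and method names here are hypothetical stand-ins, not Hudi's actual API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class LimitSketch {
    // Return at most `limit` instants from a reverse-ordered (newest-first) list.
    // An empty Optional means "no limit": the full list is returned unchanged.
    static List<String> applyLimit(List<String> reverseOrderedInstants, Optional<Integer> limit) {
        return limit
                .map(n -> reverseOrderedInstants.stream()
                        .limit(n)                      // keep only the newest n instants
                        .collect(Collectors.toList()))
                .orElse(reverseOrderedInstants);
    }

    public static void main(String[] args) {
        List<String> instants = Arrays.asList("20231108", "20231107", "20231106");
        System.out.println(applyLimit(instants, Optional.of(2)));   // [20231108, 20231107]
        System.out.println(applyLimit(instants, Optional.empty())); // [20231108, 20231107, 20231106]
    }
}
```

Because the list is already reverse-ordered, `limit(n)` naturally keeps the n most recent instants, which is the behavior a user expects from a `show_*`-style procedure.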
Re: [PR] [HUDI-7038] RunCompactionProcedure support limit parameter [hudi]
danny0405 commented on PR #:
URL: https://github.com/apache/hudi/pull/#issuecomment-1800878801

@ksmou Can you also update the website about this new param?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7038] RunCompactionProcedure support limit parameter [hudi]
danny0405 merged PR #: URL: https://github.com/apache/hudi/pull/
[jira] [Updated] (HUDI-7038) RunCompactionProcedure support limit parameter
[ https://issues.apache.org/jira/browse/HUDI-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7038:
-----------------------------
Fix Version/s: 1.0.0

> RunCompactionProcedure support limit parameter
> ----------------------------------------------
>
> Key: HUDI-7038
> URL: https://issues.apache.org/jira/browse/HUDI-7038
> Project: Apache Hudi
> Issue Type: Improvement
> Components: compaction
> Reporter: kwang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]
danny0405 commented on PR #9966: URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800876294 @voonhous Can you rebase with the latest master to resolve the test failures?
[jira] [Updated] (HUDI-7039) PartialUpdateAvroPayload preCombine failed need show details
[ https://issues.apache.org/jira/browse/HUDI-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7039:
-----------------------------
Fix Version/s: 1.0.0

> PartialUpdateAvroPayload preCombine failed need show details
> ------------------------------------------------------------
>
> Key: HUDI-7039
> URL: https://issues.apache.org/jira/browse/HUDI-7039
> Project: Apache Hudi
> Issue Type: Improvement
> Components: core
> Reporter: xy
> Assignee: xy
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> currently PartialUpdateAvroPayload preCombine would not show details even when failed
[jira] [Closed] (HUDI-7039) PartialUpdateAvroPayload preCombine failed need show details
[ https://issues.apache.org/jira/browse/HUDI-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7039.
----------------------------
Resolution: Fixed

Fixed via master branch: f33459b2ea2ae240b49dcf94d8e7715f57c80c5d

> PartialUpdateAvroPayload preCombine failed need show details
> ------------------------------------------------------------
>
> Key: HUDI-7039
> URL: https://issues.apache.org/jira/browse/HUDI-7039
> Project: Apache Hudi
> Issue Type: Improvement
> Components: core
> Reporter: xy
> Assignee: xy
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> currently PartialUpdateAvroPayload preCombine would not show details even when failed
(hudi) branch master updated: [HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details (#10000)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new f33459b2ea2 [HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details (#1)

f33459b2ea2 is described below

commit f33459b2ea2ae240b49dcf94d8e7715f57c80c5d
Author: xuzifu666
AuthorDate: Wed Nov 8 09:50:03 2023 +0800

    [HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details (#1)

    Co-authored-by: xuyu <11161...@vivo.com>
---
 .../java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java b/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java
index 27e744c4925..91b66e004e5 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java
@@ -29,6 +29,8 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 import java.io.IOException;
 import java.util.List;
@@ -117,6 +119,8 @@ import java.util.Properties;
  */
 public class PartialUpdateAvroPayload extends OverwriteNonDefaultsWithLatestAvroPayload {

+  private static final Logger LOG = LoggerFactory.getLogger(PartialUpdateAvroPayload.class);
+
   public PartialUpdateAvroPayload(GenericRecord record, Comparable orderingVal) {
     super(record, orderingVal);
   }
@@ -141,6 +145,7 @@
           shouldPickOldRecord ? oldValue.orderingVal : this.orderingVal);
       }
     } catch (Exception ex) {
+      LOG.warn("PartialUpdateAvroPayload precombine failed with ", ex);
       return this;
     }
     return this;
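The one-line fix above amounts to "log the cause before falling back": previously the `catch` block swallowed the exception and silently returned the current payload. A standalone illustration of that pattern follows (types simplified to `String`; `preCombine` here is a hypothetical stand-in, not Hudi's real signature, which operates on Avro payloads):

```java
import java.util.logging.Logger;

public class PreCombineSketch {
    private static final Logger LOG = Logger.getLogger(PreCombineSketch.class.getName());

    // Pick the "winning" value of two candidates. On any failure, log the
    // exception (the substance of the HUDI-7039 fix) and fall back to the
    // current value instead of failing silently.
    static String preCombine(String current, String old) {
        try {
            if (old == null) {
                throw new IllegalStateException("no previous value to merge");
            }
            // Keep whichever value is "larger" (stand-in for ordering-value comparison).
            return current.compareTo(old) >= 0 ? current : old;
        } catch (Exception ex) {
            LOG.warning("precombine failed with " + ex);
            return current; // same fallback as before, but now with a diagnostic trail
        }
    }

    public static void main(String[] args) {
        System.out.println(preCombine("b", "a"));  // b
        System.out.println(preCombine("a", null)); // logs a warning, prints a
    }
}
```

The behavior is unchanged on the happy path; the only difference is that a merge failure now leaves a warning in the logs rather than disappearing.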
Re: [PR] [HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details [hudi]
danny0405 merged PR #1: URL: https://github.com/apache/hudi/pull/1
Re: [PR] [MINOR] Fix npe for get internal schema [hudi]
danny0405 commented on code in PR #9984:
URL: https://github.com/apache/hudi/pull/9984#discussion_r1385855432

## hudi-common/src/main/java/org/apache/hudi/common/util/InternalSchemaCache.java:

@@ -217,7 +217,11 @@ public static InternalSchema getInternalSchemaByVersionId(long versionId, String
     }
     InternalSchema fileSchema = InternalSchemaUtils.searchSchema(versionId, SerDeHelper.parseSchemas(latestHistorySchema));
     // step3:
-    return fileSchema.isEmptySchema() ? AvroInternalSchemaConverter.convert(HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(avroSchema))) : fileSchema;
+    return fileSchema.isEmptySchema()
+        ? StringUtils.isNullOrEmpty(avroSchema)
+          ? InternalSchema.getEmptyInternalSchema()

Review Comment:
   Is it because the version upgrade or something? Is the null avro schema coming from an old version Hudi table?
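The nested ternary under review guards against a null/empty stored Avro schema string before handing it to `Schema.Parser`, which would otherwise throw an NPE. A simplified sketch of that control flow (plain `String` stand-ins for Hudi's `InternalSchema` types; all names here are hypothetical):

```java
public class SchemaFallbackSketch {
    // Stand-in for InternalSchema.getEmptyInternalSchema().
    static final String EMPTY_SCHEMA = "<empty>";

    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    // If the file schema resolved from history is empty, fall back to
    // converting the Avro schema -- but only when one is actually present;
    // otherwise return the empty schema instead of NPE-ing in the parser.
    static String resolve(String fileSchema, String avroSchema) {
        if (!isNullOrEmpty(fileSchema)) {
            return fileSchema;
        }
        return isNullOrEmpty(avroSchema)
                ? EMPTY_SCHEMA
                : "converted:" + avroSchema; // stand-in for AvroInternalSchemaConverter.convert(...)
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null));         // <empty>
        System.out.println(resolve(null, "record"));     // converted:record
        System.out.println(resolve("fileSchema", null)); // fileSchema
    }
}
```

The reviewer's question stands regardless of the sketch: the interesting part is *why* the Avro schema can be null at all (e.g. a table written by an older Hudi version), not just how to guard against it.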
Re: [PR] [HUDI-6217] Support handing '_hoodie_operation' meta field for Spark snapshot source [hudi]
danny0405 commented on PR #8721: URL: https://github.com/apache/hudi/pull/8721#issuecomment-1800869553 @beyond1920 You can just take it.
(hudi) branch master updated (7ce62fc5793 -> 3d8e72a20fe)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from 7ce62fc5793 [MINOR] Remove rocksdb version from m1 profile (#10006)
     add 3d8e72a20fe [HUDI-7010] Build clustering group reduces redundant traversals (#9957)

No new revisions were added by this update.

Summary of changes:
 .../PartitionAwareClusteringPlanStrategy.java      |  5
 ...TestSparkBuildClusteringGroupsForPartition.java | 30 ++
 2 files changed, 35 insertions(+)
[jira] [Closed] (HUDI-7010) Build clustering group reduces redundant traversals
[ https://issues.apache.org/jira/browse/HUDI-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7010.
----------------------------
Resolution: Fixed

Fixed via master branch: 3d8e72a20fe161839815bc8143b277c93b3c93eb

> Build clustering group reduces redundant traversals
> ---------------------------------------------------
>
> Key: HUDI-7010
> URL: https://issues.apache.org/jira/browse/HUDI-7010
> Project: Apache Hudi
> Issue Type: Improvement
> Components: clustering
> Reporter: kwang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.1
[jira] [Updated] (HUDI-7010) Build clustering group reduces redundant traversals
[ https://issues.apache.org/jira/browse/HUDI-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7010:
-----------------------------
Fix Version/s: 0.14.1

> Build clustering group reduces redundant traversals
> ---------------------------------------------------
>
> Key: HUDI-7010
> URL: https://issues.apache.org/jira/browse/HUDI-7010
> Project: Apache Hudi
> Issue Type: Improvement
> Components: clustering
> Reporter: kwang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.1
Re: [PR] [HUDI-7010] Build clustering group reduces redundant traversals [hudi]
danny0405 merged PR #9957: URL: https://github.com/apache/hudi/pull/9957
[jira] [Updated] (HUDI-7047) Fix incremental queries using new file format
[ https://issues.apache.org/jira/browse/HUDI-7047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7047:
---------------------------------
Labels: pull-request-available (was: )

> Fix incremental queries using new file format
> ---------------------------------------------
>
> Key: HUDI-7047
> URL: https://issues.apache.org/jira/browse/HUDI-7047
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> [https://github.com/apache/hudi/pull/9888] introduced some issues that cause reads to fail in some tests when the new file format is enabled
[PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]
jonvex opened a new pull request, #10009:
URL: https://github.com/apache/hudi/pull/10009

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Created] (HUDI-7047) Fix incremental queries using new file format
Jonathan Vexler created HUDI-7047:
-------------------------------------

Summary: Fix incremental queries using new file format
Key: HUDI-7047
URL: https://issues.apache.org/jira/browse/HUDI-7047
Project: Apache Hudi
Issue Type: Bug
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler

[https://github.com/apache/hudi/pull/9888] introduced some issues that cause reads to fail in some tests when the new file format is enabled
Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]
hudi-bot commented on PR #10008:
URL: https://github.com/apache/hudi/pull/10008#issuecomment-1800857372

## CI report:

* b781cdd3f8a6ac42ff96eacd7c1ec4c132106dd9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20715)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]
hudi-bot commented on PR #10008:
URL: https://github.com/apache/hudi/pull/10008#issuecomment-1800851584

## CI report:

* b781cdd3f8a6ac42ff96eacd7c1ec4c132106dd9 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[jira] [Assigned] (HUDI-7046) Fix partial merging logic based on projected schema
[ https://issues.apache.org/jira/browse/HUDI-7046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-7046:
-------------------------------
Assignee: Ethan Guo

> Fix partial merging logic based on projected schema
> ---------------------------------------------------
>
> Key: HUDI-7046
> URL: https://issues.apache.org/jira/browse/HUDI-7046
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 1.0.0
>
> When querying the table with multiple round of partial updates generating multiple log files, the partial merging logic may fail or give wrong results due to schema handling and merging logic.

-- This message was sent by Atlassian Jira (v8.20.10#820010)