[GitHub] [hudi] voonhous commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
voonhous commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1554103648 @danny0405 Can you please help take a look at this PR again? I added more tests to the PR. Thank you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] voonhous commented on a diff in pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
voonhous commented on code in PR #8755: URL: https://github.com/apache/hudi/pull/8755#discussion_r1198612870 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/StatsFileSizeProcedure.scala: ## @@ -54,8 +55,22 @@ class StatsFileSizeProcedure extends BaseProcedure with ProcedureBuilder { val globRegex = getArgValueOrDefault(args, parameters(1)).get.asInstanceOf[String] val limit: Int = getArgValueOrDefault(args, parameters(2)).get.asInstanceOf[Int] val basePath = getBasePath(table) -val fs = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build.getFs -val globPath = String.format("%s/%s/*", basePath, globRegex) +val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build +val fs = metaClient.getFs +val isTablePartitioned = metaClient.getTableConfig.isTablePartitioned +val maximumPartitionDepth = if (isTablePartitioned) metaClient.getTableConfig.getPartitionFields.get.length else 0 +val globPath = (metaClient.getTableConfig.isTablePartitioned, globRegex) match { Review Comment: Done!
[GitHub] [hudi] hudi-bot commented on pull request #8759: Add metrics counters for compaction start/stop events.
hudi-bot commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1554092885 ## CI report: * fbdd1d299bdf653c65f21c374e0aada9b768318f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17198) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan opened a new pull request, #8764: [HUDI-6240] Adding default value as CORRECTED for rebase modes in write and read for avro
nsivabalan opened a new pull request, #8764: URL: https://github.com/apache/hudi/pull/8764 ### Change Logs Adding default value as "CORRECTED" for rebase modes in write and read for avro, to be used when encountering timestamps older than 1970. ### Impact Will automatically work out of the box, unless the user prefers to override them. ### Risk level (write none, low medium or high below) low. ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
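For context, the Spark-side rebase-mode properties involved look like the following. The exact config keys this PR wires the CORRECTED default into are in the PR diff, so treat this spark-submit fragment as an illustrative assumption rather than Hudi's actual defaults:

```shell
spark-submit \
  --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar
```

With CORRECTED, pre-1970 (and pre-Gregorian) timestamps are written and read as-is in the Proleptic Gregorian calendar instead of failing or being rebased to the legacy hybrid calendar.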
svn commit: r61966 - /release/hudi/KEYS
Author: yihua Date: Fri May 19 06:41:55 2023 New Revision: 61966 Log: Add GPG key of zhangyue19921010 Modified: release/hudi/KEYS Modified: release/hudi/KEYS == --- release/hudi/KEYS (original) +++ release/hudi/KEYS Fri May 19 06:41:55 2023 @@ -1170,4 +1170,63 @@ bTekUhOFAo/Xl12LSY0Wv5c7YEWWgbFH9qfKg5sr =pbmq -END PGP PUBLIC KEY BLOCK- +pub rsa4096 2023-05-09 [SC] [expires: 2027-05-09] + FE450FF74ABF3594AF8920603A46066B5E92F6B2 +uid [ultimate] zhangyue19921010 +sig 33A46066B5E92F6B2 2023-05-09 zhangyue19921010 +sub rsa4096 2023-05-09 [E] [expires: 2027-05-09] +sig 3A46066B5E92F6B2 2023-05-09 zhangyue19921010 + +-BEGIN PGP PUBLIC KEY BLOCK- + +mQINBGRZ5SgBEADGZQ6Ro00rzJJCKNINKfDsl4a0Jam21q1pA8mtMyzx/rSjUSbt +UTqta5im8KgUdDtJAmPxzxF/97az/SpMHEEfT+csgd+xxHuFBNkpFpgEwIty9djC +NjHgJb7pk83YeBiAblN3aMovFkUx/PotTxlWvqq6vWEp09K0I9V4zE4aYdWlwizJ +/ZAVxDvqSSH2sDCBvk7bJC2lMn42+Bb8i/M/8C+9MXXZGOe8HQZsAt1B9HEbtOV4 +nVytMhVlnmKoVlbtzHV8BPfoPNc7sriT5vM1WcqZoxIFclK9x01m32QeyOxNO5fH +euh7etB+OFpG6yoOf/ml5sgfq/njpVaNrUtd43b/c/fpW9pXAkeXYvr7XpNWsBCr +wN9XevDzuYZTk6HdDxU8XIYOuCJrCEtOcZBdhrRb5m9t2KF67ZtxFof5W4MUBcFp +ow6IEAh46syqJDAqg3zRD7G+wB8kJuOpD9yqDqk74PJ0EFA5Ib+ngiPPYalaFb+i +wCtPuekzbq075H+D2PM9XPqKNmnJNuKg+sJhRmwLForyzG9zi/oUtG37DMvoxwcD +3k715BUh0475dvV3xqcjb1vPCAw/JPW7iX9lS+k8L+9Z0TZk9tvzv6gYMpqYwS4j +RRLuBjzV9Et1hAg5ZQNHw2AGhKaWeaWA3GJzOZl4v8+irjAFu1rQLtd5RQARAQAB +tC56aGFuZ3l1ZTE5OTIxMDEwIDx6aGFuZ3l1ZTE5OTIxMDEwQGFwYWNoZS5vcmc+ +iQJUBBMBCAA+FiEE/kUP90q/NZSviSBgOkYGa16S9rIFAmRZ5SgCGwMFCQeGHekF +CwkIBwIGFQoJCAsCBBYCAwECHgECF4AACgkQOkYGa16S9rL5pQ//d9Rx71AdLMq7 +6tsPBQOpuF5IZTxDjU7iX3nC4V6/IKDBHwXJgaUA0NJlDk5IsegxPLnsVTXsVioe +u1hLljoLkYKEqkyKSSHG+RJEHgwrMXE1L4w9mIrZ/r4rnOcUfXEeIlgh+LhLN2wi +Uia9T9zsjP6yMWcAtkTZNdLx0hwf2qZ+gZgS13C6sMvGVT8lhqSKGXFiTA3pLya2 +Ambuxwf6EL4NqCxmt6qUQZDAqJjlPTpLHpNgPJtYl8i2l3h3S1L94MLgJL0IzFx9 +9g7PoicyvtstG4R44g1NE6N0kHfDkGQeqHAdDMrFrvIeGGTOst1PincoWA4SQPqy +RM5CdcU0+JlhlCdVTEkqP7UksWHVrRcsg/n5uFaJPNyfLkDe3d35we0qatchliQd 
+7wOM/ufTIBPmz0OjE2pU9wv9KuOdkIRkR1iROrYVH8kgymr4GI1xrVdyr+M3GPoz +VIUM7a6VWl8ZRW71WisiTE1z4i0WaRpvZ7HprOpvlzFpNO/4ZnOb53iQV3/XduH+ +LN8VuOOiFvhsEYVkRUazMvY5UjjuIL3gNpdeMArT8TdQxgyhINVnfH2iLq4F9ZGk +ZAO7jfE/HqGDzSXT9StnHok8RJGaB0+smboZHKvV3JduvYlSROBJhCkUKpJBgSzi +CFmzVkN345GQqvd0MOW4ejPKspoteL+5Ag0EZFnlKAEQANtA3VfPzrYFan3mbr54 ++3/7RW759w3Gb1ICVHB4aFv5QQI7+CUHn5zq346YmY2wcxm3QQfF2Prp0NsXLuHp +aMGlmalhNYUfiAjmBoagou+N3fraV88xwLN5bnYwT+/20/x3ZHHPMpMzphLteTK6 +HhE4kvez67IHpPkBNlKsz91Cl28BDsqN/F4oWcTHkTfI7hXiXJ5tx7t/BjvaAWhA +lEybEfTdJMu5jO968DwrDYBWxH991fC9kCsnu6T7TIn4oSId+Jp9MDTVCSsOBQON +67+dEh4tt4FHGFCHImg5lQJEii4no0l2jKAMEqptc7TlWwUkdgxlEG/VI33MFLY3 +svlsPpBhxqqM38ytNluc/tsmUIkJMfq9neH6IxGQOuc6FXtW0e/WUsHW15QO8wEO +xLU0Q/TpB9C5/ghL1N12teu9jCA8GYzwFjd9cgiBGEpTYcCRVik+K92LYJOj/NSg +fdJDcNPeetYbnRZUTe6wMlv320nuPy+KzIhWowSSUesBLxLUuCxqOACiomNUPYMN +g1xN7dgBeWKA3Pagu2iGEKcuuXC/r2UvwxPpZ5ceIF2dISBySdZa+NFjoR/aAUGY +XgLSKzmRSERaBbuiuaIn69Kvy5swfvi2GDrQAWDjraCEwUKHfG+JZAgH6mI6/RvI +O/eiShsC47+20Zbs+bQjrkbdABEBAAGJAjwEGAEIACYWIQT+RQ/3Sr81lK+JIGA6 +RgZrXpL2sgUCZFnlKAIbDAUJB4Yd6QAKCRA6RgZrXpL2shn2D/9KBgFECwjJes96 +u/7/Xymrc2SPw1nYaJHCn0KZmCb/3E106dDvQkscR7y5FAw8+/HkV4qjc4Cw1Ewg +XFOPr78XvHMDGwV54T5Qf8CFYq2qQhYkgTNFEwpWKt6uCQq9dtGhEn6to8lzNWD+ +IcfY+XV7uvZUP5DUbB6GhhpQ04YYComRT+QS2v6ERzrV9Yp8Qdlv6JeUjFJgi2zm +ON20SQy9Ami+tTOHheQ7yrCn+cc1nicAllZuYDf4anzQJqGw/aFqqdXYcna67eBn +mxkNoypZNgc0aLqaWrwqg21UKGmglHw516uFJTpzD/V9Xg6hI3rk80bYmNoHfH/s +SMxhkgIRSHYHVc83HlB3DvAzPSfWWtJKvrlXyyHTaIjXewnQdmF+gdCCcVaBHVZf +hgeP/Ah5rL9ig7c1Jbh4iSToFKfeYP0CTe0B4FwS2uRZnzlPevQGD2c+2eH+pmZp +mbR/av6r2QgjC0XjIPSJ/I0WZKmgeEO+c2ZEoktryWUFCA8kgH8kvIa/4LWtF8R7 +lVb3dhPPv/E2S0JUl83D2vXOeSiWV5uQexmIKjJR2i4/sgY4po0osgw22+Rnl3Uo +0oqAy4DlB31qKt9CVlrj7tNutSe3ZUYznm91e/EIWPsjOoqzV7KtTlzClOO544Jc +fljHpxb7KmeYv1gjxcM+kcLyZ/89Cw== +=rzoK +-END PGP PUBLIC KEY BLOCK-
[jira] [Created] (HUDI-6240) Add default values for rebase modes for avro to handle older timestamps
sivabalan narayanan created HUDI-6240: - Summary: Add default values for rebase modes for avro to handle older timestamps Key: HUDI-6240 URL: https://issues.apache.org/jira/browse/HUDI-6240 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: sivabalan narayanan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] eyjian commented on issue #8757: [SUPPORT] How to get a row of a primary key?
eyjian commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554076777 > You have to use upsert only to use partial update. So with Spark sql you must use merge into or update as Insert will act as insert operationType for which hudi doesn't guarantee uniqueness. [By default, if preCombineKey is provided, insert into uses upsert as the type of write operation, otherwise it uses insert](https://hudi.apache.org/cn/docs/quick-start-guide), but even adding "hoodie.datasource.write.operation = 'upsert'" has no effect.
[GitHub] [hudi] eyjian commented on issue #8757: [SUPPORT] How to get a row of a primary key?
eyjian commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554065728 Create table: ```sql CREATE TABLE `test_db`.`t21` ( `_hoodie_commit_time` STRING, `_hoodie_commit_seqno` STRING, `_hoodie_record_key` STRING, `_hoodie_partition_path` STRING, `_hoodie_file_name` STRING, `ut` STRING, `pk` BIGINT, `f0` BIGINT, `f1` BIGINT, `f2` BIGINT, `f3` BIGINT, `f4` BIGINT, `ds` BIGINT) USING hudi PARTITIONED BY (ds) TBLPROPERTIES ( 'hoodie.bucket.index.num.buckets' = '2', 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload', 'hoodie.index.type' = 'BUCKET', 'primaryKey' = 'pk', 'type' = 'mor', 'preCombineField' = 'ut', 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload', 'hoodie.archive.merge.enable' = 'true'); ```
[GitHub] [hudi] ad1happy2go commented on issue #8757: [SUPPORT] How to get a row of a primary key?
ad1happy2go commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554065593 You have to use upsert to get partial-update behavior. So with Spark SQL you must use MERGE INTO or UPDATE, as INSERT will act as the insert operation type, for which Hudi doesn't guarantee uniqueness.
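A hedged sketch of the MERGE INTO form being recommended, against the `t21` table defined earlier in this issue (the source row and column choices here are illustrative, and Hudi's Spark SQL MERGE support carries its own constraints documented on the Hudi website):

```sql
MERGE INTO test_db.t21 AS t
USING (
  SELECT 1L AS pk, 20230519L AS ds, '2023-05-19 01:02:03' AS ut, 100L AS f0
) AS s
ON t.pk = s.pk AND t.ds = s.ds
WHEN MATCHED THEN UPDATE SET t.ut = s.ut, t.f0 = s.f0
```

Because this routes through the upsert path, the record key stays unique and `PartialUpdateAvroPayload` can merge the touched columns with the existing row.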
[GitHub] [hudi] eyjian commented on issue #8757: [SUPPORT] How to get a row of a primary key?
eyjian commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554061049 > Did you try update or merge into clause? Thank you. I will try UPDATE and MERGE, but I need upsert to update a wide table. Each row is in a different parquet file.
[GitHub] [hudi] hudi-bot commented on pull request #8638: added new exception types
hudi-bot commented on PR #8638: URL: https://github.com/apache/hudi/pull/8638#issuecomment-1554050386 ## CI report: * c8cf2d86b1be30d3215b3b6e89b8bda33a1fe5dc UNKNOWN * 333d9faa53e71ba535a7cb8c60ce8b350a33452c UNKNOWN * aa35b5562c16840b5ebf143009beac2c291de2c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17196)
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1554045388 ## CI report: * c8e2c682741b9364ed44c6c70cd3962404daa1e1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17204) * 1958203e67af53e5deca919e91208388bfde257c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17208)
[GitHub] [hudi] xushiyan commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC
xushiyan commented on PR #7469: URL: https://github.com/apache/hudi/pull/7469#issuecomment-1554043487 @LinMingQiang would you rebase master and resolve the conflicts please?
[GitHub] [hudi] xushiyan commented on pull request #8200: [MINOR] hoodie.datasource.write.row.writer.enable should set to be true.
xushiyan commented on PR #8200: URL: https://github.com/apache/hudi/pull/8200#issuecomment-1554024122 > > Oh, I got it, the default value in config is true. But I think it will not lead to differences in the sorting results > > You can test it; if the value is false, it will create a RDDCustomColumnsSortPartitioner whose class description is "A partitioner that does sorting based on specified column values for each RDD partition." Both RDDCustomColumnsSortPartitioner and RowCustomColumnsSortPartitioner should sort globally. If you observe a sorting issue, then it's a different bug to be fixed. Flipping this default value here is irrelevant to the sorting issue.
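The distinction under debate can be illustrated outside Spark: sorting each partition independently only yields a globally sorted result if the data was first range-partitioned on the sort key. A minimal Java sketch (not Hudi code; the partition contents are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortScope {
    // Sort each "partition" independently, as a per-RDD-partition sorter would,
    // then concatenate the partitions in their original order.
    static List<Integer> sortWithinPartitions(List<List<Integer>> partitions) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            List<Integer> copy = new ArrayList<>(p);
            Collections.sort(copy);
            out.addAll(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two partitions whose key ranges overlap, i.e. NOT range-partitioned.
        List<List<Integer>> parts = Arrays.asList(
            Arrays.asList(5, 1, 9),
            Arrays.asList(4, 2, 8));
        // Per-partition sort: each run is ordered, but the whole is not.
        System.out.println(sortWithinPartitions(parts)); // [1, 5, 9, 2, 4, 8]
        // Global sort for comparison.
        List<Integer> global = new ArrayList<>();
        parts.forEach(global::addAll);
        Collections.sort(global);
        System.out.println(global); // [1, 2, 4, 5, 8, 9]
    }
}
```

So whether the two partitioners agree comes down to whether the preceding repartition step range-distributes rows by the sort columns, which is the point xushiyan is making: a discrepancy would be a partitioning bug, not a reason to flip the default.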
[jira] [Updated] (HUDI-4370) Support JsonConverter in Kafka Connect sink
[ https://issues.apache.org/jira/browse/HUDI-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-4370: - Fix Version/s: 0.14.0 (was: 1.0.0) > Support JsonConverter in Kafka Connect sink > --- > > Key: HUDI-4370 > URL: https://issues.apache.org/jira/browse/HUDI-4370 > Project: Apache Hudi > Issue Type: New Feature > Components: kafka-connect >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Currently, "org.apache.kafka.connect.json.JsonConverter" is not supported. > We need to hook up the logic for converting the json String to Avro record > like StringConverter.
[jira] [Updated] (HUDI-4388) Structured streaming improvements in Hudi streaming Source and Sink
[ https://issues.apache.org/jira/browse/HUDI-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-4388: - Fix Version/s: 0.14.0 > Structured streaming improvements in Hudi streaming Source and Sink > -- > > Key: HUDI-4388 > URL: https://issues.apache.org/jira/browse/HUDI-4388 > Project: Apache Hudi > Issue Type: Epic >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 0.14.0, 1.0.0 > > > All improvements to structured streaming with HoodieStreamingSink and > HoodieStreamSource captured in this epic.
[jira] [Closed] (HUDI-3940) Lock manager does not increment retry count upon exception
[ https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-3940. - Resolution: Fixed > Lock manager does not increment retry count upon exception > -- > > Key: HUDI-3940 > URL: https://issues.apache.org/jira/browse/HUDI-3940 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1, 0.14.0, 0.12.3, 0.13.0, 0.12.1, 0.12.0 > > > Came up while debugging CI failure: > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8
[jira] [Updated] (HUDI-3940) Lock manager does not increment retry count upon exception
[ https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-3940: -- Fix Version/s: 0.13.1 0.12.3 0.13.0 0.12.1 0.12.0 > Lock manager does not increment retry count upon exception > -- > > Key: HUDI-3940 > URL: https://issues.apache.org/jira/browse/HUDI-3940 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0, 0.12.1, 0.13.0, 0.13.1, 0.12.3, 0.14.0 > > > Came up while debugging CI failure: > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8200: [MINOR] hoodie.datasource.write.row.writer.enable should set to be true.
nsivabalan commented on code in PR #8200: URL: https://github.com/apache/hudi/pull/8200#discussion_r1198551786 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java: ## @@ -108,7 +108,7 @@ public HoodieWriteMetadata> performClustering(final Hood Stream> writeStatusesStream = FutureUtils.allOf( clusteringPlan.getInputGroups().stream() .map(inputGroup -> { - if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) { + if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", true)) { Review Comment: Let's also consider issues like https://github.com/apache/hudi/issues/8259 before we make this the default.
[GitHub] [hudi] bvaradar commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
bvaradar commented on PR #8303: URL: https://github.com/apache/hudi/pull/8303#issuecomment-1554000815 @jonvex : Is this ready for review?
[jira] [Updated] (HUDI-3940) Lock manager does not increment retry count upon exception
[ https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3940: - Fix Version/s: 0.14.0 (was: 1.0.0) > Lock manager does not increment retry count upon exception > -- > > Key: HUDI-3940 > URL: https://issues.apache.org/jira/browse/HUDI-3940 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Came up while debugging CI failure: > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1553999096 ## CI report: * 8bec3af536b80ec5838556f1337d13f06251b0ea Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17178) * c8e2c682741b9364ed44c6c70cd3962404daa1e1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17204) * 1958203e67af53e5deca919e91208388bfde257c UNKNOWN
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
rmahindra123 commented on code in PR #8574: URL: https://github.com/apache/hudi/pull/8574#discussion_r1198546865 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java: ## @@ -93,9 +105,17 @@ public List getTransformersNames() { @Override public Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties) { Dataset dataset = rowDataset; +Option incomingSchemaOpt = sourceSchemaOpt; +if (!sourceSchemaOpt.isPresent()) { Review Comment: nit: sourceSchemaOpt -> incomingSchemaOpt
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8747: [HUDI-6233] Fix table client conf in AlterTableCommand
Zouxxyy commented on code in PR #8747: URL: https://github.com/apache/hudi/pull/8747#discussion_r1198545658 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTable.scala: ## @@ -200,6 +201,13 @@ class TestAlterTable extends HoodieSparkSqlTestBase { checkAnswer(s"select id, name, price, ts, dt from $tableName2")( Seq(1, "a1", 10.0, 1000, null) ) + +if (HoodieSparkUtils.gteqSpark3_1) { + withSQLConf("hoodie.schema.on.read.enable" -> "true") { Review Comment: @danny0405 AlterTableCommand only works on Spark DataSource V2, which is controlled by `hoodie.schema.on.read.enable`
[GitHub] [hudi] danny0405 commented on a diff in pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
danny0405 commented on code in PR #8755: URL: https://github.com/apache/hudi/pull/8755#discussion_r1198544786 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/StatsFileSizeProcedure.scala: ## @@ -54,8 +55,22 @@ class StatsFileSizeProcedure extends BaseProcedure with ProcedureBuilder { val globRegex = getArgValueOrDefault(args, parameters(1)).get.asInstanceOf[String] val limit: Int = getArgValueOrDefault(args, parameters(2)).get.asInstanceOf[Int] val basePath = getBasePath(table) -val fs = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build.getFs -val globPath = String.format("%s/%s/*", basePath, globRegex) +val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build +val fs = metaClient.getFs +val isTablePartitioned = metaClient.getTableConfig.isTablePartitioned +val maximumPartitionDepth = if (isTablePartitioned) metaClient.getTableConfig.getPartitionFields.get.length else 0 +val globPath = (metaClient.getTableConfig.isTablePartitioned, globRegex) match { Review Comment: ```suggestion val globPath = (isTablePartitioned, globRegex) match { ```
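For readers skimming the thread: the bug being fixed is that an empty glob regex used to produce a malformed glob like `basePath//*`, so the patch falls back to a glob derived from the table's partition depth. A rough Java sketch of that idea (method names and the exact fallback are hypothetical simplifications of the Scala in `StatsFileSizeProcedure.scala`):

```java
public class GlobPathBuilder {
    // Hypothetical simplification of the fix: when the caller passes an empty
    // glob regex, build one wildcard level per partition column plus one for
    // the data files, instead of formatting "basePath//*".
    static String buildGlobPath(String basePath, String globRegex, int partitionDepth) {
        if (globRegex == null || globRegex.isEmpty()) {
            StringBuilder sb = new StringBuilder(basePath);
            for (int i = 0; i <= partitionDepth; i++) {
                sb.append("/*");
            }
            return sb.toString();
        }
        // Non-empty regex keeps the original behavior.
        return String.format("%s/%s/*", basePath, globRegex);
    }

    public static void main(String[] args) {
        System.out.println(buildGlobPath("/tmp/t1", "", 0));       // /tmp/t1/*
        System.out.println(buildGlobPath("/tmp/t1", "", 2));       // /tmp/t1/*/*/*
        System.out.println(buildGlobPath("/tmp/t1", "2023/*", 2)); // /tmp/t1/2023/*/*
    }
}
```

The partition depth comes from `getPartitionFields.get.length` in the actual patch, which is why the non-partitioned case reduces to a single `/*`.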
[GitHub] [hudi] BruceKellan commented on a diff in pull request #7561: [HUDI-5477] Optimize timeline loading in Hudi sync client
BruceKellan commented on code in PR #7561: URL: https://github.com/apache/hudi/pull/7561#discussion_r1198544567 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java: ## @@ -210,11 +210,30 @@ public static HoodieDefaultTimeline getTimeline(HoodieTableMetaClient metaClient return activeTimeline; } + /** + * Returns a Hudi timeline with commits after the given instant time (exclusive). + * + * @param metaClient{@link HoodieTableMetaClient} instance. + * @param exclusiveStartInstantTime Start instant time (exclusive). + * @return Hudi timeline. + */ + public static HoodieTimeline getCommitsTimelineAfter( + HoodieTableMetaClient metaClient, String exclusiveStartInstantTime) { +HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); +HoodieDefaultTimeline timeline = +activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime) +? metaClient.getArchivedTimeline(exclusiveStartInstantTime) +.mergeTimeline(activeTimeline) +: activeTimeline; +return timeline.getCommitsTimeline() +.findInstantsAfter(exclusiveStartInstantTime, Integer.MAX_VALUE); + } Review Comment: @yihua I have a doubt, since rollback and commit are archived separately, is it possible that there is a very early rollback instant, causing `activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime)` to return false?
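BruceKellan's question concerns the branch that decides whether the archived timeline must be merged in before filtering. A toy Java model of that branch (instants simplified to sorted timestamp strings, "before timeline starts" modeled as "earlier than the first active instant"; this is not Hudi's actual API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TimelineAfter {
    // If the exclusive start instant predates the active timeline, merge the
    // archived timeline in before filtering; otherwise the active timeline
    // alone already contains every instant after the start.
    static List<String> commitsAfter(List<String> archived, List<String> active, String start) {
        boolean beforeActiveStart = active.isEmpty() || start.compareTo(active.get(0)) < 0;
        Stream<String> timeline = beforeActiveStart
            ? Stream.concat(archived.stream(), active.stream())
            : active.stream();
        return timeline.filter(t -> t.compareTo(start) > 0)
                       .sorted()
                       .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> archived = Arrays.asList("001", "002");
        List<String> active = Arrays.asList("003", "004");
        System.out.println(commitsAfter(archived, active, "001")); // [002, 003, 004]
        System.out.println(commitsAfter(archived, active, "003")); // [004]
    }
}
```

The doubt raised above maps to this model directly: if a stray early instant (e.g. an old rollback) lingers in the active timeline, the `beforeActiveStart` check can evaluate to false and the archived commits would be skipped.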
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1553993821 ## CI report: * cc0da2372d50d99c98c2ce4bcbe5a60303bde938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17195)
[GitHub] [hudi] danny0405 commented on a diff in pull request #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
danny0405 commented on code in PR #8762: URL: https://github.com/apache/hudi/pull/8762#discussion_r1198543483 ## hudi-common/src/main/java/org/apache/hudi/exception/HoodieInvalidInstantException.java: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.exception; + +/** + * Exception thrown for invalid instants whose name doesn't follow instant name format. + */ +public class HoodieInvalidInstantException extends HoodieException { + + public HoodieInvalidInstantException(String msg) { +super(msg); + } + Review Comment: Not a fan of checked exception, just give a more detailed exception msg should be fine.
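As a sketch of the unchecked-exception style being suggested, with the diagnostic detail carried in the message rather than forcing callers to catch a checked type (the exception here extends `RuntimeException` directly, and the 17-digit instant format check is an illustrative assumption, not Hudi's actual validation code):

```java
public class InvalidInstantDemo {
    // Unchecked: callers are not forced to declare or catch it, but the
    // message alone tells them what was wrong and what was expected.
    static class InvalidInstantException extends RuntimeException {
        InvalidInstantException(String instant) {
            super("Invalid instant '" + instant + "': expected format yyyyMMddHHmmssSSS");
        }
    }

    static long parseInstant(String instant) {
        if (!instant.matches("\\d{17}")) {
            throw new InvalidInstantException(instant);
        }
        return Long.parseLong(instant);
    }

    public static void main(String[] args) {
        try {
            parseInstant("not-an-instant");
        } catch (InvalidInstantException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The trade-off: a checked exception makes mishandling a compile error, while the unchecked style keeps timeline-parsing call sites clean and relies on the message for debuggability, which is the preference expressed in the review.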
[GitHub] [hudi] danny0405 commented on a diff in pull request #8747: [HUDI-6233] Fix table client conf in AlterTableCommand
danny0405 commented on code in PR #8747: URL: https://github.com/apache/hudi/pull/8747#discussion_r1198542011 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTable.scala: ## @@ -200,6 +201,13 @@ class TestAlterTable extends HoodieSparkSqlTestBase { checkAnswer(s"select id, name, price, ts, dt from $tableName2")( Seq(1, "a1", 10.0, 1000, null) ) + +if (HoodieSparkUtils.gteqSpark3_1) { + withSQLConf("hoodie.schema.on.read.enable" -> "true") { Review Comment: Can you explain a little more why we need this?
[jira] [Updated] (HUDI-3049) Use flink table name as default synced hive table name
[ https://issues.apache.org/jira/browse/HUDI-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3049: - Fix Version/s: 0.14.0 > Use flink table name as default synced hive table name > -- > > Key: HUDI-3049 > URL: https://issues.apache.org/jira/browse/HUDI-3049 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Danny Chen >Priority: Major > Fix For: 0.14.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3409) Expose Timeline Server Metrics
[ https://issues.apache.org/jira/browse/HUDI-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3409: - Fix Version/s: 0.14.0 (was: 1.0.0) > Expose Timeline Server Metrics > -- > > Key: HUDI-3409 > URL: https://issues.apache.org/jira/browse/HUDI-3409 > Project: Apache Hudi > Issue Type: Improvement > Components: timeline-server >Reporter: DarAmani Swift >Assignee: Rajesh >Priority: Major > Labels: new-to-hudi > Fix For: 0.14.0 > > > Timeline server metrics are pushed to local registry but never going to > reporters. Exposing these metrics would greatly improve debugging latency > around async processes and timeline server syncs. > Metrics are already captured in the [Request > Handler|https://github.com/apache/hudi/blob/master/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java#L527-L531]
[jira] [Closed] (HUDI-6208) Fix jetty conflicts in the packaging process
[ https://issues.apache.org/jira/browse/HUDI-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6208. Fix Version/s: 0.14.0 Resolution: Fixed Fixed via master branch: 0d55c9d4a93957b0cbdbc4e7a6b3cf79e8d348fe > Fix jetty conflicts in the packaging process > > > Key: HUDI-6208 > URL: https://issues.apache.org/jira/browse/HUDI-6208 > Project: Apache Hudi > Issue Type: Bug > Components: timeline-server >Affects Versions: 0.14.0 > Environment: hudi-master >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Attachments: image-2023-05-15-09-48-18-179.png > > > !image-2023-05-15-09-48-18-179.png! > > > [[HUDI-6208]Fix jetty conflicts in the packaging process by eric9204 · Pull > Request #8706 · apache/hudi > (github.com)|https://github.com/apache/hudi/pull/8706]
[hudi] branch master updated (0b87e143cfe -> 0d55c9d4a93)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 0b87e143cfe [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520) add 0d55c9d4a93 [HUDI-6208] Fix jetty conflicts in the packaging process (#8706) No new revisions were added by this update. Summary of changes: hudi-timeline-service/pom.xml | 6 ++ 1 file changed, 6 insertions(+)
[GitHub] [hudi] danny0405 merged pull request #8706: [HUDI-6208] Fix jetty conflicts in the packaging process
danny0405 merged PR #8706: URL: https://github.com/apache/hudi/pull/8706
[GitHub] [hudi] danny0405 commented on pull request #8760: [HUDI-6238] Disabling clustering for single file group
danny0405 commented on PR #8760: URL: https://github.com/apache/hudi/pull/8760#issuecomment-1553986355 > since the stats are going to remain intact before and after sorting (total valid values, min and max). So, even when sorting is enabled, we should not trigger clustering when file group count is just 1 I don't think so; the sorting is not for column stats, it is for query optimization. When the parquet file is sorted, fewer row groups need to be touched while filtering the columns by filter predicates.
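To illustrate the point above: when a file is sorted on a filtered column, each row group's min/max range is narrow, so a predicate can skip most groups; with scattered values, every group's range covers the predicate and nothing is skipped. The following standalone Java sketch is an illustration, not Hudi or Parquet API; it models row groups purely by per-group min/max stats.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruning {

  // Models a parquet-style row group by the min/max stats of one column.
  static record RowGroup(long min, long max) {}

  // Split a column's values into fixed-size "row groups" and record min/max per group.
  static List<RowGroup> rowGroups(long[] values, int groupSize) {
    List<RowGroup> groups = new ArrayList<>();
    for (int i = 0; i < values.length; i += groupSize) {
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      for (int j = i; j < Math.min(i + groupSize, values.length); j++) {
        min = Math.min(min, values[j]);
        max = Math.max(max, values[j]);
      }
      groups.add(new RowGroup(min, max));
    }
    return groups;
  }

  // A group must be scanned only if its [min, max] range can contain the predicate value.
  static long groupsToScan(List<RowGroup> groups, long target) {
    return groups.stream().filter(g -> g.min() <= target && target <= g.max()).count();
  }

  public static void main(String[] args) {
    long[] sorted = new long[100];
    long[] scattered = new long[100];
    for (int i = 0; i < 100; i++) {
      sorted[i] = i;
      scattered[i] = (i * 37) % 100; // same values, spread across all groups
    }
    // Sorted layout: the predicate value 42 falls inside exactly one group's range.
    System.out.println(groupsToScan(rowGroups(sorted, 10), 42L));    // 1
    // Scattered layout: every group's [min, max] covers 42, so nothing is skipped.
    System.out.println(groupsToScan(rowGroups(scattered, 10), 42L)); // 10
  }
}
```

The aggregate stats (count, global min, global max) are indeed identical before and after sorting, which is exactly why sorting only pays off at scan time, not in the stats themselves.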
[GitHub] [hudi] danny0405 commented on pull request #8745: [HUDI-6182] Hive sync use state transient time to avoid losing partit…
danny0405 commented on PR #8745: URL: https://github.com/apache/hudi/pull/8745#issuecomment-1553984735 Ping me again when it is ready to review.
[GitHub] [hudi] xushiyan commented on pull request #7359: [HUDI-3304] WIP - Allow selective partial update
xushiyan commented on PR #7359: URL: https://github.com/apache/hudi/pull/7359#issuecomment-1553974629 @bschell are you still working on this? the title says WIP
[GitHub] [hudi] hudi-bot commented on pull request #8763: [HUDI-6239] fix clustering pool scheduler conf not take effect bug
hudi-bot commented on PR #8763: URL: https://github.com/apache/hudi/pull/8763#issuecomment-1553970163 ## CI report: * 64e77789e493cf252accab22fec1267c9402009f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17206) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
hudi-bot commented on PR #8762: URL: https://github.com/apache/hudi/pull/8762#issuecomment-1553970142 ## CI report: * 9a2b1000c85524b5b541b4fc2d4d0b14eca30b44 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17205)
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1553970049 ## CI report: * 8bec3af536b80ec5838556f1337d13f06251b0ea Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17178) * c8e2c682741b9364ed44c6c70cd3962404daa1e1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17204)
[GitHub] [hudi] SteNicholas commented on pull request #8759: Add metrics counters for compaction start/stop events.
SteNicholas commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1553967689 @amrishlal, I don't think it's necessary to introduce the metrics in this pull request. The essence of the demand in the description is to monitor the different states of the compaction action. Therefore, we should introduce a metric for each state of the compaction action, updated in the state-change phase.
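What "a metric per compaction state" could look like in miniature: one counter keyed by the instant state (REQUESTED/INFLIGHT/COMPLETED, mirroring Hudi's timeline states), bumped on each state transition. This is an illustrative sketch only; Hudi's real metrics flow through HoodieMetrics and its registries, and the class and method names below are made up.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: one counter per compaction state, bumped at each transition.
public class CompactionStateMetrics {

  // Mirrors the instant states a compaction moves through on the Hudi timeline.
  public enum State { REQUESTED, INFLIGHT, COMPLETED }

  private final Map<State, LongAdder> counters = new EnumMap<>(State.class);

  public CompactionStateMetrics() {
    for (State s : State.values()) {
      counters.put(s, new LongAdder());
    }
  }

  // Call this whenever a compaction instant transitions into a new state.
  public void onTransition(State newState) {
    counters.get(newState).increment();
  }

  public long count(State state) {
    return counters.get(state).sum();
  }
}
```

Publishing per-state counters like these (rather than explicit start/stop events) gives the same monitoring signal while keeping the instrumentation in one place.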
[GitHub] [hudi] hudi-bot commented on pull request #8763: [HUDI-6239] fix clustering pool scheduler conf not take effect bug
hudi-bot commented on PR #8763: URL: https://github.com/apache/hudi/pull/8763#issuecomment-1553965685 ## CI report: * 64e77789e493cf252accab22fec1267c9402009f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
hudi-bot commented on PR #8762: URL: https://github.com/apache/hudi/pull/8762#issuecomment-1553965656 ## CI report: * 9a2b1000c85524b5b541b4fc2d4d0b14eca30b44 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1553965553 ## CI report: * 8bec3af536b80ec5838556f1337d13f06251b0ea Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17178) * c8e2c682741b9364ed44c6c70cd3962404daa1e1 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
hudi-bot commented on PR #8303: URL: https://github.com/apache/hudi/pull/8303#issuecomment-1553964669 ## CI report: * b8772a74388873c35b1a13ba6ef99ecda9246646 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17165) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17203)
[jira] [Updated] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kong Wei updated HUDI-6239: --- Status: In Progress (was: Open) > cluster-scheduling-weight and cluster-scheduling-minShare not take effect in > deltastreamer > -- > > Key: HUDI-6239 > URL: https://issues.apache.org/jira/browse/HUDI-6239 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: Kong Wei >Assignee: Kong Wei >Priority: Minor > Attachments: image-2023-05-19-11-04-45-541.png, > image-2023-05-19-11-05-41-056.png > > > The method > org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig > generates the spark scheduler conf for deltasync, compaction and clustering, > but the clustering scheduler conf does not take effect. > SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools > !image-2023-05-19-11-04-45-541.png! > while generateConfig takes 3 pools as parameters. > !image-2023-05-19-11-05-41-056.png! >
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198516655 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add some explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: +```java +public class HoodiePartitionTTLPolicy { + public enum TTLPolicy { +KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT + } + + // Partition spec for which the policy takes effect + private String partitionSpec; + + private TTLPolicy policy; + + private long policyValue; +} +``` + +### User Interface for TTL policy +Users can config partition TTL management policies through SparkSQL Call Command and through table config directly. Assume that the user has a hudi table partitioned by two fields(user_id, ts), he can config partition TTL policies as follows. + +```sql +-- Set default policy for all user_id, which keeps the data for 30 days. 
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=*/', policy => 'KEEP_BY_TIME', policyValue => '30'); Review Comment: @stream2000, could the `add_ttl_policy` procedure add the `type` to specify the ttl policy type, which value could be `partition` etc? ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. Ho
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r119851 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add some explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: +```java +public class HoodiePartitionTTLPolicy { + public enum TTLPolicy { +KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT + } + + // Partition spec for which the policy takes effect + private String partitionSpec; + + private TTLPolicy policy; + + private long policyValue; +} +``` + +### User Interface for TTL policy +Users can config partition TTL management policies through SparkSQL Call Command and through table config directly. Assume that the user has a hudi table partitioned by two fields(user_id, ts), he can config partition TTL policies as follows. + +```sql +-- Set default policy for all user_id, which keeps the data for 30 days. +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=*/', policy => 'KEEP_BY_TIME', policyValue => '30'); + +--For partition user_id=1/, keep 10 sub partitions. 
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=1/', policy => 'KEEP_BY_COUNT', policyValue => '10'); + +--For partition user_id=2/, keep 100GB data in total +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=2/', policy => 'KEEP_BY_SIZE', policyValue => '107374182400'); + +--For partition user_id=3/, keep the data for 7 day. +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=3/', policy => 'KEEP_BY_TIME', policyValue => '7'); + +-- Show all the TTL policies including default and explicit policies +call show_ttl_policies(table => 'test'); +user_id=*/ KEEP_BY_TIME30 +user_id=1/ KEEP_BY_COUNT 10 +user_id=2/ KEEP_BY_SIZE107374182400 +user_id=3/ KEEP_BY_TIME7 +``` + +### Storage for TTL policy +The partition TTL policies will be stored in `hoodie.properties`since it is part of table metadata. The policy configs in `hoodie.properties`are defined as follows. Explicit policies are defined using a JSON array while default policy is de
[GitHub] [hudi] hudi-bot commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
hudi-bot commented on PR #8303: URL: https://github.com/apache/hudi/pull/8303#issuecomment-1553959688 ## CI report: * b8772a74388873c35b1a13ba6ef99ecda9246646 UNKNOWN
[GitHub] [hudi] waitingF opened a new pull request, #8763: [HUDI-6239] fix clustering pool scheduler conf not take effect bug
waitingF opened a new pull request, #8763: URL: https://github.com/apache/hudi/pull/8763 ### Change Logs The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect: SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools ![image](https://github.com/apache/hudi/assets/19326824/2673af27-e6b9-42cc-88ff-d3deab21793b) while generateConfig takes 3 pools as parameters. ![image](https://github.com/apache/hudi/assets/19326824/67ca6e43-3e49-4724-8e64-d69cb57d9991) ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
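The essence of the fix is that the scheduling template must declare one pool per job that is supposed to get its own pool. A hedged sketch of what a three-pool fair-scheduler allocation template could look like; the pool names, weights, and template string here are illustrative assumptions, not the actual constants in Hudi's SchedulerConfGenerator.

```java
public class SchedulerConfSketch {

  // Hypothetical template: the real constant in Hudi's SchedulerConfGenerator differs.
  // The bug described in the PR is a 2-pool template being filled by a 3-pool generator.
  private static final String SPARK_SCHEDULING_PATTERN =
      "<?xml version=\"1.0\"?>\n"
          + "<allocations>\n"
          + "  <pool name=\"%s\"><schedulingMode>FAIR</schedulingMode>"
          + "<weight>%s</weight><minShare>%s</minShare></pool>\n"
          + "  <pool name=\"%s\"><schedulingMode>FAIR</schedulingMode>"
          + "<weight>%s</weight><minShare>%s</minShare></pool>\n"
          + "  <pool name=\"%s\"><schedulingMode>FAIR</schedulingMode>"
          + "<weight>%s</weight><minShare>%s</minShare></pool>\n"
          + "</allocations>";

  // One (name, weight, minShare) triple per pool: deltasync, compaction, clustering.
  static String generateConfig(int dsWeight, int dsMinShare,
                               int cpWeight, int cpMinShare,
                               int clWeight, int clMinShare) {
    return String.format(SPARK_SCHEDULING_PATTERN,
        "hoodiedeltasync", dsWeight, dsMinShare,
        "hoodiecompact", cpWeight, cpMinShare,
        "hoodiecluster", clWeight, clMinShare);
  }

  public static void main(String[] args) {
    // The generated XML now contains a pool entry for the clustering job as well.
    System.out.println(generateConfig(1, 1, 2, 1, 3, 1));
  }
}
```

With only two `%s` pool slots in the template, the third triple passed to `String.format` is silently ignored, which matches the reported symptom of the clustering pool conf never taking effect.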
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198516655 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add some explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: +```java +public class HoodiePartitionTTLPolicy { + public enum TTLPolicy { +KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT + } + + // Partition spec for which the policy takes effect + private String partitionSpec; + + private TTLPolicy policy; + + private long policyValue; +} +``` + +### User Interface for TTL policy +Users can config partition TTL management policies through SparkSQL Call Command and through table config directly. Assume that the user has a hudi table partitioned by two fields(user_id, ts), he can config partition TTL policies as follows. + +```sql +-- Set default policy for all user_id, which keeps the data for 30 days. +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=*/', policy => 'KEEP_BY_TIME', policyValue => '30'); Review Comment: Could the `add_ttl_policy` procedure add the `type` to specify the ttl policy type, which value could be `partition` etc. 
## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198516174 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When the sub-partition count exceeds N, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but ensures that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. Users need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: ```java +public class HoodiePartitionTTLPolicy { Review Comment: Could we introduce a `HoodieTTLPolicy` interface? Then `HoodiePartitionTTLPolicy` implements `HoodieTTLPolicy`. `HoodieRecordTTLPolicy` could also implement this interface in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
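The reviewer's suggested hierarchy — a shared `HoodieTTLPolicy` interface that `HoodiePartitionTTLPolicy` (and a future `HoodieRecordTTLPolicy`) would implement — could look roughly like the sketch below. Only the type names and the three policy types come from the RFC and review; the fields and methods are assumptions for illustration, not the actual RFC class shape:

```java
// Hedged sketch of the interface hierarchy suggested in the review.
// Fields and method signatures are illustrative, not from the RFC.
public class TTLPolicySketch {

  // The three policy types named in the RFC.
  enum PolicyType { KEEP_BY_TIME, KEEP_BY_COUNT, KEEP_BY_SIZE }

  // Hypothetical shared interface (the reviewer's suggestion): partition-level
  // and future record-level policies would both implement it.
  interface HoodieTTLPolicy {
    PolicyType getPolicyType();
    long getPolicyValue(); // days, sub-partition count, or bytes, depending on type
  }

  // Partition-level policy from the RFC, as one implementation of the interface.
  static class HoodiePartitionTTLPolicy implements HoodieTTLPolicy {
    private final String partitionSpec; // e.g. "user_id=2"; empty for the default policy
    private final PolicyType policyType;
    private final long policyValue;

    HoodiePartitionTTLPolicy(String partitionSpec, PolicyType policyType, long policyValue) {
      this.partitionSpec = partitionSpec;
      this.policyType = policyType;
      this.policyValue = policyValue;
    }

    @Override public PolicyType getPolicyType() { return policyType; }
    @Override public long getPolicyValue() { return policyValue; }
    String getPartitionSpec() { return partitionSpec; }
  }

  public static void main(String[] args) {
    // The RFC's example: sub-partitions of user_id=2 expire 10 days after last modification.
    HoodieTTLPolicy policy = new HoodiePartitionTTLPolicy("user_id=2", PolicyType.KEEP_BY_TIME, 10);
    System.out.println(policy.getPolicyType() + " " + policy.getPolicyValue()); // prints KEEP_BY_TIME 10
  }
}
```

With this shape, a record-level `HoodieRecordTTLPolicy` would slot in as another implementation without changing callers that only consume `HoodieTTLPolicy`.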
[GitHub] [hudi] SteNicholas commented on pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on PR #8062: URL: https://github.com/apache/hudi/pull/8062#issuecomment-1553957249 @stream2000, could we also introduce record TTL management? Partition TTL management and record TTL management both need the TTL policy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] boneanxs opened a new pull request, #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
boneanxs opened a new pull request, #8762: URL: https://github.com/apache/hudi/pull/8762 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ 1. Avoid having acronyms in the API 2. Give more context for the exception 3. Ensure time travel won't be affected by this ### Impact _Describe any public API or user-facing feature change or any performance impact._ none ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kong Wei updated HUDI-6239: --- Priority: Minor (was: Major) > cluster-scheduling-weight and cluster-scheduling-minShare not take effect in > deltastreamer > -- > > Key: HUDI-6239 > URL: https://issues.apache.org/jira/browse/HUDI-6239 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer > Reporter: Kong Wei > Assignee: Kong Wei > Priority: Minor > Attachments: image-2023-05-19-11-04-45-541.png, image-2023-05-19-11-05-41-056.png > > > The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the Spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect. > SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools: > !image-2023-05-19-11-04-45-541.png! > while generateConfig takes 3 pools as parameters: > !image-2023-05-19-11-05-41-056.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
hudi-bot commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1553940371 ## CI report: * 2b0ddb3813e46f5f71a357f1fc2191801b17beb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17191) * 9792b4220f6fc0700975a3883e19336d21457020 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17202) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kong Wei reassigned HUDI-6239: -- Assignee: Kong Wei > cluster-scheduling-weight and cluster-scheduling-minShare not take effect in > deltastreamer > -- > > Key: HUDI-6239 > URL: https://issues.apache.org/jira/browse/HUDI-6239 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer > Reporter: Kong Wei > Assignee: Kong Wei > Priority: Major > Attachments: image-2023-05-19-11-04-45-541.png, image-2023-05-19-11-05-41-056.png > > > The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the Spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect. > SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools: > !image-2023-05-19-11-04-45-541.png! > while generateConfig takes 3 pools as parameters: > !image-2023-05-19-11-05-41-056.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
Kong Wei created HUDI-6239: -- Summary: cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer Key: HUDI-6239 URL: https://issues.apache.org/jira/browse/HUDI-6239 Project: Apache Hudi Issue Type: Bug Components: deltastreamer Reporter: Kong Wei Attachments: image-2023-05-19-11-04-45-541.png, image-2023-05-19-11-05-41-056.png The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the Spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect. SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools: !image-2023-05-19-11-04-45-541.png! while generateConfig takes 3 pools as parameters: !image-2023-05-19-11-05-41-056.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
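The mismatch HUDI-6239 describes (a pattern with slots for 2 pools, a config generated for 3) can be illustrated with a minimal sketch. The template string below is a hypothetical stand-in, not the real `SPARK_SCHEDULING_PATTERN` (which is a longer fairscheduler XML template), but the failure mode is the same: `String.format` silently ignores surplus arguments, so the clustering pool never reaches the generated conf.

```java
public class SchedulingPatternSketch {
  public static void main(String[] args) {
    // Hypothetical stand-in for SPARK_SCHEDULING_PATTERN with slots for only
    // two pools (deltasync and compaction).
    String twoPoolTemplate =
        "<allocations><pool name=\"%s\"/><pool name=\"%s\"/></allocations>";

    // generateConfig passes three pool names; java.util.Formatter ignores
    // arguments beyond the format specifiers, dropping the clustering pool.
    String conf = String.format(twoPoolTemplate,
        "hoodiedeltasync", "hoodiecompact", "hoodiecluster");

    System.out.println(conf.contains("hoodiecompact"));  // prints true
    System.out.println(conf.contains("hoodiecluster")); // prints false
  }
}
```

The fix implied by the report is simply to extend the template with a third pool slot so that the clustering weight/minShare arguments are actually consumed.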
[GitHub] [hudi] hudi-bot commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
hudi-bot commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1553936360 ## CI report: * 2b0ddb3813e46f5f71a357f1fc2191801b17beb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17191) * 9792b4220f6fc0700975a3883e19336d21457020 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] samserpoosh opened a new issue, #8761: [SUPPORT] "Illegal Lambda Deserialization" When Leveraging PostgresDebeziumSource
samserpoosh opened a new issue, #8761: URL: https://github.com/apache/hudi/issues/8761 ### Describe The Problem You Faced I'm trying to get Postgres CDC events published to Kafka by Debezium ingested into a partitioned Hudi table in S3. I'm currently testing this E2E data flow using a dummy and pretty simple DB table. When submitting the DeltaStreamer job, it throws the "**Illegal Lambda Deserialization**" exception. ### To Reproduce - My `spark-submit` command:
```
spark-submit \
  --jars "opt/spark/jars/hudi-utils-bundle.jar,..." \
  --master spark://:7077 \
  --total-executor-cores 1 \
  --executor-memory 4g \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.connection.maximum=1 \
  --conf spark.scheduler.mode=FAIR \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer opt/spark/jars/hudi-utils-bundle.jar \
  --table-type COPY_ON_WRITE \
  --target-base-path s3a://path/to/samser_customers \
  --target-table samser_customers \
  --min-sync-interval-seconds 30 \
  --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
  --payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
  --source-ordering-field _event_lsn \
  --op UPSERT \
  --continuous \
  --source-limit 5000 \
  --hoodie-conf bootstrap.servers=:9092 \
  --hoodie-conf group.id= \
  --hoodie-conf schema.registry.url=http://:8081 \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://:8081/subjects/-value/versions/1 \
  --hoodie-conf hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic= \
  --hoodie-conf auto.offset.reset=earliest \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=name \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --hoodie-conf hoodie.datasource.write.precombine.field=_event_lsn \
  --hoodie-conf hoodie.metadata.enable=true \
  --hoodie-conf hoodie.metadata.index.column.stats.enable=true \
  --hoodie-conf hoodie.parquet.small.file.limit=134217728
```
- Relevant Debezium-PG configuration:
```
class: io.debezium.connector.postgresql.PostgresConnector
plugin.name: pgoutput
database.hostname:
database.port: 5432
database.user:
database.password:
database.dbname :
topic.prefix:
schema.include.list: public
key.converter: io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url: http://:8081
```
[GitHub] [hudi] c-f-cooper commented on issue #8651: [SUPPORT]How to resolve small file?
c-f-cooper commented on issue #8651: URL: https://github.com/apache/hudi/issues/8651#issuecomment-1553931467 ![image](https://github.com/apache/hudi/assets/25735549/8b816399-ede9-4a2a-97b5-d28e7ef3b1e4) ![023D4646-7D12-4606-8188-0F1A05DE47C5_1_102_o](https://github.com/apache/hudi/assets/25735549/af7586f4-5be8-4efd-a895-d56108c8c878) I found that the async clustering was scheduled, but not executed @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8747: [HUDI-6233] Fix table client conf in AlterTableCommand
hudi-bot commented on PR #8747: URL: https://github.com/apache/hudi/pull/8747#issuecomment-1553931308 ## CI report: * 72b2d6da4377e18a600857d9ae6eb2766c786c12 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17192) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8752: [HUDI-6236] write hive_style_partitioning_enable to table config in D…
hudi-bot commented on PR #8752: URL: https://github.com/apache/hudi/pull/8752#issuecomment-1553931378 ## CI report: * 7762747d22f8ffade79936aa3465db3ce89045db UNKNOWN * f9b5f2d4727ffabc20e3e28e78e49009f6fa221e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17189) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17201) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] voonhous commented on pull request #8752: [HUDI-6236] write hive_style_partitioning_enable to table config in D…
voonhous commented on PR #8752: URL: https://github.com/apache/hudi/pull/8752#issuecomment-1553924948 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] eric9204 commented on a diff in pull request #8706: [HUDI-6208] Fix jetty conflicts in the packaging process
eric9204 commented on code in PR #8706: URL: https://github.com/apache/hudi/pull/8706#discussion_r1198487611 ## hudi-timeline-service/pom.xml: ## @@ -87,6 +87,12 @@ kryo-shaded + + org.eclipse.jetty + jetty-util + ${jetty.version} Review Comment: Thanks for your help! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] boneanxs commented on a diff in pull request #7627: [HUDI-5517] HoodieTimeline support filter instants by state transition time
boneanxs commented on code in PR #7627: URL: https://github.com/apache/hudi/pull/7627#discussion_r1198483774 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestStreamingSource.scala: ## @@ -51,45 +53,46 @@ class TestStreamingSource extends StreamTest { withTempDir { inputDir => val tablePath = s"${inputDir.getCanonicalPath}/test_cow_stream" HoodieTableMetaClient.withPropertyBuilder() - .setTableType(COPY_ON_WRITE) - .setTableName(getTableName(tablePath)) - .setPayloadClassName(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.defaultValue) +.setTableType(COPY_ON_WRITE) +.setTableName(getTableName(tablePath)) + .setPayloadClassName(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.defaultValue) .setPreCombineField("ts") - .initTable(spark.sessionState.newHadoopConf(), tablePath) +.initTable(spark.sessionState.newHadoopConf(), tablePath) addData(tablePath, Seq(("1", "a1", "10", "000"))) val df = spark.readStream .format("org.apache.hudi") +.option(DataSourceReadOptions.READ_BY_STATE_TRANSITION_TIME.key(), useTransitionTime) Review Comment: By default `useTransitionTime` is false, and this test covers the default commit instant time, while `TestStreamSourceReadByStateTransitionTime` extends this class and overrides `useTransitionTime` to true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 0b87e143cfe [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520) 0b87e143cfe is described below commit 0b87e143cfe237ddc005f610d208d1bf36432ba3 Author: harshal AuthorDate: Fri May 19 07:02:11 2023 +0530 [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520) - Adds ERROR_TABLE_CURRUPT_RECORD_COL_NAME as a null value column if the error table is enabled for transformers and the column does not exist in the dataset. - Adds validation for ERROR_TABLE_CURRUPT_RECORD_COL_NAME column to be part of the transformer in cases of error table is enabled/disabled. --- .../org/apache/hudi/utilities/UtilHelpers.java | 7 +- .../hudi/utilities/deltastreamer/DeltaSync.java| 4 +- .../utilities/deltastreamer/ErrorTableUtils.java | 33 - .../utilities/transform/ChainedTransformer.java| 8 +- .../ErrorTableAwareChainedTransformer.java | 58 .../TestErrorTableAwareChainedTransformer.java | 150 + 6 files changed, 250 insertions(+), 10 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java index 721ba2eb9f4..16ed7eadc1f 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java @@ -61,6 +61,7 @@ import org.apache.hudi.utilities.schema.postprocessor.ChainedSchemaPostProcessor import org.apache.hudi.utilities.sources.Source; import org.apache.hudi.utilities.sources.processor.ChainedJsonKafkaSourcePostProcessor; import org.apache.hudi.utilities.sources.processor.JsonKafkaSourcePostProcessor; +import 
org.apache.hudi.utilities.transform.ErrorTableAwareChainedTransformer; import org.apache.hudi.utilities.transform.ChainedTransformer; import org.apache.hudi.utilities.transform.Transformer; @@ -190,9 +191,11 @@ public class UtilHelpers { } - public static Option<Transformer> createTransformer(Option<List<String>> classNamesOpt) throws IOException { + public static Option<Transformer> createTransformer(Option<List<String>> classNamesOpt, Boolean isErrorTableWriterEnabled) throws IOException { try { - return classNamesOpt.map(classNames -> classNames.isEmpty() ? null : new ChainedTransformer(classNames)); + return classNamesOpt.map(classNames -> classNames.isEmpty() ? null : + isErrorTableWriterEnabled ? new ErrorTableAwareChainedTransformer(classNames) : new ChainedTransformer(classNames) + ); } catch (Throwable e) { throw new IOException("Could not load transformer class(es) " + classNamesOpt.get(), e); } diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java index 7d1d0758955..cbd19305e41 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java @@ -289,7 +289,6 @@ public class DeltaSync implements Serializable, Closeable { // Register User Provided schema first registerAvroSchemas(schemaProvider); -this.transformer = UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames)); this.metrics = (HoodieIngestionMetrics) ReflectionUtils.loadClass(cfg.ingestionMetricsClass, getHoodieClientConfig(this.schemaProvider)); this.hoodieMetrics = new HoodieMetrics(getHoodieClientConfig(this.schemaProvider)); @@ -306,6 +305,9 @@ public class DeltaSync implements Serializable, Closeable { this.formatAdapter = new SourceFormatAdapter( UtilHelpers.createSource(cfg.sourceClassName, props, jssc, sparkSession, schemaProvider, metrics), this.errorTableWriter, 
Option.of(props)); + +this.transformer = UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames), this.errorTableWriter.isPresent()); + } /** diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java index 881a9545461..76e7b030b6f 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java @@ -28,24 +28,29 @@ import org.apache.hudi.config.HoodieErrorTableConfig; import org.apache.hudi.exception.HoodieException; import org.apache.hadoop.fs.FileSystem; +import org.apac
[GitHub] [hudi] codope merged pull request #8520: [HUDI-6115] Hardening expectation of corruptRecordColumn in ChainedTransformer.
codope merged PR #8520: URL: https://github.com/apache/hudi/pull/8520 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a diff in pull request #8714: [HUDI-6212] Hudi spark 3.0.x adoption
yihua commented on code in PR #8714: URL: https://github.com/apache/hudi/pull/8714#discussion_r1198414998 ## .github/workflows/bot.yml: ## @@ -63,6 +63,10 @@ jobs: sparkProfile: "spark3.1" sparkModules: "hudi-spark-datasource/hudi-spark3.1.x" + - scalaProfile: "scala-2.12" Review Comment: Would be good to add bundle validation on Spark 3.0.x too in `validate-bundles` section ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala: ## @@ -114,58 +114,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo }) } - test("Test MergeInto with more than once update actions") { -withRecordType()(withTempDir {tmp => - val targetTable = generateTableName - spark.sql( -s""" - |create table ${targetTable} ( - | id int, - | name string, - | data int, - | country string, - | ts bigint - |) using hudi - |tblproperties ( - | type = 'cow', - | primaryKey = 'id', - | preCombineField = 'ts' - | ) - |partitioned by (country) - |location '${tmp.getCanonicalPath}/$targetTable' - |""".stripMargin) - spark.sql( -s""" - |merge into ${targetTable} as target - |using ( - |select 1 as id, 'lb' as name, 6 as data, 'shu' as country, 1646643193 as ts - |) source - |on source.id = target.id - |when matched then - |update set * - |when not matched then - |insert * - |""".stripMargin) - spark.sql( -s""" - |merge into ${targetTable} as target - |using ( - |select 1 as id, 'lb' as name, 5 as data, 'shu' as country, 1646643196 as ts - |) source - |on source.id = target.id - |when matched and source.data > target.data then - |update set target.data = source.data, target.ts = source.ts - |when matched and source.data = 5 then - |update set target.data = source.data, target.ts = source.ts - |when not matched then - |insert * - |""".stripMargin) - - checkAnswer(s"select id, name, data, country, ts from $targetTable")( -Seq(1, "lb", 5, "shu", 1646643196L) - ) + /** + * For spark3.0.x didn't support 'UPDATE and DELETE can appear at 
most once in MATCHED clauses in a MERGE statement' + * details: org.apache.spark.sql.catalyst.parser.AstBuilder#visitMergeIntoTable Review Comment: ```suggestion * In Spark 3.0.x, UPDATE and DELETE can appear at most once in MATCHED clauses in a MERGE INTO statement. * Refer to: `org.apache.spark.sql.catalyst.parser.AstBuilder#visitMergeIntoTable` ``` ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/avro/TestAvroSerDe.scala: ## @@ -20,11 +20,12 @@ package org.apache.spark.sql.avro import org.apache.avro.generic.GenericData import org.apache.hudi.SparkAdapterSupport import org.apache.hudi.avro.model.{HoodieMetadataColumnStats, IntWrapper} +import org.apache.spark.internal.Logging import org.apache.spark.sql.avro.SchemaConverters.SchemaType import org.junit.jupiter.api.Assertions.assertEquals import org.junit.jupiter.api.Test -class TestAvroSerDe extends SparkAdapterSupport { +class TestAvroSerDe extends SparkAdapterSupport with Logging { Review Comment: Is this necessary for testing? 
## hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/HoodieSpark3CatalystExpressionUtils.scala: ## @@ -17,16 +17,9 @@ package org.apache.spark.sql -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression, Predicate, PredicateHelper} -import org.apache.spark.sql.execution.datasources.DataSourceStrategy - -trait HoodieSpark3CatalystExpressionUtils extends HoodieCatalystExpressionUtils - with PredicateHelper { - - override def normalizeExprs(exprs: Seq[Expression], attributes: Seq[Attribute]): Seq[Expression] = -DataSourceStrategy.normalizeExprs(exprs, attributes) - - override def extractPredicatesWithinOutputSet(condition: Expression, -outputSet: AttributeSet): Option[Expression] = -super[PredicateHelper].extractPredicatesWithinOutputSet(condition, outputSet) +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression} +abstract class HoodieSpark3CatalystExpressionUtils extends HoodieCatalystExpressionUtils { Review Comment: could we keep `with PredicateHelper`? ## hudi-spark-datasource/hudi-spark3.0.x/src/main/java/org/apache/spark/sql/execution/datasources/parquet/Spark30HoodieVectorizedParquetRecordReader.java: ## @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache So
[GitHub] [hudi] fujianhua168 commented on issue #8754: [SUPPORT] PrestoDB encountered data quality issues while reading the Hudi Mor table.
fujianhua168 commented on issue #8754: URL: https://github.com/apache/hudi/issues/8754#issuecomment-1553872378 > @fujianhua168 This is a known issue that log files are not read during compaction by the connector. I am working on the fix and will put up a patch early next week. It should be fixed in the next Presto release. However, for Trino, currently we don't support MOR snapshot query. It's still under review. First of all, thank you very much for your contribution. In fact, Trino's support for snapshot queries is urgently needed by us (note: many data developers in our company use Trino). Currently, we are unable to query the Hudi MOR table through Trino, and we hope that Trino support can be prioritized. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
hudi-bot commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1553861029 ## CI report: * 2b0ddb3813e46f5f71a357f1fc2191801b17beb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17191) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8752: [HUDI-6236] write hive_style_partitioning_enable to table config in D…
hudi-bot commented on PR #8752: URL: https://github.com/apache/hudi/pull/8752#issuecomment-1553800662 ## CI report: * 7762747d22f8ffade79936aa3465db3ce89045db UNKNOWN * f9b5f2d4727ffabc20e3e28e78e49009f6fa221e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17189) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553767852 ## CI report: * a3122900f5c45636d4199da29276f240776fba73 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17197) * 3cafa50dd40057e2df678d5936e5b926f4ee77f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17199) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8753: [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
hudi-bot commented on PR #8753: URL: https://github.com/apache/hudi/pull/8753#issuecomment-1553767810 ## CI report: * 45446c0cbf27d46589e8de1e9cc66221c420e353 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17188) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8760: [HUDI-6238] Disabling clustering for single file group
hudi-bot commented on PR #8760: URL: https://github.com/apache/hudi/pull/8760#issuecomment-1553761798 ## CI report: * 6df809a86f0678a952b496eb95e0d5715ca7c401 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553761751 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193) * a3122900f5c45636d4199da29276f240776fba73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17197) * 3cafa50dd40057e2df678d5936e5b926f4ee77f8 UNKNOWN
[GitHub] [hudi] nsivabalan opened a new pull request, #8760: [HUDI-6238] Disabling clustering for single file group
nsivabalan opened a new pull request, #8760: URL: https://github.com/apache/hudi/pull/8760 ### Change Logs When there is only one file group in a given partition, we should avoid clustering regardless of whether sorting is enabled. Even if the data within the single file group is not sorted, re-sorting it gains nothing: the column stats (total valid values, min, and max) remain identical before and after sorting. So, even when sorting is enabled, clustering should not be triggered when the file group count is just 1. ### Impact Helps avoid repeated clustering of partitions that have a single file group. ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
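The single-file-group rule in the Change Logs above can be sketched as a small scheduling guard. This is a minimal illustration, not Hudi's actual clustering-plan API; the class and method names are hypothetical:

```java
import java.util.List;

public class ClusteringGuard {
    // Sketch: skip clustering for a partition with only one file group, even
    // when sorting is enabled, because re-sorting a single file group leaves
    // its column stats (value counts, min, max) unchanged.
    // The sortEnabled flag is accepted but deliberately does NOT override the
    // single-group skip, mirroring the PR's reasoning.
    static boolean shouldScheduleClustering(List<String> fileGroupIds, boolean sortEnabled) {
        if (fileGroupIds.size() <= 1) {
            return false; // nothing to merge and nothing to gain from sorting
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldScheduleClustering(List.of("fg-1"), true));          // false
        System.out.println(shouldScheduleClustering(List.of("fg-1", "fg-2"), false)); // true
    }
}
```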
[jira] [Created] (HUDI-6238) Avoid clustering when there is only one file slice
sivabalan narayanan created HUDI-6238: - Summary: Avoid clustering when there is only one file slice Key: HUDI-6238 URL: https://issues.apache.org/jira/browse/HUDI-6238 Project: Apache Hudi Issue Type: Improvement Components: clustering Reporter: sivabalan narayanan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8759: Add metrics counters for compaction start/stop events.
hudi-bot commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1553725623 ## CI report: * fbdd1d299bdf653c65f21c374e0aada9b768318f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17198)
[GitHub] [hudi] hudi-bot commented on pull request #8759: Add metrics counters for compaction start/stop events.
hudi-bot commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1553719273 ## CI report: * fbdd1d299bdf653c65f21c374e0aada9b768318f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8714: [HUDI-6212] Hudi spark 3.0.x adoption
hudi-bot commented on PR #8714: URL: https://github.com/apache/hudi/pull/8714#issuecomment-1553710804 ## CI report: * b3da8cccadddc1cc95c08ad0643a763726a9a010 UNKNOWN * 8dbee823426fe3a74d68084e1c47aedc90939a7a UNKNOWN * 722fae1d1a8717873ed89b87fb08b7c74fa3ccf5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17187)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
nsivabalan commented on code in PR #8758: URL: https://github.com/apache/hudi/pull/8758#discussion_r1198316621 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkMetadataTableRecordIndex.java: ## @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.index; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.data.HoodieData; +import org.apache.hudi.common.data.HoodiePairData; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.model.HoodieAvroRecord; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordGlobalLocation; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.common.util.collection.ImmutablePair; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.data.HoodieJavaPairRDD; +import org.apache.hudi.data.HoodieJavaRDD; +import org.apache.hudi.exception.HoodieIndexException; +import org.apache.hudi.exception.TableNotFoundException; +import org.apache.hudi.metadata.HoodieTableMetadata; +import org.apache.hudi.metadata.HoodieTableMetadataUtil; +import org.apache.hudi.metadata.MetadataPartitionType; +import org.apache.hudi.table.HoodieTable; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.function.PairFlatMapFunction; +import org.apache.spark.sql.execution.PartitionIdPassthrough; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import scala.Tuple2; + +import java.util.ArrayList; +import java.util.Iterator; +import java.util.List; +import java.util.Map; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN; + +/** + * Hoodie Index implementation backed by the record index present in the Metadata Table. 
+ */ +public class SparkMetadataTableRecordIndex extends HoodieIndex { + + private static final Logger LOG = LoggerFactory.getLogger(SparkMetadataTableRecordIndex.class); + + public SparkMetadataTableRecordIndex(HoodieWriteConfig config) { +super(config); + } + + @Override + public <R> HoodieData<HoodieRecord<R>> tagLocation(HoodieData<HoodieRecord<R>> records, HoodieEngineContext context, HoodieTable hoodieTable) throws HoodieIndexException { +int fileGroupSize; +try { + ValidationUtils.checkState(hoodieTable.getMetaClient().getTableConfig().isMetadataPartitionEnabled(MetadataPartitionType.RECORD_INDEX)); + fileGroupSize = HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(hoodieTable.getMetaClient(), (HoodieTableFileSystemView) hoodieTable.getFileSystemView(), + MetadataPartitionType.RECORD_INDEX.getPartitionPath()).size(); + ValidationUtils.checkState(fileGroupSize > 0, "Record index should have at least one file group"); +} catch (TableNotFoundException | IllegalStateException e) { + // This means that record index has not been initialized. Fallback to another index so that tagLocation is still accurate and there are no duplicates. + // Fallback index needs to be a global index like record index. + HoodieIndex.IndexType fallbackIndexType = IndexType.SIMPLE; + LOG.warn(String.format("Record index not initialized so falling back to %s for tagging records", fallbackIndexType.name())); + HoodieWriteConfig otherConfig = HoodieWriteConfig.newBuilder().withProperties(config.getProps()) + .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(fallbackIndexType).build()).build(); + HoodieIndex fallbackIndex = SparkHoodieIndexFactory.createIndex(otherConfig); + return fallbackIndex.tagLocation(records, context, hoodieTable); +} + +// final variable required for lambda functions below +final int numFileGroups = fileGroupSize; +
[GitHub] [hudi] amrishlal opened a new pull request, #8759: Add metrics counters for compaction start/stop events.
amrishlal opened a new pull request, #8759: URL: https://github.com/apache/hudi/pull/8759 ### Change Logs Add metrics counters for compaction start/stop events so that we can keep track of how many compactions were requested, how many finished, and how many errored (inferred as `number of starts - number of finished`). ### Impact No user API or performance impact expected. ### Risk level (write none, low medium or high below) Low ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
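The counting scheme described in the Change Logs above — inferring errored runs as `number of starts - number of finished` — can be sketched as follows. The class and method names are hypothetical; Hudi's actual metrics are emitted through its metrics registry:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CompactionMetrics {
    // Sketch: one counter bumped at compaction start, one at compaction
    // finish. Runs that started but never finished (errored or still in
    // flight) are inferred as the difference between the two counters.
    private final AtomicLong started = new AtomicLong();
    private final AtomicLong finished = new AtomicLong();

    void onCompactionStart()  { started.incrementAndGet(); }
    void onCompactionFinish() { finished.incrementAndGet(); }

    long inferredNotFinished() { return started.get() - finished.get(); }

    public static void main(String[] args) {
        CompactionMetrics m = new CompactionMetrics();
        m.onCompactionStart();
        m.onCompactionStart();
        m.onCompactionFinish();
        System.out.println(m.inferredNotFinished()); // 1
    }
}
```

Note this inference cannot distinguish an errored compaction from one that is still running; a separate failure counter would be needed to tell them apart.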
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553650530 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193) * a3122900f5c45636d4199da29276f240776fba73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17197)
[GitHub] [hudi] hudi-bot commented on pull request #8638: added new exception types
hudi-bot commented on PR #8638: URL: https://github.com/apache/hudi/pull/8638#issuecomment-1553649953 ## CI report: * c8cf2d86b1be30d3215b3b6e89b8bda33a1fe5dc UNKNOWN * 333d9faa53e71ba535a7cb8c60ce8b350a33452c UNKNOWN * 6898285dd6ddba725ae33b73aa92afa02beb98a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17090) * aa35b5562c16840b5ebf143009beac2c291de2c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17196)
[GitHub] [hudi] hudi-bot commented on pull request #8638: added new exception types
hudi-bot commented on PR #8638: URL: https://github.com/apache/hudi/pull/8638#issuecomment-1553638442 ## CI report: * c8cf2d86b1be30d3215b3b6e89b8bda33a1fe5dc UNKNOWN * 333d9faa53e71ba535a7cb8c60ce8b350a33452c UNKNOWN * 6898285dd6ddba725ae33b73aa92afa02beb98a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17090) * aa35b5562c16840b5ebf143009beac2c291de2c9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553639447 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193) * a3122900f5c45636d4199da29276f240776fba73 UNKNOWN
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
rmahindra123 commented on code in PR #8574: URL: https://github.com/apache/hudi/pull/8574#discussion_r1198260778 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java: ## @@ -45,4 +47,9 @@ public interface Transformer { */ @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE) Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties); + + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + default Option transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) { +return Option.empty(); Review Comment: 1. Switch to StructType instead of Avro? @rmahindra123 will confirm 2. Default should infer schema using spark plan
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1553587888 ## CI report: * 7be538f4045a42ba33ab8fe62594178e2db75bbb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17100) * cc0da2372d50d99c98c2ce4bcbe5a60303bde938 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17195)
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
rmahindra123 commented on code in PR #8574: URL: https://github.com/apache/hudi/pull/8574#discussion_r1198256724 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java: ## @@ -93,9 +103,13 @@ public List getTransformersNames() { @Override public Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties) { Dataset dataset = rowDataset; +Option incomingSchemaOpt = sourceSchemaOpt; for (TransformerInfo transformerInfo : transformers) { Transformer transformer = transformerInfo.getTransformer(); dataset = transformer.apply(jsc, sparkSession, dataset, transformerInfo.getProperties(properties)); + if (enableSchemaValidation) { +incomingSchemaOpt = validateAndGetTransformedSchema(transformerInfo, dataset, incomingSchemaOpt, jsc, sparkSession, properties); Review Comment: Implement the new interface for chained transformer and validate before the dataset apply is called. Validation should be in the new interface instead of within the apply method.
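The review suggestion above — thread the schema through each transformer's schema hook and validate the whole chain before any `apply` runs — can be sketched as below. Schemas are modeled as plain `String`s instead of Spark `StructType`/Avro `Schema`, and all names are hypothetical, not the PR's actual interface:

```java
import java.util.List;
import java.util.Optional;

public class ChainedSchemaCheck {
    // Each transformer declares the schema it would produce for a given
    // incoming schema, or empty if it cannot (sketch of the proposed
    // transformedSchema() default method).
    interface SchemaAwareTransformer {
        Optional<String> transformedSchema(String incoming);
    }

    // Walk the chain up front: feed each transformer the previous output
    // schema. If any link cannot declare its output, validation bails out
    // before any data is transformed.
    static Optional<String> validateChain(String sourceSchema, List<SchemaAwareTransformer> chain) {
        String current = sourceSchema;
        for (SchemaAwareTransformer t : chain) {
            Optional<String> out = t.transformedSchema(current);
            if (out.isEmpty()) {
                return Optional.empty();
            }
            current = out.get();
        }
        return Optional.of(current);
    }

    public static void main(String[] args) {
        SchemaAwareTransformer addCol = in -> Optional.of(in + "+extra_col");
        System.out.println(validateChain("base", List.of(addCol, addCol)).get());
    }
}
```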
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1553577735 ## CI report: * 7be538f4045a42ba33ab8fe62594178e2db75bbb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17100) * cc0da2372d50d99c98c2ce4bcbe5a60303bde938 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8520: [HUDI-6115] Hardening expectation of corruptRecordColumn in ChainedTransformer.
hudi-bot commented on PR #8520: URL: https://github.com/apache/hudi/pull/8520#issuecomment-1553567992 ## CI report: * 0ad850ba8e43e954d5a83ffb6a4e68bc3e3dd68b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17185)
[hudi] branch master updated (9ef7bd8a675 -> cfa02f2dd99)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 9ef7bd8a675 [HUDI-5394] Fix tests for RowCustomColumnsSortPartitioner (#8741) add cfa02f2dd99 [HUDI-6228] Re-enable tests that were flaky before (#8733) No new revisions were added by this update. Summary of changes: .../apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java | 3 --- 1 file changed, 3 deletions(-)
[GitHub] [hudi] yihua commented on pull request #8733: [HUDI-6228] Re-enable tests that were flaky before
yihua commented on PR #8733: URL: https://github.com/apache/hudi/pull/8733#issuecomment-1553522777 > ## CI report: > * [d8d8926](https://github.com/apache/hudi/commit/d8d892647691c4bdf9f7bb78313db328517d7552) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17145) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17159) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17170) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17175) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17184) > > Bot commands The fourth run failed due to other flaky tests in `hudi-common` and a memory issue on the Azure worker. The fifth run failed due to a flaky Spark test. Neither failure is caused by the re-enabled tests.
[GitHub] [hudi] nbalajee commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
nbalajee commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198202805 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. Outdated data is useless yet costly to keep, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi; users can configure the policies directly via table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +A TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themselves. As the scale of installations grows, it becomes more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Applying policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. Users want to manage partition TTL not only by expiry time but also by sub-partition count and sub-partition size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
Review Comment: When retiring the old/unused/not-accessed partitions, another approach we are taking internally is: (a) stash the partitions to be cleaned up in a .stashedForDeletion folder (at the .hoodie level). (b) Partitions stashed for deletion wait in the folder for a week (or a time dictated by the policy) before actually getting deleted. In cases where we realize that something has been accidentally deleted (like a bad policy configuration, TTL exclusion not configured, etc.), we can always move it back from the stash to quickly recover from the TTL event. (c) We shall configure policies for .stashedForDeletion// subfolders to manage the appropriate tiering level (whether to move to a warm/cold tier, etc.). (d) In addition to the deletePartitions() API, which would stash the folder (instead of deleting) based on the configs, we would need a restore API to move the subfolders/files back to their original location. (e) Metadata left by the delete operation should be synced with the MDT to keep the file-listing metadata in sync with the file system. (In cases where replication to a different region is supported, this also warrants applying the changes to the replicated copies of the data.)
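The stash-then-delete flow in steps (a)-(d) above can be sketched with plain filesystem moves. This is a minimal local-filesystem sketch; the folder layout and helper names are assumptions from the comment, not Hudi's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StashedDeletion {
    // (a) Instead of deleting an expired partition outright, move it under
    // .hoodie/.stashedForDeletion, where it waits out the grace period.
    static Path stash(Path tableBase, Path partition) throws IOException {
        Path stashDir = tableBase.resolve(".hoodie").resolve(".stashedForDeletion");
        Files.createDirectories(stashDir);
        return Files.move(partition, stashDir.resolve(partition.getFileName()));
    }

    // (d) Restore API: move a stashed partition back to its original location,
    // undoing an accidental TTL event.
    static Path restore(Path tableBase, String partitionName) throws IOException {
        Path stashed = tableBase.resolve(".hoodie").resolve(".stashedForDeletion").resolve(partitionName);
        return Files.move(stashed, tableBase.resolve(partitionName));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("table");
        Path part = Files.createDirectory(base.resolve("2023-05-01"));
        stash(base, part);
        System.out.println(Files.exists(base.resolve("2023-05-01"))); // false: moved to stash
        restore(base, "2023-05-01");
        System.out.println(Files.exists(base.resolve("2023-05-01"))); // true: recovered
    }
}
```

A real implementation would also need step (e): recording the move so the metadata table's file listings stay consistent with the filesystem.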
[GitHub] [hudi] hudi-bot commented on pull request #8733: [HUDI-6228] Re-enable tests that were flaky before
hudi-bot commented on PR #8733: URL: https://github.com/apache/hudi/pull/8733#issuecomment-1553502569 ## CI report: * d8d892647691c4bdf9f7bb78313db328517d7552 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17145) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17159) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17170) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17175) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17184)
[GitHub] [hudi] hudi-bot commented on pull request #8748: [HUDI-6234] make sure clean is run after flink table service
hudi-bot commented on PR #8748: URL: https://github.com/apache/hudi/pull/8748#issuecomment-1553489437 ## CI report: * bf27bd0b77a73d9e0f101e6d37947c21a86bf47c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17182)
[GitHub] [hudi] yihua commented on issue #8278: [SUPPORT] Deltastreamer Fails with AWSDmsAvroPayload
yihua commented on issue #8278: URL: https://github.com/apache/hudi/issues/8278#issuecomment-1553383695 On my side, I verified that I no longer see the exception using the same script you shared after the fix (I was hitting the issue before the fix).
[GitHub] [hudi] yihua commented on issue #8278: [SUPPORT] Deltastreamer Fails with AWSDmsAvroPayload
yihua commented on issue #8278: URL: https://github.com/apache/hudi/issues/8278#issuecomment-1553382861 Hi @Hans-Raintree sorry for the delay. The merged fix #8690 on master should fix your issue. Could you give it a try? The fix is included in the upcoming 0.13.1 release.
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553360233 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193)