[GitHub] [hudi] voonhous commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
voonhous commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1554103648 @danny0405 Can you please help take a look at this PR again? I added more tests to the PR. Thank you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] voonhous commented on a diff in pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
voonhous commented on code in PR #8755: URL: https://github.com/apache/hudi/pull/8755#discussion_r1198612870 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/StatsFileSizeProcedure.scala: ## @@ -54,8 +55,22 @@ class StatsFileSizeProcedure extends BaseProcedure with ProcedureBuilder { val globRegex = getArgValueOrDefault(args, parameters(1)).get.asInstanceOf[String] val limit: Int = getArgValueOrDefault(args, parameters(2)).get.asInstanceOf[Int] val basePath = getBasePath(table) -val fs = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build.getFs -val globPath = String.format("%s/%s/*", basePath, globRegex) +val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build +val fs = metaClient.getFs +val isTablePartitioned = metaClient.getTableConfig.isTablePartitioned +val maximumPartitionDepth = if (isTablePartitioned) metaClient.getTableConfig.getPartitionFields.get.length else 0 +val globPath = (metaClient.getTableConfig.isTablePartitioned, globRegex) match { Review Comment: Done!
[GitHub] [hudi] hudi-bot commented on pull request #8759: Add metrics counters for compaction start/stop events.
hudi-bot commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1554092885 ## CI report: * fbdd1d299bdf653c65f21c374e0aada9b768318f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17198) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan opened a new pull request, #8764: [HUDI-6240] Adding default value as CORRECTED for rebase modes in write and read for avro
nsivabalan opened a new pull request, #8764: URL: https://github.com/apache/hudi/pull/8764 ### Change Logs Adding default value as "CORRECTED" for rebase modes in write and read for avro, to be used when encountering timestamps older than 1970. ### Impact Will automatically work out of the box, unless the user prefers to override them. ### Risk level (write none, low medium or high below) low. ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
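For context, the Spark-side rebase-mode properties involved look like the following. The exact config keys this PR wires the CORRECTED default into are in the PR diff, so treat this spark-submit fragment as an illustrative assumption rather than Hudi's actual defaults:

```shell
spark-submit \
  --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar
```

With CORRECTED, pre-1970 (and pre-Gregorian) timestamps are written and read as-is in the Proleptic Gregorian calendar instead of failing or being rebased to the legacy hybrid calendar.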
svn commit: r61966 - /release/hudi/KEYS
Author: yihua Date: Fri May 19 06:41:55 2023 New Revision: 61966 Log: Add GPG key of zhangyue19921010 Modified: release/hudi/KEYS Modified: release/hudi/KEYS == --- release/hudi/KEYS (original) +++ release/hudi/KEYS Fri May 19 06:41:55 2023 @@ -1170,4 +1170,63 @@ bTekUhOFAo/Xl12LSY0Wv5c7YEWWgbFH9qfKg5sr =pbmq -END PGP PUBLIC KEY BLOCK- +pub rsa4096 2023-05-09 [SC] [expires: 2027-05-09] + FE450FF74ABF3594AF8920603A46066B5E92F6B2 +uid [ultimate] zhangyue19921010 +sig 33A46066B5E92F6B2 2023-05-09 zhangyue19921010 +sub rsa4096 2023-05-09 [E] [expires: 2027-05-09] +sig 3A46066B5E92F6B2 2023-05-09 zhangyue19921010 + +-BEGIN PGP PUBLIC KEY BLOCK- + +mQINBGRZ5SgBEADGZQ6Ro00rzJJCKNINKfDsl4a0Jam21q1pA8mtMyzx/rSjUSbt +UTqta5im8KgUdDtJAmPxzxF/97az/SpMHEEfT+csgd+xxHuFBNkpFpgEwIty9djC +NjHgJb7pk83YeBiAblN3aMovFkUx/PotTxlWvqq6vWEp09K0I9V4zE4aYdWlwizJ +/ZAVxDvqSSH2sDCBvk7bJC2lMn42+Bb8i/M/8C+9MXXZGOe8HQZsAt1B9HEbtOV4 +nVytMhVlnmKoVlbtzHV8BPfoPNc7sriT5vM1WcqZoxIFclK9x01m32QeyOxNO5fH +euh7etB+OFpG6yoOf/ml5sgfq/njpVaNrUtd43b/c/fpW9pXAkeXYvr7XpNWsBCr +wN9XevDzuYZTk6HdDxU8XIYOuCJrCEtOcZBdhrRb5m9t2KF67ZtxFof5W4MUBcFp +ow6IEAh46syqJDAqg3zRD7G+wB8kJuOpD9yqDqk74PJ0EFA5Ib+ngiPPYalaFb+i +wCtPuekzbq075H+D2PM9XPqKNmnJNuKg+sJhRmwLForyzG9zi/oUtG37DMvoxwcD +3k715BUh0475dvV3xqcjb1vPCAw/JPW7iX9lS+k8L+9Z0TZk9tvzv6gYMpqYwS4j +RRLuBjzV9Et1hAg5ZQNHw2AGhKaWeaWA3GJzOZl4v8+irjAFu1rQLtd5RQARAQAB +tC56aGFuZ3l1ZTE5OTIxMDEwIDx6aGFuZ3l1ZTE5OTIxMDEwQGFwYWNoZS5vcmc+ +iQJUBBMBCAA+FiEE/kUP90q/NZSviSBgOkYGa16S9rIFAmRZ5SgCGwMFCQeGHekF +CwkIBwIGFQoJCAsCBBYCAwECHgECF4AACgkQOkYGa16S9rL5pQ//d9Rx71AdLMq7 +6tsPBQOpuF5IZTxDjU7iX3nC4V6/IKDBHwXJgaUA0NJlDk5IsegxPLnsVTXsVioe +u1hLljoLkYKEqkyKSSHG+RJEHgwrMXE1L4w9mIrZ/r4rnOcUfXEeIlgh+LhLN2wi +Uia9T9zsjP6yMWcAtkTZNdLx0hwf2qZ+gZgS13C6sMvGVT8lhqSKGXFiTA3pLya2 +Ambuxwf6EL4NqCxmt6qUQZDAqJjlPTpLHpNgPJtYl8i2l3h3S1L94MLgJL0IzFx9 +9g7PoicyvtstG4R44g1NE6N0kHfDkGQeqHAdDMrFrvIeGGTOst1PincoWA4SQPqy +RM5CdcU0+JlhlCdVTEkqP7UksWHVrRcsg/n5uFaJPNyfLkDe3d35we0qatchliQd 
+7wOM/ufTIBPmz0OjE2pU9wv9KuOdkIRkR1iROrYVH8kgymr4GI1xrVdyr+M3GPoz +VIUM7a6VWl8ZRW71WisiTE1z4i0WaRpvZ7HprOpvlzFpNO/4ZnOb53iQV3/XduH+ +LN8VuOOiFvhsEYVkRUazMvY5UjjuIL3gNpdeMArT8TdQxgyhINVnfH2iLq4F9ZGk +ZAO7jfE/HqGDzSXT9StnHok8RJGaB0+smboZHKvV3JduvYlSROBJhCkUKpJBgSzi +CFmzVkN345GQqvd0MOW4ejPKspoteL+5Ag0EZFnlKAEQANtA3VfPzrYFan3mbr54 ++3/7RW759w3Gb1ICVHB4aFv5QQI7+CUHn5zq346YmY2wcxm3QQfF2Prp0NsXLuHp +aMGlmalhNYUfiAjmBoagou+N3fraV88xwLN5bnYwT+/20/x3ZHHPMpMzphLteTK6 +HhE4kvez67IHpPkBNlKsz91Cl28BDsqN/F4oWcTHkTfI7hXiXJ5tx7t/BjvaAWhA +lEybEfTdJMu5jO968DwrDYBWxH991fC9kCsnu6T7TIn4oSId+Jp9MDTVCSsOBQON +67+dEh4tt4FHGFCHImg5lQJEii4no0l2jKAMEqptc7TlWwUkdgxlEG/VI33MFLY3 +svlsPpBhxqqM38ytNluc/tsmUIkJMfq9neH6IxGQOuc6FXtW0e/WUsHW15QO8wEO +xLU0Q/TpB9C5/ghL1N12teu9jCA8GYzwFjd9cgiBGEpTYcCRVik+K92LYJOj/NSg +fdJDcNPeetYbnRZUTe6wMlv320nuPy+KzIhWowSSUesBLxLUuCxqOACiomNUPYMN +g1xN7dgBeWKA3Pagu2iGEKcuuXC/r2UvwxPpZ5ceIF2dISBySdZa+NFjoR/aAUGY +XgLSKzmRSERaBbuiuaIn69Kvy5swfvi2GDrQAWDjraCEwUKHfG+JZAgH6mI6/RvI +O/eiShsC47+20Zbs+bQjrkbdABEBAAGJAjwEGAEIACYWIQT+RQ/3Sr81lK+JIGA6 +RgZrXpL2sgUCZFnlKAIbDAUJB4Yd6QAKCRA6RgZrXpL2shn2D/9KBgFECwjJes96 +u/7/Xymrc2SPw1nYaJHCn0KZmCb/3E106dDvQkscR7y5FAw8+/HkV4qjc4Cw1Ewg +XFOPr78XvHMDGwV54T5Qf8CFYq2qQhYkgTNFEwpWKt6uCQq9dtGhEn6to8lzNWD+ +IcfY+XV7uvZUP5DUbB6GhhpQ04YYComRT+QS2v6ERzrV9Yp8Qdlv6JeUjFJgi2zm +ON20SQy9Ami+tTOHheQ7yrCn+cc1nicAllZuYDf4anzQJqGw/aFqqdXYcna67eBn +mxkNoypZNgc0aLqaWrwqg21UKGmglHw516uFJTpzD/V9Xg6hI3rk80bYmNoHfH/s +SMxhkgIRSHYHVc83HlB3DvAzPSfWWtJKvrlXyyHTaIjXewnQdmF+gdCCcVaBHVZf +hgeP/Ah5rL9ig7c1Jbh4iSToFKfeYP0CTe0B4FwS2uRZnzlPevQGD2c+2eH+pmZp +mbR/av6r2QgjC0XjIPSJ/I0WZKmgeEO+c2ZEoktryWUFCA8kgH8kvIa/4LWtF8R7 +lVb3dhPPv/E2S0JUl83D2vXOeSiWV5uQexmIKjJR2i4/sgY4po0osgw22+Rnl3Uo +0oqAy4DlB31qKt9CVlrj7tNutSe3ZUYznm91e/EIWPsjOoqzV7KtTlzClOO544Jc +fljHpxb7KmeYv1gjxcM+kcLyZ/89Cw== +=rzoK +-END PGP PUBLIC KEY BLOCK-
[jira] [Created] (HUDI-6240) Add default values for rebase modes for avro to handle older timestamps
sivabalan narayanan created HUDI-6240: - Summary: Add default values for rebase modes for avro to handle older timestamps Key: HUDI-6240 URL: https://issues.apache.org/jira/browse/HUDI-6240 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: sivabalan narayanan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] eyjian commented on issue #8757: [SUPPORT] How to get a row of a primary key?
eyjian commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554076777 > You have to use upsert only to use partial update. So with Spark sql you must use merge into or update as Insert will act as insert operationType for which hudi doesn't guarantee uniqueness. [By default, if preCombineKey is provided, insert into uses upsert as the type of write operation, otherwise it uses insert](https://hudi.apache.org/cn/docs/quick-start-guide), but even adding "hoodie.datasource.write.operation = 'upsert'" has no effect.
[GitHub] [hudi] eyjian commented on issue #8757: [SUPPORT] How to get a row of a primary key?
eyjian commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554065728 Create table: ```sql CREATE TABLE `test_db`.`t21` ( `_hoodie_commit_time` STRING, `_hoodie_commit_seqno` STRING, `_hoodie_record_key` STRING, `_hoodie_partition_path` STRING, `_hoodie_file_name` STRING, `ut` STRING, `pk` BIGINT, `f0` BIGINT, `f1` BIGINT, `f2` BIGINT, `f3` BIGINT, `f4` BIGINT, `ds` BIGINT) USING hudi PARTITIONED BY (ds) TBLPROPERTIES ( 'hoodie.bucket.index.num.buckets' = '2', 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload', 'hoodie.index.type' = 'BUCKET', 'primaryKey' = 'pk', 'type' = 'mor', 'preCombineField' = 'ut', 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload', 'hoodie.archive.merge.enable' = 'true'); ```
[GitHub] [hudi] ad1happy2go commented on issue #8757: [SUPPORT] How to get a row of a primary key?
ad1happy2go commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554065593 You have to use upsert to get partial-update behavior. So with Spark SQL you must use MERGE INTO or UPDATE, as INSERT will act as the insert operation type, for which Hudi doesn't guarantee uniqueness.
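A hedged sketch of the MERGE INTO form being recommended, against the `t21` table defined earlier in this issue (the source row and column choices here are illustrative, and Hudi's Spark SQL MERGE support carries its own constraints documented on the Hudi website):

```sql
MERGE INTO test_db.t21 AS t
USING (
  SELECT 1L AS pk, 20230519L AS ds, '2023-05-19 01:02:03' AS ut, 100L AS f0
) AS s
ON t.pk = s.pk AND t.ds = s.ds
WHEN MATCHED THEN UPDATE SET t.ut = s.ut, t.f0 = s.f0
```

Because this routes through the upsert path, the record key stays unique and `PartialUpdateAvroPayload` can merge the touched columns with the existing row.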
[GitHub] [hudi] eyjian commented on issue #8757: [SUPPORT] How to get a row of a primary key?
eyjian commented on issue #8757: URL: https://github.com/apache/hudi/issues/8757#issuecomment-1554061049 > Did you try update or merge into clause? Thank you. I will try UPDATE and MERGE, but I need upsert to update a wide table. Each row is in a different parquet file.
[GitHub] [hudi] hudi-bot commented on pull request #8638: added new exception types
hudi-bot commented on PR #8638: URL: https://github.com/apache/hudi/pull/8638#issuecomment-1554050386 ## CI report: * c8cf2d86b1be30d3215b3b6e89b8bda33a1fe5dc UNKNOWN * 333d9faa53e71ba535a7cb8c60ce8b350a33452c UNKNOWN * aa35b5562c16840b5ebf143009beac2c291de2c9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17196)
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1554045388 ## CI report: * c8e2c682741b9364ed44c6c70cd3962404daa1e1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17204) * 1958203e67af53e5deca919e91208388bfde257c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17208)
[GitHub] [hudi] xushiyan commented on pull request #7469: [HUDI-5386] Cleaning conflicts when write concurrency mode is OCC
xushiyan commented on PR #7469: URL: https://github.com/apache/hudi/pull/7469#issuecomment-1554043487 @LinMingQiang would you rebase master and resolve the conflicts please?
[GitHub] [hudi] xushiyan commented on pull request #8200: [MINOR] hoodie.datasource.write.row.writer.enable should set to be true.
xushiyan commented on PR #8200: URL: https://github.com/apache/hudi/pull/8200#issuecomment-1554024122 > > Oh, I got it, the default value in config is true. But I think it will not lead to differences in the sorting results > > You can test it; if the value is false, it will create a RDDCustomColumnsSortPartitioner whose class description is "A partitioner that does sorting based on specified column values for each RDD partition." Both RDDCustomColumnsSortPartitioner and RowCustomColumnsSortPartitioner should sort globally. If you observe a sorting issue, then it's a different bug to be fixed. Flipping this default value here is irrelevant to the sorting issue.
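The distinction under debate can be illustrated outside Spark: sorting each partition independently only yields a globally sorted result if the data was first range-partitioned on the sort key. A minimal Java sketch (not Hudi code; the partition contents are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortScope {
    // Sort each "partition" independently, as a per-RDD-partition sorter would,
    // then concatenate the partitions in their original order.
    static List<Integer> sortWithinPartitions(List<List<Integer>> partitions) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            List<Integer> copy = new ArrayList<>(p);
            Collections.sort(copy);
            out.addAll(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two partitions whose key ranges overlap, i.e. NOT range-partitioned.
        List<List<Integer>> parts = Arrays.asList(
            Arrays.asList(5, 1, 9),
            Arrays.asList(4, 2, 8));
        // Per-partition sort: each run is ordered, but the whole is not.
        System.out.println(sortWithinPartitions(parts)); // [1, 5, 9, 2, 4, 8]
        // Global sort for comparison.
        List<Integer> global = new ArrayList<>();
        parts.forEach(global::addAll);
        Collections.sort(global);
        System.out.println(global); // [1, 2, 4, 5, 8, 9]
    }
}
```

So whether the two partitioners agree comes down to whether the preceding repartition step range-distributes rows by the sort columns, which is the point xushiyan is making: a discrepancy would be a partitioning bug, not a reason to flip the default.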
[jira] [Updated] (HUDI-4370) Support JsonConverter in Kafka Connect sink
[ https://issues.apache.org/jira/browse/HUDI-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-4370: - Fix Version/s: 0.14.0 (was: 1.0.0) > Support JsonConverter in Kafka Connect sink > --- > > Key: HUDI-4370 > URL: https://issues.apache.org/jira/browse/HUDI-4370 > Project: Apache Hudi > Issue Type: New Feature > Components: kafka-connect >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Currently, "org.apache.kafka.connect.json.JsonConverter" is not supported. > We need to hook up the logic for converting the json String to Avro record > like StringConverter.
[jira] [Updated] (HUDI-4388) Structured streaming improvements in Hudi streaming Source and Sink
[ https://issues.apache.org/jira/browse/HUDI-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-4388: - Fix Version/s: 0.14.0 > Structured streaming improvements in Hudi streaming Source and Sink > -- > > Key: HUDI-4388 > URL: https://issues.apache.org/jira/browse/HUDI-4388 > Project: Apache Hudi > Issue Type: Epic >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 0.14.0, 1.0.0 > > > All improvements to structured streaming with HoodieStreamingSink and > HoodieStreamSource captured in this epic.
[jira] [Closed] (HUDI-3940) Lock manager does not increment retry count upon exception
[ https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-3940. - Resolution: Fixed > Lock manager does not increment retry count upon exception > -- > > Key: HUDI-3940 > URL: https://issues.apache.org/jira/browse/HUDI-3940 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1, 0.14.0, 0.12.3, 0.13.0, 0.12.1, 0.12.0 > > > Came up while debugging CI failure: > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8
[jira] [Updated] (HUDI-3940) Lock manager does not increment retry count upon exception
[ https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-3940: -- Fix Version/s: 0.13.1 0.12.3 0.13.0 0.12.1 0.12.0 > Lock manager does not increment retry count upon exception > -- > > Key: HUDI-3940 > URL: https://issues.apache.org/jira/browse/HUDI-3940 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0, 0.12.1, 0.13.0, 0.13.1, 0.12.3, 0.14.0 > > > Came up while debugging CI failure: > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8200: [MINOR] hoodie.datasource.write.row.writer.enable should set to be true.
nsivabalan commented on code in PR #8200: URL: https://github.com/apache/hudi/pull/8200#discussion_r1198551786 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java: ## @@ -108,7 +108,7 @@ public HoodieWriteMetadata> performClustering(final Hood Stream> writeStatusesStream = FutureUtils.allOf( clusteringPlan.getInputGroups().stream() .map(inputGroup -> { - if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) { + if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", true)) { Review Comment: Let's also consider issues like https://github.com/apache/hudi/issues/8259 before we make this the default.
[GitHub] [hudi] bvaradar commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
bvaradar commented on PR #8303: URL: https://github.com/apache/hudi/pull/8303#issuecomment-1554000815 @jonvex : Is this ready for review?
[jira] [Updated] (HUDI-3940) Lock manager does not increment retry count upon exception
[ https://issues.apache.org/jira/browse/HUDI-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3940: - Fix Version/s: 0.14.0 (was: 1.0.0) > Lock manager does not increment retry count upon exception > -- > > Key: HUDI-3940 > URL: https://issues.apache.org/jira/browse/HUDI-3940 > Project: Apache Hudi > Issue Type: Bug >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Came up while debugging CI failure: > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=8198&view=logs&j=3272dbb2-0925-5f35-bae7-04e75ae62175&t=e3c8a1bc-8efe-5852-1800-3bd561aebfc8
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1553999096 ## CI report: * 8bec3af536b80ec5838556f1337d13f06251b0ea Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17178) * c8e2c682741b9364ed44c6c70cd3962404daa1e1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17204) * 1958203e67af53e5deca919e91208388bfde257c UNKNOWN
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
rmahindra123 commented on code in PR #8574: URL: https://github.com/apache/hudi/pull/8574#discussion_r1198546865 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java: ## @@ -93,9 +105,17 @@ public List getTransformersNames() { @Override public Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties) { Dataset dataset = rowDataset; +Option incomingSchemaOpt = sourceSchemaOpt; +if (!sourceSchemaOpt.isPresent()) { Review Comment: nit: sourceSchemaOpt -> incomingSchemaOpt
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8747: [HUDI-6233] Fix table client conf in AlterTableCommand
Zouxxyy commented on code in PR #8747: URL: https://github.com/apache/hudi/pull/8747#discussion_r1198545658 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTable.scala: ## @@ -200,6 +201,13 @@ class TestAlterTable extends HoodieSparkSqlTestBase { checkAnswer(s"select id, name, price, ts, dt from $tableName2")( Seq(1, "a1", 10.0, 1000, null) ) + +if (HoodieSparkUtils.gteqSpark3_1) { + withSQLConf("hoodie.schema.on.read.enable" -> "true") { Review Comment: @danny0405 AlterTableCommand only works on Spark DataSource V2, which is controlled by `hoodie.schema.on.read.enable`
[GitHub] [hudi] danny0405 commented on a diff in pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
danny0405 commented on code in PR #8755: URL: https://github.com/apache/hudi/pull/8755#discussion_r1198544786 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/StatsFileSizeProcedure.scala: ## @@ -54,8 +55,22 @@ class StatsFileSizeProcedure extends BaseProcedure with ProcedureBuilder { val globRegex = getArgValueOrDefault(args, parameters(1)).get.asInstanceOf[String] val limit: Int = getArgValueOrDefault(args, parameters(2)).get.asInstanceOf[Int] val basePath = getBasePath(table) -val fs = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build.getFs -val globPath = String.format("%s/%s/*", basePath, globRegex) +val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build +val fs = metaClient.getFs +val isTablePartitioned = metaClient.getTableConfig.isTablePartitioned +val maximumPartitionDepth = if (isTablePartitioned) metaClient.getTableConfig.getPartitionFields.get.length else 0 +val globPath = (metaClient.getTableConfig.isTablePartitioned, globRegex) match { Review Comment: ```suggestion val globPath = (isTablePartitioned, globRegex) match { ```
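For readers skimming the thread: the bug being fixed is that an empty glob regex used to produce a malformed glob like `basePath//*`, so the patch falls back to a glob derived from the table's partition depth. A rough Java sketch of that idea (method names and the exact fallback are hypothetical simplifications of the Scala in `StatsFileSizeProcedure.scala`):

```java
public class GlobPathBuilder {
    // Hypothetical simplification of the fix: when the caller passes an empty
    // glob regex, build one wildcard level per partition column plus one for
    // the data files, instead of formatting "basePath//*".
    static String buildGlobPath(String basePath, String globRegex, int partitionDepth) {
        if (globRegex == null || globRegex.isEmpty()) {
            StringBuilder sb = new StringBuilder(basePath);
            for (int i = 0; i <= partitionDepth; i++) {
                sb.append("/*");
            }
            return sb.toString();
        }
        // Non-empty regex keeps the original behavior.
        return String.format("%s/%s/*", basePath, globRegex);
    }

    public static void main(String[] args) {
        System.out.println(buildGlobPath("/tmp/t1", "", 0));       // /tmp/t1/*
        System.out.println(buildGlobPath("/tmp/t1", "", 2));       // /tmp/t1/*/*/*
        System.out.println(buildGlobPath("/tmp/t1", "2023/*", 2)); // /tmp/t1/2023/*/*
    }
}
```

The partition depth comes from `getPartitionFields.get.length` in the actual patch, which is why the non-partitioned case reduces to a single `/*`.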
[GitHub] [hudi] BruceKellan commented on a diff in pull request #7561: [HUDI-5477] Optimize timeline loading in Hudi sync client
BruceKellan commented on code in PR #7561: URL: https://github.com/apache/hudi/pull/7561#discussion_r1198544567 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java: ## @@ -210,11 +210,30 @@ public static HoodieDefaultTimeline getTimeline(HoodieTableMetaClient metaClient return activeTimeline; } + /** + * Returns a Hudi timeline with commits after the given instant time (exclusive). + * + * @param metaClient{@link HoodieTableMetaClient} instance. + * @param exclusiveStartInstantTime Start instant time (exclusive). + * @return Hudi timeline. + */ + public static HoodieTimeline getCommitsTimelineAfter( + HoodieTableMetaClient metaClient, String exclusiveStartInstantTime) { +HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); +HoodieDefaultTimeline timeline = +activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime) +? metaClient.getArchivedTimeline(exclusiveStartInstantTime) +.mergeTimeline(activeTimeline) +: activeTimeline; +return timeline.getCommitsTimeline() +.findInstantsAfter(exclusiveStartInstantTime, Integer.MAX_VALUE); + } Review Comment: @yihua I have a doubt, since rollback and commit are archived separately, is it possible that there is a very early rollback instant, causing `activeTimeline.isBeforeTimelineStarts(exclusiveStartInstantTime)` to return false?
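BruceKellan's question concerns the branch that decides whether the archived timeline must be merged in before filtering. A toy Java model of that branch (instants simplified to sorted timestamp strings, "before timeline starts" modeled as "earlier than the first active instant"; this is not Hudi's actual API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TimelineAfter {
    // If the exclusive start instant predates the active timeline, merge the
    // archived timeline in before filtering; otherwise the active timeline
    // alone already contains every instant after the start.
    static List<String> commitsAfter(List<String> archived, List<String> active, String start) {
        boolean beforeActiveStart = active.isEmpty() || start.compareTo(active.get(0)) < 0;
        Stream<String> timeline = beforeActiveStart
            ? Stream.concat(archived.stream(), active.stream())
            : active.stream();
        return timeline.filter(t -> t.compareTo(start) > 0)
                       .sorted()
                       .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> archived = Arrays.asList("001", "002");
        List<String> active = Arrays.asList("003", "004");
        System.out.println(commitsAfter(archived, active, "001")); // [002, 003, 004]
        System.out.println(commitsAfter(archived, active, "003")); // [004]
    }
}
```

The doubt raised above maps to this model directly: if a stray early instant (e.g. an old rollback) lingers in the active timeline, the `beforeActiveStart` check can evaluate to false and the archived commits would be skipped.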
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1553993821 ## CI report: * cc0da2372d50d99c98c2ce4bcbe5a60303bde938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17195)
[GitHub] [hudi] danny0405 commented on a diff in pull request #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
danny0405 commented on code in PR #8762: URL: https://github.com/apache/hudi/pull/8762#discussion_r1198543483 ## hudi-common/src/main/java/org/apache/hudi/exception/HoodieInvalidInstantException.java: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.exception; + +/** + * Exception thrown for invalid instants whose name doesn't follow instant name format. + */ +public class HoodieInvalidInstantException extends HoodieException { + + public HoodieInvalidInstantException(String msg) { +super(msg); + } + Review Comment: Not a fan of checked exception, just give a more detailed exception msg should be fine.
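As a sketch of the unchecked-exception style being suggested, with the diagnostic detail carried in the message rather than forcing callers to catch a checked type (the exception here extends `RuntimeException` directly, and the 17-digit instant format check is an illustrative assumption, not Hudi's actual validation code):

```java
public class InvalidInstantDemo {
    // Unchecked: callers are not forced to declare or catch it, but the
    // message alone tells them what was wrong and what was expected.
    static class InvalidInstantException extends RuntimeException {
        InvalidInstantException(String instant) {
            super("Invalid instant '" + instant + "': expected format yyyyMMddHHmmssSSS");
        }
    }

    static long parseInstant(String instant) {
        if (!instant.matches("\\d{17}")) {
            throw new InvalidInstantException(instant);
        }
        return Long.parseLong(instant);
    }

    public static void main(String[] args) {
        try {
            parseInstant("not-an-instant");
        } catch (InvalidInstantException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The trade-off: a checked exception makes mishandling a compile error, while the unchecked style keeps timeline-parsing call sites clean and relies on the message for debuggability, which is the preference expressed in the review.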
[GitHub] [hudi] danny0405 commented on a diff in pull request #8747: [HUDI-6233] Fix table client conf in AlterTableCommand
danny0405 commented on code in PR #8747: URL: https://github.com/apache/hudi/pull/8747#discussion_r1198542011 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTable.scala: ## @@ -200,6 +201,13 @@ class TestAlterTable extends HoodieSparkSqlTestBase { checkAnswer(s"select id, name, price, ts, dt from $tableName2")( Seq(1, "a1", 10.0, 1000, null) ) + +if (HoodieSparkUtils.gteqSpark3_1) { + withSQLConf("hoodie.schema.on.read.enable" -> "true") { Review Comment: Can you explain a little more why we need this?
[jira] [Updated] (HUDI-3049) Use flink table name as default synced hive table name
[ https://issues.apache.org/jira/browse/HUDI-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3049: - Fix Version/s: 0.14.0 > Use flink table name as default synced hive table name > -- > > Key: HUDI-3049 > URL: https://issues.apache.org/jira/browse/HUDI-3049 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Danny Chen >Priority: Major > Fix For: 0.14.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3409) Expose Timeline Server Metrics
[ https://issues.apache.org/jira/browse/HUDI-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3409: - Fix Version/s: 0.14.0 (was: 1.0.0) > Expose Timeline Server Metrics > -- > > Key: HUDI-3409 > URL: https://issues.apache.org/jira/browse/HUDI-3409 > Project: Apache Hudi > Issue Type: Improvement > Components: timeline-server >Reporter: DarAmani Swift >Assignee: Rajesh >Priority: Major > Labels: new-to-hudi > Fix For: 0.14.0 > > > Timeline server metrics are pushed to local registry but never going to > reporters. Exposing these metrics would greatly improve debugging latency > around async processes and timeline server syncs. > Metrics are already captured in the [Request > Handler|https://github.com/apache/hudi/blob/master/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java#L527-L531]
[jira] [Closed] (HUDI-6208) Fix jetty conflicts in the packaging process
[ https://issues.apache.org/jira/browse/HUDI-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6208. Fix Version/s: 0.14.0 Resolution: Fixed Fixed via master branch: 0d55c9d4a93957b0cbdbc4e7a6b3cf79e8d348fe > Fix jetty conflicts in the packaging process > > > Key: HUDI-6208 > URL: https://issues.apache.org/jira/browse/HUDI-6208 > Project: Apache Hudi > Issue Type: Bug > Components: timeline-server >Affects Versions: 0.14.0 > Environment: hudi-master >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Attachments: image-2023-05-15-09-48-18-179.png > > > !image-2023-05-15-09-48-18-179.png! > > > [[HUDI-6208]Fix jetty conflicts in the packaging process by eric9204 · Pull > Request #8706 · apache/hudi > (github.com)|https://github.com/apache/hudi/pull/8706]
[hudi] branch master updated (0b87e143cfe -> 0d55c9d4a93)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 0b87e143cfe [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520) add 0d55c9d4a93 [HUDI-6208] Fix jetty conflicts in the packaging process (#8706) No new revisions were added by this update. Summary of changes: hudi-timeline-service/pom.xml | 6 ++ 1 file changed, 6 insertions(+)
[GitHub] [hudi] danny0405 merged pull request #8706: [HUDI-6208] Fix jetty conflicts in the packaging process
danny0405 merged PR #8706: URL: https://github.com/apache/hudi/pull/8706
[GitHub] [hudi] danny0405 commented on pull request #8760: [HUDI-6238] Disabling clustering for single file group
danny0405 commented on PR #8760: URL: https://github.com/apache/hudi/pull/8760#issuecomment-1553986355 > since the stats are going to remain intact before and after sorting (total valid values, min and max). So, even when sorting is enabled, we should not trigger clustering when file group count is just 1 I don't think so; the sorting is not for column stats, it is for query optimization. When the parquet file is sorted, fewer row groups need to be touched while filtering the columns by filter predicates.
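To illustrate the point above: when a file is sorted on a filtered column, each row group's min/max range is narrow, so a predicate can skip most groups; with scattered values, every group's range covers the predicate and nothing is skipped. The following standalone Java sketch is an illustration, not Hudi or Parquet API; it models row groups purely by per-group min/max stats.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruning {

  // Models a parquet-style row group by the min/max stats of one column.
  static record RowGroup(long min, long max) {}

  // Split a column's values into fixed-size "row groups" and record min/max per group.
  static List<RowGroup> rowGroups(long[] values, int groupSize) {
    List<RowGroup> groups = new ArrayList<>();
    for (int i = 0; i < values.length; i += groupSize) {
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      for (int j = i; j < Math.min(i + groupSize, values.length); j++) {
        min = Math.min(min, values[j]);
        max = Math.max(max, values[j]);
      }
      groups.add(new RowGroup(min, max));
    }
    return groups;
  }

  // A group must be scanned only if its [min, max] range can contain the predicate value.
  static long groupsToScan(List<RowGroup> groups, long target) {
    return groups.stream().filter(g -> g.min() <= target && target <= g.max()).count();
  }

  public static void main(String[] args) {
    long[] sorted = new long[100];
    long[] scattered = new long[100];
    for (int i = 0; i < 100; i++) {
      sorted[i] = i;
      scattered[i] = (i * 37) % 100; // same values, spread across all groups
    }
    // Sorted layout: the predicate value 42 falls inside exactly one group's range.
    System.out.println(groupsToScan(rowGroups(sorted, 10), 42L));    // 1
    // Scattered layout: every group's [min, max] covers 42, so nothing is skipped.
    System.out.println(groupsToScan(rowGroups(scattered, 10), 42L)); // 10
  }
}
```

The aggregate stats (count, global min, global max) are indeed identical before and after sorting, which is exactly why sorting only pays off at scan time, not in the stats themselves.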
[GitHub] [hudi] danny0405 commented on pull request #8745: [HUDI-6182] Hive sync use state transient time to avoid losing partit…
danny0405 commented on PR #8745: URL: https://github.com/apache/hudi/pull/8745#issuecomment-1553984735 Ping me again when it is ready to review.
[GitHub] [hudi] xushiyan commented on pull request #7359: [HUDI-3304] WIP - Allow selective partial update
xushiyan commented on PR #7359: URL: https://github.com/apache/hudi/pull/7359#issuecomment-1553974629 @bschell are you still working on this? the title says WIP
[GitHub] [hudi] hudi-bot commented on pull request #8763: [HUDI-6239] fix clustering pool scheduler conf not take effect bug
hudi-bot commented on PR #8763: URL: https://github.com/apache/hudi/pull/8763#issuecomment-1553970163 ## CI report: * 64e77789e493cf252accab22fec1267c9402009f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17206) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
hudi-bot commented on PR #8762: URL: https://github.com/apache/hudi/pull/8762#issuecomment-1553970142 ## CI report: * 9a2b1000c85524b5b541b4fc2d4d0b14eca30b44 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17205)
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1553970049 ## CI report: * 8bec3af536b80ec5838556f1337d13f06251b0ea Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17178) * c8e2c682741b9364ed44c6c70cd3962404daa1e1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17204)
[GitHub] [hudi] SteNicholas commented on pull request #8759: Add metrics counters for compaction start/stop events.
SteNicholas commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1553967689 @amrishlal, I don't think it's necessary to introduce the metrics in this pull request. The essence of the demand in the description is to monitor the different states of the compaction action. Therefore, we should introduce a metric for each state of the compaction action, updated in the state-change phase.
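What "a metric per compaction state" could look like in miniature: one counter keyed by the instant state (REQUESTED/INFLIGHT/COMPLETED, mirroring Hudi's timeline states), bumped on each state transition. This is an illustrative sketch only; Hudi's real metrics flow through HoodieMetrics and its registries, and the class and method names below are made up.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: one counter per compaction state, bumped at each transition.
public class CompactionStateMetrics {

  // Mirrors the instant states a compaction moves through on the Hudi timeline.
  public enum State { REQUESTED, INFLIGHT, COMPLETED }

  private final Map<State, LongAdder> counters = new EnumMap<>(State.class);

  public CompactionStateMetrics() {
    for (State s : State.values()) {
      counters.put(s, new LongAdder());
    }
  }

  // Call this whenever a compaction instant transitions into a new state.
  public void onTransition(State newState) {
    counters.get(newState).increment();
  }

  public long count(State state) {
    return counters.get(state).sum();
  }
}
```

Publishing per-state counters like these (rather than explicit start/stop events) gives the same monitoring signal while keeping the instrumentation in one place.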
[GitHub] [hudi] hudi-bot commented on pull request #8763: [HUDI-6239] fix clustering pool scheduler conf not take effect bug
hudi-bot commented on PR #8763: URL: https://github.com/apache/hudi/pull/8763#issuecomment-1553965685 ## CI report: * 64e77789e493cf252accab22fec1267c9402009f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
hudi-bot commented on PR #8762: URL: https://github.com/apache/hudi/pull/8762#issuecomment-1553965656 ## CI report: * 9a2b1000c85524b5b541b4fc2d4d0b14eca30b44 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8749: [HUDI-6235] Update and Delete statements for Flink
hudi-bot commented on PR #8749: URL: https://github.com/apache/hudi/pull/8749#issuecomment-1553965553 ## CI report: * 8bec3af536b80ec5838556f1337d13f06251b0ea Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17178) * c8e2c682741b9364ed44c6c70cd3962404daa1e1 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
hudi-bot commented on PR #8303: URL: https://github.com/apache/hudi/pull/8303#issuecomment-1553964669 ## CI report: * b8772a74388873c35b1a13ba6ef99ecda9246646 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17165) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17203)
[jira] [Updated] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kong Wei updated HUDI-6239: --- Status: In Progress (was: Open) > cluster-scheduling-weight and cluster-scheduling-minShare not take effect in > deltastreamer > -- > > Key: HUDI-6239 > URL: https://issues.apache.org/jira/browse/HUDI-6239 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: Kong Wei >Assignee: Kong Wei >Priority: Minor > Attachments: image-2023-05-19-11-04-45-541.png, > image-2023-05-19-11-05-41-056.png > > > The method > org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig > generates the spark scheduler conf for deltasync, compaction and clustering, > but the clustering scheduler conf does not take effect. > SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools > !image-2023-05-19-11-04-45-541.png! > while generateConfig takes 3 pools as parameters. > !image-2023-05-19-11-05-41-056.png! >
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198516655 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add some explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: +```java +public class HoodiePartitionTTLPolicy { + public enum TTLPolicy { +KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT + } + + // Partition spec for which the policy takes effect + private String partitionSpec; + + private TTLPolicy policy; + + private long policyValue; +} +``` + +### User Interface for TTL policy +Users can config partition TTL management policies through SparkSQL Call Command and through table config directly. Assume that the user has a hudi table partitioned by two fields(user_id, ts), he can config partition TTL policies as follows. + +```sql +-- Set default policy for all user_id, which keeps the data for 30 days. 
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=*/', policy => 'KEEP_BY_TIME', policyValue => '30'); Review Comment: @stream2000, could the `add_ttl_policy` procedure add the `type` to specify the ttl policy type, which value could be `partition` etc? ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. Ho
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r119851 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add some explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: +```java +public class HoodiePartitionTTLPolicy { + public enum TTLPolicy { +KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT + } + + // Partition spec for which the policy takes effect + private String partitionSpec; + + private TTLPolicy policy; + + private long policyValue; +} +``` + +### User Interface for TTL policy +Users can config partition TTL management policies through SparkSQL Call Command and through table config directly. Assume that the user has a hudi table partitioned by two fields(user_id, ts), he can config partition TTL policies as follows. + +```sql +-- Set default policy for all user_id, which keeps the data for 30 days. +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=*/', policy => 'KEEP_BY_TIME', policyValue => '30'); + +--For partition user_id=1/, keep 10 sub partitions. 
+call add_ttl_policy(table => 'test', partitionSpec => 'user_id=1/', policy => 'KEEP_BY_COUNT', policyValue => '10'); + +--For partition user_id=2/, keep 100GB data in total +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=2/', policy => 'KEEP_BY_SIZE', policyValue => '107374182400'); + +--For partition user_id=3/, keep the data for 7 day. +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=3/', policy => 'KEEP_BY_TIME', policyValue => '7'); + +-- Show all the TTL policies including default and explicit policies +call show_ttl_policies(table => 'test'); +user_id=*/ KEEP_BY_TIME30 +user_id=1/ KEEP_BY_COUNT 10 +user_id=2/ KEEP_BY_SIZE107374182400 +user_id=3/ KEEP_BY_TIME7 +``` + +### Storage for TTL policy +The partition TTL policies will be stored in `hoodie.properties`since it is part of table metadata. The policy configs in `hoodie.properties`are defined as follows. Explicit policies are defined using a JSON array while default policy is de
[GitHub] [hudi] hudi-bot commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark
hudi-bot commented on PR #8303: URL: https://github.com/apache/hudi/pull/8303#issuecomment-1553959688 ## CI report: * b8772a74388873c35b1a13ba6ef99ecda9246646 UNKNOWN
[GitHub] [hudi] waitingF opened a new pull request, #8763: [HUDI-6239] fix clustering pool scheduler conf not take effect bug
waitingF opened a new pull request, #8763: URL: https://github.com/apache/hudi/pull/8763 ### Change Logs The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect: SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools ![image](https://github.com/apache/hudi/assets/19326824/2673af27-e6b9-42cc-88ff-d3deab21793b) while generateConfig takes 3 pools as parameters. ![image](https://github.com/apache/hudi/assets/19326824/67ca6e43-3e49-4724-8e64-d69cb57d9991) ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
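The essence of the fix is that the scheduling template must declare one pool per job that is supposed to get its own pool. A hedged sketch of what a three-pool fair-scheduler allocation template could look like; the pool names, weights, and template string here are illustrative assumptions, not the actual constants in Hudi's SchedulerConfGenerator.

```java
public class SchedulerConfSketch {

  // Hypothetical template: the real constant in Hudi's SchedulerConfGenerator differs.
  // The bug described in the PR is a 2-pool template being filled by a 3-pool generator.
  private static final String SPARK_SCHEDULING_PATTERN =
      "<?xml version=\"1.0\"?>\n"
          + "<allocations>\n"
          + "  <pool name=\"%s\"><schedulingMode>FAIR</schedulingMode>"
          + "<weight>%s</weight><minShare>%s</minShare></pool>\n"
          + "  <pool name=\"%s\"><schedulingMode>FAIR</schedulingMode>"
          + "<weight>%s</weight><minShare>%s</minShare></pool>\n"
          + "  <pool name=\"%s\"><schedulingMode>FAIR</schedulingMode>"
          + "<weight>%s</weight><minShare>%s</minShare></pool>\n"
          + "</allocations>";

  // One (name, weight, minShare) triple per pool: deltasync, compaction, clustering.
  static String generateConfig(int dsWeight, int dsMinShare,
                               int cpWeight, int cpMinShare,
                               int clWeight, int clMinShare) {
    return String.format(SPARK_SCHEDULING_PATTERN,
        "hoodiedeltasync", dsWeight, dsMinShare,
        "hoodiecompact", cpWeight, cpMinShare,
        "hoodiecluster", clWeight, clMinShare);
  }

  public static void main(String[] args) {
    // The generated XML now contains a pool entry for the clustering job as well.
    System.out.println(generateConfig(1, 1, 2, 1, 3, 1));
  }
}
```

With only two `%s` pool slots in the template, the third triple passed to `String.format` is silently ignored, which matches the reported symptom of the clustering pool conf never taking effect.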
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198516655 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When sub partition count exceeds, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but to ensure that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. User need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add some explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: +```java +public class HoodiePartitionTTLPolicy { + public enum TTLPolicy { +KEEP_BY_TIME, KEEP_BY_SIZE, KEEP_BY_COUNT + } + + // Partition spec for which the policy takes effect + private String partitionSpec; + + private TTLPolicy policy; + + private long policyValue; +} +``` + +### User Interface for TTL policy +Users can config partition TTL management policies through SparkSQL Call Command and through table config directly. Assume that the user has a hudi table partitioned by two fields(user_id, ts), he can config partition TTL policies as follows. + +```sql +-- Set default policy for all user_id, which keeps the data for 30 days. +call add_ttl_policy(table => 'test', partitionSpec => 'user_id=*/', policy => 'KEEP_BY_TIME', policyValue => '30'); Review Comment: Could the `add_ttl_policy` procedure add the `type` to specify the ttl policy type, which value could be `partition` etc. 
## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users
[GitHub] [hudi] SteNicholas commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198516174 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. +2. **KEEP_BY_COUNT**. Keep N sub-partitions for a high-level partition. 
When the sub-partition count exceeds N, delete the partitions with smaller partition values until the sub-partition count meets the policy configuration. +3. **KEEP_BY_SIZE**. Similar to KEEP_BY_COUNT, but ensures that the sum of the data size of all sub-partitions does not exceed the policy configuration. +2. Users need to set different policies for different partitions. For example, the hudi table is partitioned by two fields (user_id, ts). For partition(user_id='1'), we set the policy to keep 100G data for all sub-partitions, and for partition(user_id='2') we set the policy that all sub-partitions will expire 10 days after their last modified time. +3. It's possible that there are a lot of high-level partitions in the user's table, and they don't want to set TTL policies for all the high-level partitions. So we need to provide a default policy mechanism so that users can set a default policy for all high-level partitions and add explicit policies for some of them if needed. Explicit policies will override the default policy. + +So here we have the TTL policy definition: ```java +public class HoodiePartitionTTLPolicy { Review Comment: Could we introduce a `HoodieTTLPolicy` interface? Then `HoodiePartitionTTLPolicy` implements `HoodieTTLPolicy`. `HoodieRecordTTLPolicy` could also implement this interface in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
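The reviewer's suggested hierarchy — a shared `HoodieTTLPolicy` interface that `HoodiePartitionTTLPolicy` (and a future `HoodieRecordTTLPolicy`) would implement — could look roughly like the sketch below. Only the type names and the three policy types come from the RFC and review; the fields and methods are assumptions for illustration, not the actual RFC class shape:

```java
// Hedged sketch of the interface hierarchy suggested in the review.
// Fields and method signatures are illustrative, not from the RFC.
public class TTLPolicySketch {

  // The three policy types named in the RFC.
  enum PolicyType { KEEP_BY_TIME, KEEP_BY_COUNT, KEEP_BY_SIZE }

  // Hypothetical shared interface (the reviewer's suggestion): partition-level
  // and future record-level policies would both implement it.
  interface HoodieTTLPolicy {
    PolicyType getPolicyType();
    long getPolicyValue(); // days, sub-partition count, or bytes, depending on type
  }

  // Partition-level policy from the RFC, as one implementation of the interface.
  static class HoodiePartitionTTLPolicy implements HoodieTTLPolicy {
    private final String partitionSpec; // e.g. "user_id=2"; empty for the default policy
    private final PolicyType policyType;
    private final long policyValue;

    HoodiePartitionTTLPolicy(String partitionSpec, PolicyType policyType, long policyValue) {
      this.partitionSpec = partitionSpec;
      this.policyType = policyType;
      this.policyValue = policyValue;
    }

    @Override public PolicyType getPolicyType() { return policyType; }
    @Override public long getPolicyValue() { return policyValue; }
    String getPartitionSpec() { return partitionSpec; }
  }

  public static void main(String[] args) {
    // The RFC's example: sub-partitions of user_id=2 expire 10 days after last modification.
    HoodieTTLPolicy policy = new HoodiePartitionTTLPolicy("user_id=2", PolicyType.KEEP_BY_TIME, 10);
    System.out.println(policy.getPolicyType() + " " + policy.getPolicyValue()); // prints KEEP_BY_TIME 10
  }
}
```

With this shape, a record-level `HoodieRecordTTLPolicy` would slot in as another implementation without changing callers that only consume `HoodieTTLPolicy`.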
[GitHub] [hudi] SteNicholas commented on pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
SteNicholas commented on PR #8062: URL: https://github.com/apache/hudi/pull/8062#issuecomment-1553957249 @stream2000, could we also introduce record TTL management? Partition TTL management and record TTL management both need the TTL policy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] boneanxs opened a new pull request, #8762: [HUDI-5517][FOLLOW-UP] Refine API names and ensure time travel won't affect by stateTransitionTime
boneanxs opened a new pull request, #8762: URL: https://github.com/apache/hudi/pull/8762 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ 1. Avoid having acronyms in the API 2. Give more context for the exception 3. Ensure time travel won't be affected by this ### Impact _Describe any public API or user-facing feature change or any performance impact._ none ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kong Wei updated HUDI-6239: --- Priority: Minor (was: Major) > cluster-scheduling-weight and cluster-scheduling-minShare not take effect in > deltastreamer > -- > > Key: HUDI-6239 > URL: https://issues.apache.org/jira/browse/HUDI-6239 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer > Reporter: Kong Wei > Assignee: Kong Wei > Priority: Minor > Attachments: image-2023-05-19-11-04-45-541.png, image-2023-05-19-11-05-41-056.png > > > The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the Spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect. > SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools: > !image-2023-05-19-11-04-45-541.png! > while generateConfig takes 3 pools as parameters: > !image-2023-05-19-11-05-41-056.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
hudi-bot commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1553940371 ## CI report: * 2b0ddb3813e46f5f71a357f1fc2191801b17beb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17191) * 9792b4220f6fc0700975a3883e19336d21457020 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17202) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kong Wei reassigned HUDI-6239: -- Assignee: Kong Wei > cluster-scheduling-weight and cluster-scheduling-minShare not take effect in > deltastreamer > -- > > Key: HUDI-6239 > URL: https://issues.apache.org/jira/browse/HUDI-6239 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer > Reporter: Kong Wei > Assignee: Kong Wei > Priority: Major > Attachments: image-2023-05-19-11-04-45-541.png, image-2023-05-19-11-05-41-056.png > > > The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the Spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect. > SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools: > !image-2023-05-19-11-04-45-541.png! > while generateConfig takes 3 pools as parameters: > !image-2023-05-19-11-05-41-056.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6239) cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer
Kong Wei created HUDI-6239: -- Summary: cluster-scheduling-weight and cluster-scheduling-minShare not take effect in deltastreamer Key: HUDI-6239 URL: https://issues.apache.org/jira/browse/HUDI-6239 Project: Apache Hudi Issue Type: Bug Components: deltastreamer Reporter: Kong Wei Attachments: image-2023-05-19-11-04-45-541.png, image-2023-05-19-11-05-41-056.png The method org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator#generateConfig generates the Spark scheduler conf for deltasync, compaction and clustering, but the clustering scheduler conf does not take effect. SPARK_SCHEDULING_PATTERN only contains 2 scheduler pools: !image-2023-05-19-11-04-45-541.png! while generateConfig takes 3 pools as parameters: !image-2023-05-19-11-05-41-056.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
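The mismatch HUDI-6239 describes (a pattern with slots for 2 pools, a config generated for 3) can be illustrated with a minimal sketch. The template string below is a hypothetical stand-in, not the real `SPARK_SCHEDULING_PATTERN` (which is a longer fairscheduler XML template), but the failure mode is the same: `String.format` silently ignores surplus arguments, so the clustering pool never reaches the generated conf.

```java
public class SchedulingPatternSketch {
  public static void main(String[] args) {
    // Hypothetical stand-in for SPARK_SCHEDULING_PATTERN with slots for only
    // two pools (deltasync and compaction).
    String twoPoolTemplate =
        "<allocations><pool name=\"%s\"/><pool name=\"%s\"/></allocations>";

    // generateConfig passes three pool names; java.util.Formatter ignores
    // arguments beyond the format specifiers, dropping the clustering pool.
    String conf = String.format(twoPoolTemplate,
        "hoodiedeltasync", "hoodiecompact", "hoodiecluster");

    System.out.println(conf.contains("hoodiecompact"));  // prints true
    System.out.println(conf.contains("hoodiecluster")); // prints false
  }
}
```

The fix implied by the report is simply to extend the template with a third pool slot so that the clustering weight/minShare arguments are actually consumed.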
[GitHub] [hudi] hudi-bot commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
hudi-bot commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1553936360 ## CI report: * 2b0ddb3813e46f5f71a357f1fc2191801b17beb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17191) * 9792b4220f6fc0700975a3883e19336d21457020 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] samserpoosh opened a new issue, #8761: [SUPPORT] "Illegal Lambda Deserialization" When Leveraging PostgresDebeziumSource
samserpoosh opened a new issue, #8761: URL: https://github.com/apache/hudi/issues/8761 ### Describe The Problem You Faced I'm trying to get Postgres CDC events published to Kafka by Debezium ingested into a partitioned Hudi table in S3. I'm currently testing this E2E data flow using a dummy and pretty simple DB table. When submitting the DeltaStreamer job, it throws the "**Illegal Lambda Deserialization**" exception. ### To Reproduce - My `spark-submit` command:
```
spark-submit \
  --jars "opt/spark/jars/hudi-utils-bundle.jar,..." \
  --master spark://:7077 \
  --total-executor-cores 1 \
  --executor-memory 4g \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.connection.maximum=1 \
  --conf spark.scheduler.mode=FAIR \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer opt/spark/jars/hudi-utils-bundle.jar \
  --table-type COPY_ON_WRITE \
  --target-base-path s3a://path/to/samser_customers \
  --target-table samser_customers \
  --min-sync-interval-seconds 30 \
  --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
  --payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
  --source-ordering-field _event_lsn \
  --op UPSERT \
  --continuous \
  --source-limit 5000 \
  --hoodie-conf bootstrap.servers=:9092 \
  --hoodie-conf group.id= \
  --hoodie-conf schema.registry.url=http://:8081 \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://:8081/subjects/-value/versions/1 \
  --hoodie-conf hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic= \
  --hoodie-conf auto.offset.reset=earliest \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=name \
  --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
  --hoodie-conf hoodie.datasource.write.precombine.field=_event_lsn \
  --hoodie-conf hoodie.metadata.enable=true \
  --hoodie-conf hoodie.metadata.index.column.stats.enable=true \
  --hoodie-conf hoodie.parquet.small.file.limit=134217728
```
- Relevant Debezium-PG configuration:
```
class: io.debezium.connector.postgresql.PostgresConnector
plugin.name: pgoutput
database.hostname:
database.port: 5432
database.user:
database.password:
database.dbname :
topic.prefix:
schema.include.list: public
key.converter: io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url: http://:8081
```
[GitHub] [hudi] c-f-cooper commented on issue #8651: [SUPPORT]How to resolve small file?
c-f-cooper commented on issue #8651: URL: https://github.com/apache/hudi/issues/8651#issuecomment-1553931467 ![image](https://github.com/apache/hudi/assets/25735549/8b816399-ede9-4a2a-97b5-d28e7ef3b1e4) ![023D4646-7D12-4606-8188-0F1A05DE47C5_1_102_o](https://github.com/apache/hudi/assets/25735549/af7586f4-5be8-4efd-a895-d56108c8c878) I found that the async clustering was scheduled, but not executed @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8747: [HUDI-6233] Fix table client conf in AlterTableCommand
hudi-bot commented on PR #8747: URL: https://github.com/apache/hudi/pull/8747#issuecomment-1553931308 ## CI report: * 72b2d6da4377e18a600857d9ae6eb2766c786c12 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17192) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8752: [HUDI-6236] write hive_style_partitioning_enable to table config in D…
hudi-bot commented on PR #8752: URL: https://github.com/apache/hudi/pull/8752#issuecomment-1553931378 ## CI report: * 7762747d22f8ffade79936aa3465db3ce89045db UNKNOWN * f9b5f2d4727ffabc20e3e28e78e49009f6fa221e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17189) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17201) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] voonhous commented on pull request #8752: [HUDI-6236] write hive_style_partitioning_enable to table config in D…
voonhous commented on PR #8752: URL: https://github.com/apache/hudi/pull/8752#issuecomment-1553924948 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] eric9204 commented on a diff in pull request #8706: [HUDI-6208] Fix jetty conflicts in the packaging process
eric9204 commented on code in PR #8706: URL: https://github.com/apache/hudi/pull/8706#discussion_r1198487611 ## hudi-timeline-service/pom.xml: ## @@ -87,6 +87,12 @@ kryo-shaded + + org.eclipse.jetty + jetty-util + ${jetty.version} Review Comment: Thanks for your help! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] boneanxs commented on a diff in pull request #7627: [HUDI-5517] HoodieTimeline support filter instants by state transition time
boneanxs commented on code in PR #7627: URL: https://github.com/apache/hudi/pull/7627#discussion_r1198483774 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestStreamingSource.scala: ## @@ -51,45 +53,46 @@ class TestStreamingSource extends StreamTest { withTempDir { inputDir => val tablePath = s"${inputDir.getCanonicalPath}/test_cow_stream" HoodieTableMetaClient.withPropertyBuilder() - .setTableType(COPY_ON_WRITE) - .setTableName(getTableName(tablePath)) - .setPayloadClassName(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.defaultValue) +.setTableType(COPY_ON_WRITE) +.setTableName(getTableName(tablePath)) + .setPayloadClassName(DataSourceWriteOptions.PAYLOAD_CLASS_NAME.defaultValue) .setPreCombineField("ts") - .initTable(spark.sessionState.newHadoopConf(), tablePath) +.initTable(spark.sessionState.newHadoopConf(), tablePath) addData(tablePath, Seq(("1", "a1", "10", "000"))) val df = spark.readStream .format("org.apache.hudi") +.option(DataSourceReadOptions.READ_BY_STATE_TRANSITION_TIME.key(), useTransitionTime) Review Comment: By default `useTransitionTime` is false, and this test covers the default commit instant time, while `TestStreamSourceReadByStateTransitionTime` extends this class and overrides `useTransitionTime` to true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 0b87e143cfe [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520) 0b87e143cfe is described below commit 0b87e143cfe237ddc005f610d208d1bf36432ba3 Author: harshal AuthorDate: Fri May 19 07:02:11 2023 +0530 [HUDI-6115] Adding hardening checks for transformer output schema for quarantine enabled/disabled (#8520) - Adds ERROR_TABLE_CURRUPT_RECORD_COL_NAME as a null value column if the error table is enabled for transformers and the column does not exist in the dataset. - Adds validation for ERROR_TABLE_CURRUPT_RECORD_COL_NAME column to be part of the transformer in cases of error table is enabled/disabled. --- .../org/apache/hudi/utilities/UtilHelpers.java | 7 +- .../hudi/utilities/deltastreamer/DeltaSync.java| 4 +- .../utilities/deltastreamer/ErrorTableUtils.java | 33 - .../utilities/transform/ChainedTransformer.java| 8 +- .../ErrorTableAwareChainedTransformer.java | 58 .../TestErrorTableAwareChainedTransformer.java | 150 + 6 files changed, 250 insertions(+), 10 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java index 721ba2eb9f4..16ed7eadc1f 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java @@ -61,6 +61,7 @@ import org.apache.hudi.utilities.schema.postprocessor.ChainedSchemaPostProcessor import org.apache.hudi.utilities.sources.Source; import org.apache.hudi.utilities.sources.processor.ChainedJsonKafkaSourcePostProcessor; import org.apache.hudi.utilities.sources.processor.JsonKafkaSourcePostProcessor; +import 
org.apache.hudi.utilities.transform.ErrorTableAwareChainedTransformer; import org.apache.hudi.utilities.transform.ChainedTransformer; import org.apache.hudi.utilities.transform.Transformer; @@ -190,9 +191,11 @@ public class UtilHelpers { } - public static Option<Transformer> createTransformer(Option<List<String>> classNamesOpt) throws IOException { + public static Option<Transformer> createTransformer(Option<List<String>> classNamesOpt, Boolean isErrorTableWriterEnabled) throws IOException { try { - return classNamesOpt.map(classNames -> classNames.isEmpty() ? null : new ChainedTransformer(classNames)); + return classNamesOpt.map(classNames -> classNames.isEmpty() ? null : + isErrorTableWriterEnabled ? new ErrorTableAwareChainedTransformer(classNames) : new ChainedTransformer(classNames) + ); } catch (Throwable e) { throw new IOException("Could not load transformer class(es) " + classNamesOpt.get(), e); } diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java index 7d1d0758955..cbd19305e41 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java @@ -289,7 +289,6 @@ public class DeltaSync implements Serializable, Closeable { // Register User Provided schema first registerAvroSchemas(schemaProvider); -this.transformer = UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames)); this.metrics = (HoodieIngestionMetrics) ReflectionUtils.loadClass(cfg.ingestionMetricsClass, getHoodieClientConfig(this.schemaProvider)); this.hoodieMetrics = new HoodieMetrics(getHoodieClientConfig(this.schemaProvider)); @@ -306,6 +305,9 @@ public class DeltaSync implements Serializable, Closeable { this.formatAdapter = new SourceFormatAdapter( UtilHelpers.createSource(cfg.sourceClassName, props, jssc, sparkSession, schemaProvider, metrics), this.errorTableWriter, 
Option.of(props)); + +this.transformer = UtilHelpers.createTransformer(Option.ofNullable(cfg.transformerClassNames), this.errorTableWriter.isPresent()); + } /** diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java index 881a9545461..76e7b030b6f 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/ErrorTableUtils.java @@ -28,24 +28,29 @@ import org.apache.hudi.config.HoodieErrorTableConfig; import org.apache.hudi.exception.HoodieException; import org.apache.hadoop.fs.FileSystem; +import org.apac
[GitHub] [hudi] codope merged pull request #8520: [HUDI-6115] Hardening expectation of corruptRecordColumn in ChainedTransformer.
codope merged PR #8520: URL: https://github.com/apache/hudi/pull/8520 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a diff in pull request #8714: [HUDI-6212] Hudi spark 3.0.x adoption
yihua commented on code in PR #8714: URL: https://github.com/apache/hudi/pull/8714#discussion_r1198414998 ## .github/workflows/bot.yml: ## @@ -63,6 +63,10 @@ jobs: sparkProfile: "spark3.1" sparkModules: "hudi-spark-datasource/hudi-spark3.1.x" + - scalaProfile: "scala-2.12" Review Comment: Would be good to add bundle validation on Spark 3.0.x too in `validate-bundles` section ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala: ## @@ -114,58 +114,65 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo }) } - test("Test MergeInto with more than once update actions") { -withRecordType()(withTempDir {tmp => - val targetTable = generateTableName - spark.sql( -s""" - |create table ${targetTable} ( - | id int, - | name string, - | data int, - | country string, - | ts bigint - |) using hudi - |tblproperties ( - | type = 'cow', - | primaryKey = 'id', - | preCombineField = 'ts' - | ) - |partitioned by (country) - |location '${tmp.getCanonicalPath}/$targetTable' - |""".stripMargin) - spark.sql( -s""" - |merge into ${targetTable} as target - |using ( - |select 1 as id, 'lb' as name, 6 as data, 'shu' as country, 1646643193 as ts - |) source - |on source.id = target.id - |when matched then - |update set * - |when not matched then - |insert * - |""".stripMargin) - spark.sql( -s""" - |merge into ${targetTable} as target - |using ( - |select 1 as id, 'lb' as name, 5 as data, 'shu' as country, 1646643196 as ts - |) source - |on source.id = target.id - |when matched and source.data > target.data then - |update set target.data = source.data, target.ts = source.ts - |when matched and source.data = 5 then - |update set target.data = source.data, target.ts = source.ts - |when not matched then - |insert * - |""".stripMargin) - - checkAnswer(s"select id, name, data, country, ts from $targetTable")( -Seq(1, "lb", 5, "shu", 1646643196L) - ) + /** + * For spark3.0.x didn't support 'UPDATE and DELETE can appear at 
most once in MATCHED clauses in a MERGE statement' + * details: org.apache.spark.sql.catalyst.parser.AstBuilder#visitMergeIntoTable Review Comment: ```suggestion * In Spark 3.0.x, UPDATE and DELETE can appear at most once in MATCHED clauses in a MERGE INTO statement. * Refer to: `org.apache.spark.sql.catalyst.parser.AstBuilder#visitMergeIntoTable` ``` ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/avro/TestAvroSerDe.scala: ## @@ -20,11 +20,12 @@ package org.apache.spark.sql.avro import org.apache.avro.generic.GenericData import org.apache.hudi.SparkAdapterSupport import org.apache.hudi.avro.model.{HoodieMetadataColumnStats, IntWrapper} +import org.apache.spark.internal.Logging import org.apache.spark.sql.avro.SchemaConverters.SchemaType import org.junit.jupiter.api.Assertions.assertEquals import org.junit.jupiter.api.Test -class TestAvroSerDe extends SparkAdapterSupport { +class TestAvroSerDe extends SparkAdapterSupport with Logging { Review Comment: Is this necessary for testing? 
## hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/HoodieSpark3CatalystExpressionUtils.scala: ## @@ -17,16 +17,9 @@ package org.apache.spark.sql -import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression, Predicate, PredicateHelper} -import org.apache.spark.sql.execution.datasources.DataSourceStrategy - -trait HoodieSpark3CatalystExpressionUtils extends HoodieCatalystExpressionUtils - with PredicateHelper { - - override def normalizeExprs(exprs: Seq[Expression], attributes: Seq[Attribute]): Seq[Expression] = -DataSourceStrategy.normalizeExprs(exprs, attributes) - - override def extractPredicatesWithinOutputSet(condition: Expression, -outputSet: AttributeSet): Option[Expression] = -super[PredicateHelper].extractPredicatesWithinOutputSet(condition, outputSet) +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression} +abstract class HoodieSpark3CatalystExpressionUtils extends HoodieCatalystExpressionUtils { Review Comment: could we keep `with PredicateHelper`? ## hudi-spark-datasource/hudi-spark3.0.x/src/main/java/org/apache/spark/sql/execution/datasources/parquet/Spark30HoodieVectorizedParquetRecordReader.java: ## @@ -0,0 +1,187 @@ +/* + * Licensed to the Apache So
[GitHub] [hudi] fujianhua168 commented on issue #8754: [SUPPORT] PrestoDB encountered data quality issues while reading the Hudi Mor table.
fujianhua168 commented on issue #8754: URL: https://github.com/apache/hudi/issues/8754#issuecomment-1553872378 > @fujianhua168 This is a known issue that log files are not read during compaction by the connector. I am working on the fix and will put up a patch early next week. It should be fixed in the next Presto release. However, for Trino, currently we don't support MOR snapshot query. It's still under review. First of all, thank you very much for your contribution. In fact, Trino's support for snapshot queries is urgently needed by us (note: many data developers in our company use Trino). Currently, we are unable to query the Hudi MOR table through Trino, and we hope that Trino support can be prioritized. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8755: [HUDI-6237] Fix call stats_file_sizes failure error due to empty glob…
hudi-bot commented on PR #8755: URL: https://github.com/apache/hudi/pull/8755#issuecomment-1553861029 ## CI report: * 2b0ddb3813e46f5f71a357f1fc2191801b17beb6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17191) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8752: [HUDI-6236] write hive_style_partitioning_enable to table config in D…
hudi-bot commented on PR #8752: URL: https://github.com/apache/hudi/pull/8752#issuecomment-1553800662 ## CI report: * 7762747d22f8ffade79936aa3465db3ce89045db UNKNOWN * f9b5f2d4727ffabc20e3e28e78e49009f6fa221e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17189) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553767852 ## CI report: * a3122900f5c45636d4199da29276f240776fba73 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17197) * 3cafa50dd40057e2df678d5936e5b926f4ee77f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17199) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8753: [HUDI-5095] Flink: Stores a special watermark(flag) to identify the current progress of writing data
hudi-bot commented on PR #8753: URL: https://github.com/apache/hudi/pull/8753#issuecomment-1553767810 ## CI report: * 45446c0cbf27d46589e8de1e9cc66221c420e353 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17188) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8760: [HUDI-6238] Disabling clustering for single file group
hudi-bot commented on PR #8760: URL: https://github.com/apache/hudi/pull/8760#issuecomment-1553761798 ## CI report: * 6df809a86f0678a952b496eb95e0d5715ca7c401 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553761751 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193) * a3122900f5c45636d4199da29276f240776fba73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17197) * 3cafa50dd40057e2df678d5936e5b926f4ee77f8 UNKNOWN
[GitHub] [hudi] nsivabalan opened a new pull request, #8760: [HUDI-6238] Disabling clustering for single file group
nsivabalan opened a new pull request, #8760: URL: https://github.com/apache/hudi/pull/8760 ### Change Logs When there is only one file group in a given partition, we should avoid clustering regardless of whether sorting is enabled. Even if the data within the single file group is not sorted, re-sorting it gains nothing: the column stats (total valid values, min, and max) remain identical before and after sorting. So, even when sorting is enabled, clustering should not be triggered when the file group count is just 1. ### Impact Helps avoid repeated clustering of partitions that have a single file group. ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
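The single-file-group rule in the Change Logs above can be sketched as a small scheduling guard. This is a minimal illustration, not Hudi's actual clustering-plan API; the class and method names are hypothetical:

```java
import java.util.List;

public class ClusteringGuard {
    // Sketch: skip clustering for a partition with only one file group, even
    // when sorting is enabled, because re-sorting a single file group leaves
    // its column stats (value counts, min, max) unchanged.
    // The sortEnabled flag is accepted but deliberately does NOT override the
    // single-group skip, mirroring the PR's reasoning.
    static boolean shouldScheduleClustering(List<String> fileGroupIds, boolean sortEnabled) {
        if (fileGroupIds.size() <= 1) {
            return false; // nothing to merge and nothing to gain from sorting
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldScheduleClustering(List.of("fg-1"), true));          // false
        System.out.println(shouldScheduleClustering(List.of("fg-1", "fg-2"), false)); // true
    }
}
```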
[jira] [Created] (HUDI-6238) Avoid clustering when there is only one file slice
sivabalan narayanan created HUDI-6238: - Summary: Avoid clustering when there is only one file slice Key: HUDI-6238 URL: https://issues.apache.org/jira/browse/HUDI-6238 Project: Apache Hudi Issue Type: Improvement Components: clustering Reporter: sivabalan narayanan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8759: Add metrics counters for compaction start/stop events.
hudi-bot commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1553725623 ## CI report: * fbdd1d299bdf653c65f21c374e0aada9b768318f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17198)
[GitHub] [hudi] hudi-bot commented on pull request #8759: Add metrics counters for compaction start/stop events.
hudi-bot commented on PR #8759: URL: https://github.com/apache/hudi/pull/8759#issuecomment-1553719273 ## CI report: * fbdd1d299bdf653c65f21c374e0aada9b768318f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8714: [HUDI-6212] Hudi spark 3.0.x adoption
hudi-bot commented on PR #8714: URL: https://github.com/apache/hudi/pull/8714#issuecomment-1553710804 ## CI report: * b3da8cccadddc1cc95c08ad0643a763726a9a010 UNKNOWN * 8dbee823426fe3a74d68084e1c47aedc90939a7a UNKNOWN * 722fae1d1a8717873ed89b87fb08b7c74fa3ccf5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17187)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
nsivabalan commented on code in PR #8758: URL: https://github.com/apache/hudi/pull/8758#discussion_r1198316621 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkMetadataTableRecordIndex.java: ## @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.index; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.data.HoodieData; +import org.apache.hudi.common.data.HoodiePairData; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.model.HoodieAvroRecord; +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordGlobalLocation; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.view.HoodieTableFileSystemView; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.common.util.collection.ImmutablePair; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.data.HoodieJavaPairRDD; +import org.apache.hudi.data.HoodieJavaRDD; +import org.apache.hudi.exception.HoodieIndexException; +import org.apache.hudi.exception.TableNotFoundException; +import org.apache.hudi.metadata.HoodieTableMetadata; +import org.apache.hudi.metadata.HoodieTableMetadataUtil; +import org.apache.hudi.metadata.MetadataPartitionType; +import org.apache.hudi.table.HoodieTable; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.function.PairFlatMapFunction; +import org.apache.spark.sql.execution.PartitionIdPassthrough; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import scala.Tuple2; + +import java.util.ArrayList; +import java.util.Iterator; +import java.util.List; +import java.util.Map; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN; + +/** + * Hoodie Index implementation backed by the record index present in the Metadata Table. 
+ */ +public class SparkMetadataTableRecordIndex extends HoodieIndex { + + private static final Logger LOG = LoggerFactory.getLogger(SparkMetadataTableRecordIndex.class); + + public SparkMetadataTableRecordIndex(HoodieWriteConfig config) { +super(config); + } + + @Override + public <R> HoodieData<HoodieRecord<R>> tagLocation(HoodieData<HoodieRecord<R>> records, HoodieEngineContext context, HoodieTable hoodieTable) throws HoodieIndexException { +int fileGroupSize; +try { + ValidationUtils.checkState(hoodieTable.getMetaClient().getTableConfig().isMetadataPartitionEnabled(MetadataPartitionType.RECORD_INDEX)); + fileGroupSize = HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(hoodieTable.getMetaClient(), (HoodieTableFileSystemView) hoodieTable.getFileSystemView(), + MetadataPartitionType.RECORD_INDEX.getPartitionPath()).size(); + ValidationUtils.checkState(fileGroupSize > 0, "Record index should have at least one file group"); +} catch (TableNotFoundException | IllegalStateException e) { + // This means that record index has not been initialized. Fallback to another index so that tagLocation is still accurate and there are no duplicates. + // Fallback index needs to be a global index like record index. + HoodieIndex.IndexType fallbackIndexType = IndexType.SIMPLE; + LOG.warn(String.format("Record index not initialized so falling back to %s for tagging records", fallbackIndexType.name())); + HoodieWriteConfig otherConfig = HoodieWriteConfig.newBuilder().withProperties(config.getProps()) + .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(fallbackIndexType).build()).build(); + HoodieIndex fallbackIndex = SparkHoodieIndexFactory.createIndex(otherConfig); + return fallbackIndex.tagLocation(records, context, hoodieTable); +} + +// final variable required for lambda functions below +final int numFileGroups = fileGroupSize; +
[GitHub] [hudi] amrishlal opened a new pull request, #8759: Add metrics counters for compaction start/stop events.
amrishlal opened a new pull request, #8759: URL: https://github.com/apache/hudi/pull/8759 ### Change Logs Add metrics counters for compaction start/stop events so that we can keep track of how many compactions were requested, how many finished, and how many errored (inferred as `number of starts - number of finished`). ### Impact No user API or performance impact expected. ### Risk level (write none, low medium or high below) Low ### Documentation Update None ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
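The counting scheme described in the Change Logs above — inferring errored runs as `number of starts - number of finished` — can be sketched as follows. The class and method names are hypothetical; Hudi's actual metrics are emitted through its metrics registry:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CompactionMetrics {
    // Sketch: one counter bumped at compaction start, one at compaction
    // finish. Runs that started but never finished (errored or still in
    // flight) are inferred as the difference between the two counters.
    private final AtomicLong started = new AtomicLong();
    private final AtomicLong finished = new AtomicLong();

    void onCompactionStart()  { started.incrementAndGet(); }
    void onCompactionFinish() { finished.incrementAndGet(); }

    long inferredNotFinished() { return started.get() - finished.get(); }

    public static void main(String[] args) {
        CompactionMetrics m = new CompactionMetrics();
        m.onCompactionStart();
        m.onCompactionStart();
        m.onCompactionFinish();
        System.out.println(m.inferredNotFinished()); // 1
    }
}
```

Note this inference cannot distinguish an errored compaction from one that is still running; a separate failure counter would be needed to tell them apart.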
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553650530 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193) * a3122900f5c45636d4199da29276f240776fba73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17197)
[GitHub] [hudi] hudi-bot commented on pull request #8638: added new exception types
hudi-bot commented on PR #8638: URL: https://github.com/apache/hudi/pull/8638#issuecomment-1553649953 ## CI report: * c8cf2d86b1be30d3215b3b6e89b8bda33a1fe5dc UNKNOWN * 333d9faa53e71ba535a7cb8c60ce8b350a33452c UNKNOWN * 6898285dd6ddba725ae33b73aa92afa02beb98a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17090) * aa35b5562c16840b5ebf143009beac2c291de2c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17196)
[GitHub] [hudi] hudi-bot commented on pull request #8638: added new exception types
hudi-bot commented on PR #8638: URL: https://github.com/apache/hudi/pull/8638#issuecomment-1553638442 ## CI report: * c8cf2d86b1be30d3215b3b6e89b8bda33a1fe5dc UNKNOWN * 333d9faa53e71ba535a7cb8c60ce8b350a33452c UNKNOWN * 6898285dd6ddba725ae33b73aa92afa02beb98a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17090) * aa35b5562c16840b5ebf143009beac2c291de2c9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553639447 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193) * a3122900f5c45636d4199da29276f240776fba73 UNKNOWN
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
rmahindra123 commented on code in PR #8574: URL: https://github.com/apache/hudi/pull/8574#discussion_r1198260778 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/Transformer.java: ## @@ -45,4 +47,9 @@ public interface Transformer { */ @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE) Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties); + + @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING) + default Option transformedSchema(JavaSparkContext jsc, SparkSession sparkSession, Schema incomingSchema, TypedProperties properties) { +return Option.empty(); Review Comment: 1. Switch to StructType instead of Avro? @rmahindra123 will confirm 2. Default should infer schema using spark plan
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1553587888 ## CI report: * 7be538f4045a42ba33ab8fe62594178e2db75bbb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17100) * cc0da2372d50d99c98c2ce4bcbe5a60303bde938 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17195)
[GitHub] [hudi] rmahindra123 commented on a diff in pull request #8574: [HUDI-6139] Add support for Transformer schema validation in DeltaStreamer
rmahindra123 commented on code in PR #8574: URL: https://github.com/apache/hudi/pull/8574#discussion_r1198256724 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java: ## @@ -93,9 +103,13 @@ public List getTransformersNames() { @Override public Dataset apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset rowDataset, TypedProperties properties) { Dataset dataset = rowDataset; +Option incomingSchemaOpt = sourceSchemaOpt; for (TransformerInfo transformerInfo : transformers) { Transformer transformer = transformerInfo.getTransformer(); dataset = transformer.apply(jsc, sparkSession, dataset, transformerInfo.getProperties(properties)); + if (enableSchemaValidation) { +incomingSchemaOpt = validateAndGetTransformedSchema(transformerInfo, dataset, incomingSchemaOpt, jsc, sparkSession, properties); Review Comment: Implement the new interface for chained transformer and validate before the dataset apply is called. Validation should be in the new interface instead of within the apply method.
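The review suggestion above — thread the schema through each transformer's schema hook and validate the whole chain before any `apply` runs — can be sketched as below. Schemas are modeled as plain `String`s instead of Spark `StructType`/Avro `Schema`, and all names are hypothetical, not the PR's actual interface:

```java
import java.util.List;
import java.util.Optional;

public class ChainedSchemaCheck {
    // Each transformer declares the schema it would produce for a given
    // incoming schema, or empty if it cannot (sketch of the proposed
    // transformedSchema() default method).
    interface SchemaAwareTransformer {
        Optional<String> transformedSchema(String incoming);
    }

    // Walk the chain up front: feed each transformer the previous output
    // schema. If any link cannot declare its output, validation bails out
    // before any data is transformed.
    static Optional<String> validateChain(String sourceSchema, List<SchemaAwareTransformer> chain) {
        String current = sourceSchema;
        for (SchemaAwareTransformer t : chain) {
            Optional<String> out = t.transformedSchema(current);
            if (out.isEmpty()) {
                return Optional.empty();
            }
            current = out.get();
        }
        return Optional.of(current);
    }

    public static void main(String[] args) {
        SchemaAwareTransformer addCol = in -> Optional.of(in + "+extra_col");
        System.out.println(validateChain("base", List.of(addCol, addCol)).get());
    }
}
```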
[GitHub] [hudi] hudi-bot commented on pull request #8684: [HUDI-6200] Enhancements to the MDT for improving performance of larger indexes.
hudi-bot commented on PR #8684: URL: https://github.com/apache/hudi/pull/8684#issuecomment-1553577735 ## CI report: * 7be538f4045a42ba33ab8fe62594178e2db75bbb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17100) * cc0da2372d50d99c98c2ce4bcbe5a60303bde938 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8520: [HUDI-6115] Hardening expectation of corruptRecordColumn in ChainedTransformer.
hudi-bot commented on PR #8520: URL: https://github.com/apache/hudi/pull/8520#issuecomment-1553567992 ## CI report: * 0ad850ba8e43e954d5a83ffb6a4e68bc3e3dd68b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17185)
[hudi] branch master updated (9ef7bd8a675 -> cfa02f2dd99)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 9ef7bd8a675 [HUDI-5394] Fix tests for RowCustomColumnsSortPartitioner (#8741) add cfa02f2dd99 [HUDI-6228] Re-enable tests that were flaky before (#8733) No new revisions were added by this update. Summary of changes: .../apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java | 3 --- 1 file changed, 3 deletions(-)
[GitHub] [hudi] yihua commented on pull request #8733: [HUDI-6228] Re-enable tests that were flaky before
yihua commented on PR #8733: URL: https://github.com/apache/hudi/pull/8733#issuecomment-1553522777 > ## CI report: > * [d8d8926](https://github.com/apache/hudi/commit/d8d892647691c4bdf9f7bb78313db328517d7552) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17145) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17159) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17170) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17175) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17184) > > Bot commands The fourth run failed due to other flaky tests in `hudi-common` and a memory issue on the Azure worker. The fifth run failed due to a flaky Spark test. Neither failure is caused by the re-enabled tests.
[GitHub] [hudi] nbalajee commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management
nbalajee commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1198202805 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. Outdated data is useless yet costly to keep, so we need a TTL (Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi; users can configure the policies directly via table configs or via call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +A TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themselves. As the scale of installations grows, it becomes more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Applying policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. Users want to manage partition TTL not only by expiry time but also by sub-partition count and sub-partition size. So we need to support the following three different TTL policy types. +1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
Review Comment: When retiring the old/unused/not-accessed partitions, another approach we are taking internally is: (a) stash the partitions to be cleaned up in a .stashedForDeletion folder (at the .hoodie level). (b) Partitions stashed for deletion wait in the folder for a week (or a time dictated by the policy) before actually getting deleted. In cases where we realize that something has been accidentally deleted (like a bad policy configuration, TTL exclusion not configured, etc.), we can always move it back from the stash to quickly recover from the TTL event. (c) We shall configure policies for .stashedForDeletion// subfolders to manage the appropriate tiering level (whether to move to a warm/cold tier, etc.). (d) In addition to the deletePartitions() API, which would stash the folder (instead of deleting) based on the configs, we would need a restore API to move the subfolders/files back to their original location. (e) Metadata left by the delete operation should be synced with the MDT to keep the file-listing metadata in sync with the file system. (In cases where replication to a different region is supported, this also warrants applying the changes to the replicated copies of the data.)
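The stash-then-delete flow in steps (a)-(d) above can be sketched with plain filesystem moves. This is a minimal local-filesystem sketch; the folder layout and helper names are assumptions from the comment, not Hudi's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StashedDeletion {
    // (a) Instead of deleting an expired partition outright, move it under
    // .hoodie/.stashedForDeletion, where it waits out the grace period.
    static Path stash(Path tableBase, Path partition) throws IOException {
        Path stashDir = tableBase.resolve(".hoodie").resolve(".stashedForDeletion");
        Files.createDirectories(stashDir);
        return Files.move(partition, stashDir.resolve(partition.getFileName()));
    }

    // (d) Restore API: move a stashed partition back to its original location,
    // undoing an accidental TTL event.
    static Path restore(Path tableBase, String partitionName) throws IOException {
        Path stashed = tableBase.resolve(".hoodie").resolve(".stashedForDeletion").resolve(partitionName);
        return Files.move(stashed, tableBase.resolve(partitionName));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("table");
        Path part = Files.createDirectory(base.resolve("2023-05-01"));
        stash(base, part);
        System.out.println(Files.exists(base.resolve("2023-05-01"))); // false: moved to stash
        restore(base, "2023-05-01");
        System.out.println(Files.exists(base.resolve("2023-05-01"))); // true: recovered
    }
}
```

A real implementation would also need step (e): recording the move so the metadata table's file listings stay consistent with the filesystem.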
[GitHub] [hudi] hudi-bot commented on pull request #8733: [HUDI-6228] Re-enable tests that were flaky before
hudi-bot commented on PR #8733: URL: https://github.com/apache/hudi/pull/8733#issuecomment-1553502569 ## CI report: * d8d892647691c4bdf9f7bb78313db328517d7552 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17145) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17159) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17170) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17175) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17184)
[GitHub] [hudi] hudi-bot commented on pull request #8748: [HUDI-6234] make sure clean is run after flink table service
hudi-bot commented on PR #8748: URL: https://github.com/apache/hudi/pull/8748#issuecomment-1553489437 ## CI report: * bf27bd0b77a73d9e0f101e6d37947c21a86bf47c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17182)
[GitHub] [hudi] yihua commented on issue #8278: [SUPPORT] Deltastreamer Fails with AWSDmsAvroPayload
yihua commented on issue #8278: URL: https://github.com/apache/hudi/issues/8278#issuecomment-1553383695 On my side, I verified that I no longer see the exception using the same script you shared after the fix (I was hitting the issue before the fix).
[GitHub] [hudi] yihua commented on issue #8278: [SUPPORT] Deltastreamer Fails with AWSDmsAvroPayload
yihua commented on issue #8278: URL: https://github.com/apache/hudi/issues/8278#issuecomment-1553382861 Hi @Hans-Raintree sorry for the delay. The merged fix #8690 on master should fix your issue. Could you give it a try? The fix is included in the upcoming 0.13.1 release.
[GitHub] [hudi] hudi-bot commented on pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.
hudi-bot commented on PR #8758: URL: https://github.com/apache/hudi/pull/8758#issuecomment-1553360233 ## CI report: * 364da977ec98223c416a37128e55ab033782f0b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17193)