[GitHub] [hudi] hudi-bot commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8080:
URL: https://github.com/apache/hudi/pull/8080#issuecomment-1474753012

   
   ## CI report:
   
   * cef6b97830348253425a254e5da6147ff755595d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15779)
 
   * 36500b7b23e97c42049e605e70b11aee9b8763ec UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8227: [HUDI-5952] Fix NPE when use kafka callback

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8227:
URL: https://github.com/apache/hudi/pull/8227#issuecomment-1474749029

   
   ## CI report:
   
   * 4cc7ab6ab87a640bcb68c97c55f642fde9ed5ecc UNKNOWN
   * 73861c38e3d67d687ef89ebff0fc08170892fec3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15780)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8227: [HUDI-5952] Fix NPE when use kafka callback

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8227:
URL: https://github.com/apache/hudi/pull/8227#issuecomment-1474703766

   
   ## CI report:
   
   * 4cc7ab6ab87a640bcb68c97c55f642fde9ed5ecc UNKNOWN
   * 73861c38e3d67d687ef89ebff0fc08170892fec3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15780)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8227: [HUDI-5952] Fix NPE when use kafka callback

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8227:
URL: https://github.com/apache/hudi/pull/8227#issuecomment-1474698643

   
   ## CI report:
   
   * 4cc7ab6ab87a640bcb68c97c55f642fde9ed5ecc UNKNOWN
   * 73861c38e3d67d687ef89ebff0fc08170892fec3 UNKNOWN
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8080:
URL: https://github.com/apache/hudi/pull/8080#issuecomment-1474698588

   
   ## CI report:
   
   * cef6b97830348253425a254e5da6147ff755595d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15779)
 
   
   



[jira] [Updated] (HUDI-5952) NullPointerException when use kafka callback

2023-03-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5952:
-
Labels: pull-request-available  (was: )

> NullPointerException when use kafka callback
> 
>
> Key: HUDI-5952
> URL: https://issues.apache.org/jira/browse/HUDI-5952
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hudi-utilities
>Reporter: wuzhenhua
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2023-03-18-11-41-35-135.png
>
>
> hudi.conf:
> hoodie.write.commit.callback.on true
> hoodie.write.commit.callback.class 
> org.apache.hudi.utilities.callback.kafka.HoodieWriteCommitKafkaCallback
> hoodie.write.commit.callback.kafka.bootstrap.servers localhost:9082
> hoodie.write.commit.callback.kafka.topic hudi-callback
> hoodie.write.commit.callback.kafka.partition 1
>  
> !image-2023-03-18-11-41-35-135.png!
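
The callback settings quoted above map directly onto Spark Datasource write options. A minimal sketch (the option names and values are the ones from the report; the table path and the `spark`/`df` objects are hypothetical):

```python
# Commit-callback settings from the report, expressed as a write-options dict.
# Only the commented-out table path below is illustrative.
callback_opts = {
    "hoodie.write.commit.callback.on": "true",
    "hoodie.write.commit.callback.class":
        "org.apache.hudi.utilities.callback.kafka.HoodieWriteCommitKafkaCallback",
    "hoodie.write.commit.callback.kafka.bootstrap.servers": "localhost:9082",
    "hoodie.write.commit.callback.kafka.topic": "hudi-callback",
    "hoodie.write.commit.callback.kafka.partition": "1",
}

# With a SparkSession and a DataFrame `df` in scope, the write would look like:
# df.write.format("hudi").options(**callback_opts).mode("append").save("/tmp/hudi_table")
```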



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8227: [HUDI-5952] Fix NPE when use kafka callback

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8227:
URL: https://github.com/apache/hudi/pull/8227#issuecomment-1474678283

   
   ## CI report:
   
   * 4cc7ab6ab87a640bcb68c97c55f642fde9ed5ecc UNKNOWN
   
   



[GitHub] [hudi] wuzhenhua01 opened a new pull request, #8227: [Hudi-5952] Fix NPE when use kafka callback

2023-03-17 Thread via GitHub


wuzhenhua01 opened a new pull request, #8227:
URL: https://github.com/apache/hudi/pull/8227

   ### Change Logs
   
   Fix NPE when using the Kafka callback
   
   ### Impact
   
   No impact
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   





[jira] [Updated] (HUDI-5952) NullPointerException when use kafka callback

2023-03-17 Thread wuzhenhua (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuzhenhua updated HUDI-5952:

Summary: NullPointerException when use kafka callback  (was: 
NullPointerException used by kafka callback)

> NullPointerException when use kafka callback
> 
>
> Key: HUDI-5952
> URL: https://issues.apache.org/jira/browse/HUDI-5952
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hudi-utilities
>Reporter: wuzhenhua
>Priority: Major
> Attachments: image-2023-03-18-11-41-35-135.png
>
>
> hudi.conf:
> hoodie.write.commit.callback.on true
> hoodie.write.commit.callback.class 
> org.apache.hudi.utilities.callback.kafka.HoodieWriteCommitKafkaCallback
> hoodie.write.commit.callback.kafka.bootstrap.servers localhost:9082
> hoodie.write.commit.callback.kafka.topic hudi-callback
> hoodie.write.commit.callback.kafka.partition 1
>  
> !image-2023-03-18-11-41-35-135.png!





[jira] [Created] (HUDI-5952) NullPointerException used by kafka callback

2023-03-17 Thread wuzhenhua (Jira)
wuzhenhua created HUDI-5952:
---

 Summary: NullPointerException used by kafka callback
 Key: HUDI-5952
 URL: https://issues.apache.org/jira/browse/HUDI-5952
 Project: Apache Hudi
  Issue Type: Bug
  Components: hudi-utilities
Reporter: wuzhenhua
 Attachments: image-2023-03-18-11-41-35-135.png

hudi.conf:

hoodie.write.commit.callback.on true
hoodie.write.commit.callback.class 
org.apache.hudi.utilities.callback.kafka.HoodieWriteCommitKafkaCallback

hoodie.write.commit.callback.kafka.bootstrap.servers localhost:9082
hoodie.write.commit.callback.kafka.topic hudi-callback

hoodie.write.commit.callback.kafka.partition 1

 

!image-2023-03-18-11-41-35-135.png!





[GitHub] [hudi] hudi-bot commented on pull request #8226: [HUDI-5951] Replace format `org.apache.hudi` with short name `hudi` in Spark Datasource

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8226:
URL: https://github.com/apache/hudi/pull/8226#issuecomment-1474650335

   
   ## CI report:
   
   * 27203c20ce6dc7f998109e8a2ee7b1c05201dc61 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15777)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #7245: [HUDI-5238] Fixing `HoodieMergeHandle` shutdown sequence

2023-03-17 Thread via GitHub


hudi-bot commented on PR #7245:
URL: https://github.com/apache/hudi/pull/7245#issuecomment-1474649827

   
   ## CI report:
   
   * 811eac56bb4aedf7b2585002f016e8c3580c49b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15778)
 
   
   



[GitHub] [hudi] renshangtao commented on pull request #8200: The hoodie.datasource.write.row.writer.enable should set to be true.

2023-03-17 Thread via GitHub


renshangtao commented on PR #8200:
URL: https://github.com/apache/hudi/pull/8200#issuecomment-1474584244

   > on, I got it, the default value in config is true. But I think it will not 
lead to the differences of sorting results
   
   You can test it: if the value is false, it will create an RDDCustomColumnsSortPartitioner, whose class description is "A partitioner that does sorting based on specified column values for each RDD partition."
   





[GitHub] [hudi] hudi-bot commented on pull request #8223: [HUDI-5950] Fixing pending instant deduction to trigger compaction in MDT

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8223:
URL: https://github.com/apache/hudi/pull/8223#issuecomment-1474565624

   
   ## CI report:
   
   * 20827b4d293981f971d183ad127d0f94cb82c2d0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15775)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8225: fix to get rid of another issue

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8225:
URL: https://github.com/apache/hudi/pull/8225#issuecomment-1474525639

   
   ## CI report:
   
   * 90a62c81f6240ea04d772b14d2f2c16ae23ad9b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15773)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8080:
URL: https://github.com/apache/hudi/pull/8080#issuecomment-1474525511

   
   ## CI report:
   
   * 7bbcd89076893ece4bfe24d9969148316c33cee2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15511)
 
   * cef6b97830348253425a254e5da6147ff755595d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15779)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-03-17 Thread via GitHub


hudi-bot commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1474525335

   
   ## CI report:
   
   * 1bf2b8c5d85a0c2dbbc2ea8c2274eea254b9d8d4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15774)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #7245: [HUDI-5238] Fixing `HoodieMergeHandle` shutdown sequence

2023-03-17 Thread via GitHub


hudi-bot commented on PR #7245:
URL: https://github.com/apache/hudi/pull/7245#issuecomment-1474524892

   
   ## CI report:
   
   * 9fd17ce64b8fef603422ff1b7644e9bb2f7dea2e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15281)
 
   * 811eac56bb4aedf7b2585002f016e8c3580c49b4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15778)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8226: [HUDI-5951] Replace format `org.apache.hudi` with short name `hudi` in Spark Datasource

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8226:
URL: https://github.com/apache/hudi/pull/8226#issuecomment-1474519204

   
   ## CI report:
   
   * 27203c20ce6dc7f998109e8a2ee7b1c05201dc61 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15777)
 
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8080:
URL: https://github.com/apache/hudi/pull/8080#issuecomment-1474518985

   
   ## CI report:
   
   * 7bbcd89076893ece4bfe24d9969148316c33cee2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15511)
 
   * cef6b97830348253425a254e5da6147ff755595d UNKNOWN
   
   



[GitHub] [hudi] hudi-bot commented on pull request #7245: [HUDI-5238] Fixing `HoodieMergeHandle` shutdown sequence

2023-03-17 Thread via GitHub


hudi-bot commented on PR #7245:
URL: https://github.com/apache/hudi/pull/7245#issuecomment-1474518290

   
   ## CI report:
   
   * 9fd17ce64b8fef603422ff1b7644e9bb2f7dea2e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15281)
 
   * 811eac56bb4aedf7b2585002f016e8c3580c49b4 UNKNOWN
   
   



[GitHub] [hudi] hudi-bot commented on pull request #8226: [HUDI-5951] Replace format `org.apache.hudi` with short name `hudi` in Spark Datasource

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8226:
URL: https://github.com/apache/hudi/pull/8226#issuecomment-1474514890

   
   ## CI report:
   
   * 27203c20ce6dc7f998109e8a2ee7b1c05201dc61 UNKNOWN
   
   



[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5951:
-
Labels: pull-request-available  (was: )

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> HUDI-372 adds support for the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[GitHub] [hudi] danielfordfc commented on pull request #8220: [HUDI-8065] - Using Cached Consumer that clears between batch writes

2023-03-17 Thread via GitHub


danielfordfc commented on PR #8220:
URL: https://github.com/apache/hudi/pull/8220#issuecomment-1474513310

   Got the branch name wrong so this is unnecessary





[GitHub] [hudi] danielfordfc closed pull request #8220: [HUDI-8065] - Using Cached Consumer that clears between batch writes

2023-03-17 Thread via GitHub


danielfordfc closed pull request #8220: [HUDI-8065] - Using Cached Consumer 
that clears between batch writes
URL: https://github.com/apache/hudi/pull/8220





[GitHub] [hudi] gudladona commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation

2023-03-17 Thread via GitHub


gudladona commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1474500670

   We may have some indicators of what is causing this problem.
   
   We have a small-file limit of 100 MB; this appears to work well (it makes larger files and cleans up smaller ones) for an average partition that meets the size requirement.
   
   However, for a partition that is very busy/high volume, it seems to over-bucket the inserts into many files: based on the average record size and the size of the new inserts, it would always exceed the file-size limit, causing it to write to a new file group.
   
   For example, here is the number of file groups written per instant (commit) in this partition:
   
   ```
   aws s3 ls s3:/date=20230316/ | awk -F _ '{print 
$3}' | sort | uniq -c | sort -nk1  | tail
167 20230316203454183.parquet
168 20230316195218670.parquet
168 20230316201208079.parquet
170 20230316200728433.parquet
175 20230316210557345.parquet
180 20230316130454342.parquet
182 20230316212237421.parquet
211 20230316192405566.parquet
245 20230316210251305.parquet
263 20230316204926437.parquet
   ```
   
   As we can see, the sheer number of small files in this partition is causing a huge JSON response from the driver, thereby triggering OOM errors.
   
   We need help figuring out how to tune this.
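
The shell pipeline above extracts the commit instant (the third `_`-separated token of a Hudi base-file name) and counts how many base files share each one. A Python equivalent of that counting, using hypothetical file names:

```python
from collections import Counter

# Hypothetical Hudi base-file names: <fileId>_<writeToken>_<instant>.parquet
names = [
    "a1b2-0_0-10-11_20230316204926437.parquet",
    "c3d4-0_0-10-12_20230316204926437.parquet",
    "e5f6-0_0-10-13_20230316210251305.parquet",
]

# Same grouping as `awk -F _ '{print $3}' | sort | uniq -c | sort -nk1`:
per_instant = Counter(name.split("_")[2] for name in names)
for instant, count in sorted(per_instant.items(), key=lambda kv: kv[1]):
    print(count, instant)
# prints:
# 1 20230316210251305.parquet
# 2 20230316204926437.parquet
```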





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Status: Patch Available  (was: In Progress)

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> HUDI-372 adds support for the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Fix Version/s: 0.14.0

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> HUDI-372 adds support for the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Status: In Progress  (was: Open)

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> HUDI-372 adds support for the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[GitHub] [hudi] yihua opened a new pull request, #8226: [MINOR] Replace format `org.apache.hudi` with `hudi`

2023-03-17 Thread via GitHub


yihua opened a new pull request, #8226:
URL: https://github.com/apache/hudi/pull/8226

   ### Change Logs
   
   HUDI-372 (#1054) adds support for the short name `"hudi"` in Spark 
Datasource read and write (`df.write.format("hudi")`, 
`df.read.format("hudi")`).  All places should use "hudi" with `format()` now.
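
As a sketch of what the clean-up amounts to (assuming a SparkSession, a DataFrame `df`, and a table `base_path`; both provider names resolve to the same Hudi datasource):

```python
# Before: the fully qualified provider name.
# df.write.format("org.apache.hudi").save(base_path)
# df.read.format("org.apache.hudi").load(base_path)

# After: the registered short name supported since HUDI-372.
# df.write.format("hudi").save(base_path)
# df.read.format("hudi").load(base_path)

# The two names are interchangeable aliases for the same datasource;
# the change standardizes on the short one.
LONG_NAME = "org.apache.hudi"
SHORT_NAME = "hudi"
```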
   
   ### Impact
   
   None.  Code clean-up.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Description: HUDI-372 adds support of the short name "hudi" in Spark 
Datasource read and write (df.write.format("hudi"), df.read.format("hudi")).  
All places should use "hudi" with format() now.  (was: HUDI-372 adds supports 
of the short name "hudi" in Spark Datasource read and write 
(df.write.format("hudi"), df.read.format("hudi")).  All places should use 
"hudi" with format() now.)

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> HUDI-372 adds support of the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Description: HUDI-372 adds support for the short name "hudi" in Spark 
Datasource read and write (df.write.format("hudi"), df.read.format("hudi")).  
All places should use "hudi" with format() now.  (was: HUDI-372 adds support of 
the short name "hudi" in Spark Datasource read and write 
(df.write.format("hudi"), df.read.format("hudi")).  All places should use 
"hudi" with format() now.)

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> HUDI-372 adds support for the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Story Points: 0.5

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> HUDI-372 adds supports of the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Description: HUDI-372 adds support of the short name "hudi" in Spark 
Datasource read and write (df.write.format("hudi"), df.read.format("hudi")).  
All places should use "hudi" with format() now.

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
>
> HUDI-372 adds support of the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[jira] [Assigned] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-5951:
---

Assignee: Ethan Guo

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> HUDI-372 adds support of the short name "hudi" in Spark Datasource read and 
> write (df.write.format("hudi"), df.read.format("hudi")).  All places should 
> use "hudi" with format() now.





[jira] [Updated] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource

2023-03-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5951:

Summary: Replace format "org.apache.hudi" with short name "hudi" in Spark 
Datasource  (was: Replace format "org.apache.hudi" with short name "hudi" in 
Spark datasource)

> Replace format "org.apache.hudi" with short name "hudi" in Spark Datasource
> ---
>
> Key: HUDI-5951
> URL: https://issues.apache.org/jira/browse/HUDI-5951
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Major
>






[jira] [Created] (HUDI-5951) Replace format "org.apache.hudi" with short name "hudi" in Spark datasource

2023-03-17 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-5951:
---

 Summary: Replace format "org.apache.hudi" with short name "hudi" 
in Spark datasource
 Key: HUDI-5951
 URL: https://issues.apache.org/jira/browse/HUDI-5951
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Ethan Guo








[hudi] branch master updated: [HUDI-5920] Improve documentation of parallelism configs (#8157)

2023-03-17 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e51b4575cb7 [HUDI-5920] Improve documentation of parallelism configs 
(#8157)
e51b4575cb7 is described below

commit e51b4575cb7642eb61bcc02d95c99466dd3e8eda
Author: Y Ethan Guo 
AuthorDate: Fri Mar 17 15:17:24 2023 -0700

[HUDI-5920] Improve documentation of parallelism configs (#8157)

This commit improves the documentation for the following parallelism 
configs:
- hoodie.archive.delete.parallelism
- hoodie.bloom.index.parallelism
- hoodie.simple.index.parallelism
- hoodie.global.simple.index.parallelism
- hoodie.insert.shuffle.parallelism
- hoodie.bulkinsert.shuffle.parallelism
- hoodie.upsert.shuffle.parallelism
- hoodie.delete.shuffle.parallelism
- hoodie.rollback.parallelism
---
 .../apache/hudi/config/HoodieArchivalConfig.java   |  9 +++-
 .../org/apache/hudi/config/HoodieIndexConfig.java  | 17 ++--
 .../org/apache/hudi/config/HoodieWriteConfig.java  | 48 ++
 3 files changed, 63 insertions(+), 11 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java
index 681ca20baee..fdfdab5897a 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java
@@ -63,7 +63,14 @@ public class HoodieArchivalConfig extends HoodieConfig {
   public static final ConfigProperty<Integer> 
DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE = ConfigProperty
   .key("hoodie.archive.delete.parallelism")
   .defaultValue(100)
-  .withDocumentation("Parallelism for deleting archived hoodie commits.");
+  .withDocumentation("When performing archival operation, Hudi needs to 
delete the files of "
+  + "the archived instants in the active timeline in .hoodie folder. 
The file deletion "
+  + "also happens after merging small archived files into larger ones 
if enabled. "
+  + "This config limits the Spark parallelism for deleting files in 
both cases, i.e., "
+  + "parallelism of deleting files does not go above the configured 
value and the "
+  + "parallelism is the number of files to delete if smaller than the "
+  + "configured value.  If you see that the file deletion in archival 
operation is slow "
+  + "because of the limited parallelism, you can increase this to tune 
the performance.");
 
   public static final ConfigProperty MIN_COMMITS_TO_KEEP = 
ConfigProperty
   .key("hoodie.keep.min.commits")
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
index 366e6aa4c04..0767fc46a56 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
@@ -115,7 +115,10 @@ public class HoodieIndexConfig extends HoodieConfig {
   .defaultValue("0")
   .withDocumentation("Only applies if index type is BLOOM. "
   + "This is the amount of parallelism for index lookup, which 
involves a shuffle. "
-  + "By default, this is auto computed based on input workload 
characteristics.");
+  + "By default, this is auto computed based on input workload 
characteristics. "
+  + "If the parallelism is explicitly configured by the user, the 
user-configured "
+  + "value is used in defining the actual parallelism. If the indexing 
stage is slow "
+  + "due to the limited parallelism, you can increase this to tune the 
performance.");
 
   public static final ConfigProperty BLOOM_INDEX_PRUNE_BY_RANGES = 
ConfigProperty
   .key("hoodie.bloom.index.prune.by.ranges")
@@ -181,13 +184,21 @@ public class HoodieIndexConfig extends HoodieConfig {
   .key("hoodie.simple.index.parallelism")
   .defaultValue("100")
   .withDocumentation("Only applies if index type is SIMPLE. "
-  + "This is the amount of parallelism for index lookup, which 
involves a Spark Shuffle");
+  + "This limits the parallelism of fetching records from the base 
files of affected "
+  + "partitions. The index picks the configured parallelism if the 
number of base "
+  + "files is larger than this configured value; otherwise, the number 
of base files "
+  + "is used as the parallelism. If the indexing stage is slow due to 
the limited "
+  + "parallelism, you can increase this to tune the performance.");
 
   p
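The doc strings in this commit all describe the same capping rule: the effective Spark parallelism is the configured value when there are more files (or input units) than that value, and otherwise it is the file count itself. A minimal sketch of that rule (the helper name is illustrative, not Hudi code):

```python
def effective_parallelism(configured: int, num_files: int) -> int:
    """Capping rule from the updated config docs: never spawn more
    tasks than there are files, and never exceed the configured value."""
    return min(configured, num_files)

# With the default hoodie.archive.delete.parallelism of 100:
print(effective_parallelism(100, 20))    # 20 files to delete -> 20 tasks
print(effective_parallelism(100, 5000))  # capped at the configured 100
```

If archival file deletion is slow, raising the configured value only helps when the number of files to delete exceeds the current cap.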

[GitHub] [hudi] yihua merged pull request #8157: [HUDI-5920] Improve documentation of parallelism configs

2023-03-17 Thread via GitHub


yihua merged PR #8157:
URL: https://github.com/apache/hudi/pull/8157


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8223: [HUDI-5950] Fixing pending instant deduction to trigger compaction in MDT

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8223:
URL: https://github.com/apache/hudi/pull/8223#issuecomment-1474440744

   
   ## CI report:
   
   * 20827b4d293981f971d183ad127d0f94cb82c2d0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15775)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-03-17 Thread via GitHub


hudi-bot commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1474440222

   
   ## CI report:
   
   * 4e7da703304c7783e9771e931e39854adf6458d6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15153)
 
   * 1bf2b8c5d85a0c2dbbc2ea8c2274eea254b9d8d4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15774)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8225: fix to get rid of another issue

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8225:
URL: https://github.com/apache/hudi/pull/8225#issuecomment-1474434417

   
   ## CI report:
   
   * 90a62c81f6240ea04d772b14d2f2c16ae23ad9b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15773)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8223: [HUDI-5950] Fixing pending instant deduction to trigger compaction in MDT

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8223:
URL: https://github.com/apache/hudi/pull/8223#issuecomment-1474434388

   
   ## CI report:
   
   * 20827b4d293981f971d183ad127d0f94cb82c2d0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-03-17 Thread via GitHub


hudi-bot commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1474433372

   
   ## CI report:
   
   * 4e7da703304c7783e9771e931e39854adf6458d6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15153)
 
   * 1bf2b8c5d85a0c2dbbc2ea8c2274eea254b9d8d4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8225: fix to get rid of another issue

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8225:
URL: https://github.com/apache/hudi/pull/8225#issuecomment-1474428513

   
   ## CI report:
   
   * 90a62c81f6240ea04d772b14d2f2c16ae23ad9b9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-03-17 Thread via GitHub


nsivabalan commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1474414016

   Since this has been approved already, will go ahead and merge once CI is 
green.





[GitHub] [hudi] nsivabalan commented on pull request #8176: [HUDI-5929] Automatically infer key generator type

2023-03-17 Thread via GitHub


nsivabalan commented on PR #8176:
URL: https://github.com/apache/hudi/pull/8176#issuecomment-1474412360

   Also, do you think we should fix our quick start to use the key gen type 
instead of the class name? 
   We might also need to add docs around this auto-inference and clarify 
that users don't need to set the key gen class or type for some of the key 
generators.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #8176: [HUDI-5929] Automatically infer key generator type

2023-03-17 Thread via GitHub


nsivabalan commented on code in PR #8176:
URL: https://github.com/apache/hudi/pull/8176#discussion_r1140709444


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java:
##
@@ -75,40 +79,60 @@ public static KeyGenerator 
createKeyGenerator(TypedProperties props) throws IOEx
 }
   }
 
+  /**
+   * @param type {@link KeyGeneratorType} enum.
+   * @return The key generator class name for Spark based on the {@link 
KeyGeneratorType}.
+   */
+  public static String getKeyGeneratorClassNameFromType(KeyGeneratorType type) 
{
+switch (type) {
+  case SIMPLE:
+return SimpleKeyGenerator.class.getName();
+  case COMPLEX:
+return ComplexKeyGenerator.class.getName();
+  case TIMESTAMP:
+return TimestampBasedKeyGenerator.class.getName();
+  case CUSTOM:
+return CustomKeyGenerator.class.getName();
+  case NON_PARTITION:
+return NonpartitionedKeyGenerator.class.getName();
+  case GLOBAL_DELETE:
+return GlobalDeleteKeyGenerator.class.getName();
+  default:
+throw new HoodieKeyGeneratorException("Unsupported keyGenerator Type " 
+ type);
+}
+  }
+
+  /**
+   * Infers the key generator type based on the record key and partition 
fields.
+   * If neither of the record key and partition fields are set, the default 
type is returned.
+   *
+   * @param props Properties from the write config.
+   * @return Inferred key generator type.
+   */
+  public static KeyGeneratorType 
inferKeyGeneratorTypeFromWriteConfig(TypedProperties props) {

Review Comment:
   is this required to be public ? 



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/catalyst/catalog/HoodieCatalogTable.scala:
##
@@ -300,7 +302,7 @@ class HoodieCatalogTable(val spark: SparkSession, var 
table: CatalogTable) exten
   val primaryKeys = 
table.properties.getOrElse(SQL_KEY_TABLE_PRIMARY_KEY.sqlKeyName, 
table.storage.properties.get(SQL_KEY_TABLE_PRIMARY_KEY.sqlKeyName)).toString
   val partitions = table.partitionColumnNames.mkString(",")
   extraConfig(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key) =
-DataSourceOptionsHelper.inferKeyGenClazz(primaryKeys, partitions)
+getKeyGeneratorClassNameFromType(inferKeyGeneratorType(primaryKeys, 
partitions))

Review Comment:
   should we introduce InferKeyGenClassfromProps(), which internally calls 
inferKeyGeneratorType()?



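The switch statement and the inference helper under review can be sketched together in a few lines of Python. The fully qualified class names below are Hudi's real ones; the inference rule (no partition field means non-partitioned, a single record key and partition field means simple, otherwise complex) is an assumption based on this PR's discussion, not the exact Java implementation:

```python
# Hypothetical Python mirror of HoodieSparkKeyGeneratorFactory's mapping;
# only three of the six enum values are shown for brevity.
KEY_GEN_CLASSES = {
    "SIMPLE": "org.apache.hudi.keygen.SimpleKeyGenerator",
    "COMPLEX": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "NON_PARTITION": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

def infer_key_generator_type(record_keys: str, partitions: str) -> str:
    # Assumed rule: fields are comma-separated lists; an empty partition
    # list means a non-partitioned table.
    if not partitions:
        return "NON_PARTITION"
    if "," not in record_keys and "," not in partitions:
        return "SIMPLE"
    return "COMPLEX"

print(KEY_GEN_CLASSES[infer_key_generator_type("uuid", "part")])
# -> org.apache.hudi.keygen.SimpleKeyGenerator
```

With this split, resolving a class name from props is just the composition of the two helpers, which is what the review comment suggests.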



[GitHub] [hudi] jonvex opened a new pull request, #8225: fix to get rid of another issue

2023-03-17 Thread via GitHub


jonvex opened a new pull request, #8225:
URL: https://github.com/apache/hudi/pull/8225

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

2023-03-17 Thread via GitHub


berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474390249

   Also, also, the target table doesn't exist before the run.





[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

2023-03-17 Thread via GitHub


berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474389849

   Also, no data is written to the table.





[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

2023-03-17 Thread via GitHub


berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474387765

   I tried this on `emr-6.9.0` and got the same result.
   
   The other settings from my properties file are being propagated over to the 
file in `.hoodie` like `hoodie.table.partition.fields=origin_schema` and 
`hoodie.table.recordkey.fields=origin_schema,id`.
   
   It's just that the `hoodie.datasource.write.precombine.field = updated_at` 
setting in my properties file ends up as `hoodie.table.precombine.field=ts` in 
the table properties file.





[GitHub] [hudi] berniedurfee-renaissance opened a new issue, #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

2023-03-17 Thread via GitHub


berniedurfee-renaissance opened a new issue, #8224:
URL: https://github.com/apache/hudi/issues/8224

   I'm running Deltastreamer in AWS EMR Serverless and it seems that 
Deltastreamer is ignoring `hoodie.datasource.write.precombine.field` in my 
config file.

   **To Reproduce**
   
   Setup a source bucket of parquet files (Mine are from AWS DMS): 
`s3://my-lakehouse/dms-output-raw/schema_1/table_1/`
   
   Add a properties file 
(`s3://my-lakehouse/deltastreamer-config/deltastreamer.properties`): 
   ```
   hoodie.schema.on.read.enable = true
   hoodie.datasource.write.recordkey.field = origin_schema,id
   hoodie.datasource.write.precombine.field = updated_at
   hoodie.datasource.write.partitionpath.field = origin_schema
   hoodie.datasource.write.keygenerator.class = 
org.apache.hudi.keygen.ComplexKeyGenerator
   
hoodie.deltastreamer.source.dfs.root=s3://my-lakehouse/dms-output-raw/schema_1/table_1/
   ```
   
   Submit a job to EMR Serverless 6.10.0
   - Script Location: 
`s3://my-lakehouse/deltastreamer-jar/hudi-utilities-slim-bundle_2.12-0.13.0.jar`
   - Main Class: `org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer`
   - Script arguments
   ```
   
["--table-type","COPY_ON_WRITE","--target-base-path","s3://my-lakehouse/deltastreamer-out","--target-table","table1","--source-class","org.apache.hudi.utilities.sources.ParquetDFSSource","--props","s3://my-lakehouse/deltastreamer-config/deltastreamer.properties"]
   ```
   - Properties key 1: `spark.serializer` = 
`org.apache.spark.serializer.KryoSerializer`
   - Properties key 2: `spark.jars` = 
`s3://ec-lakehouse-qa/deltastreamer-jar/hudi-spark3-bundle_2.12-0.13.0.jar`
   
   Everything else is default. I tried 0.12 and 0.13, but same result.
   
   **Expected behavior**
   
   Rows from source are upserted to destination.
   
   Deltastreamer fails and when I look at 
`/my-lakehouse/deltastreamer-out/mytable1/.hoodie/hoodie.properties` I can see 
that `hoodie.table.precombine.field=ts`. That should be 
`hoodie.table.precombine.field=updated_at` because it's what's in the 
properties file, right?
   
   **Environment Description**
   
   * Hudi version : 0.13
   
   * Spark version : 3.3.1
   
   * Hive version : Not sure for EMR Serverless 6.10.0
   
   * Hadoop version : Not sure for EMR Serverless 6.10.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : EMR Serverless 6.10.0
   
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   ```Job failed, please check complete logs in configured logging destination. 
ExitCode: 1. Last few exceptions: Caused by: 
org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in 
record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, 
xxx, xxx, created_at, updated_at, origin_schema, origin_table] Exception in 
thread "main" org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 
in stage 2.0 (TID 5) ([2600:1f13:4df:c101:e7b0:42bd:e2f6:f474] executor 1): 
org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in 
record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, 
xxx, xxx, created_at, updated_at, origin_schema, origin_table] Caused by: 
org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in 
record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, 
xxx, xxx, created_at, updated_at, origin_schema, origin_table] 
org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 2.0 (TID 5) ([2600:1f13:4df:c101:e7b0:42bd:e2f6:f474] 
executor 1): org.apache.hudi.exception.HoodieException: ts(Part -ts) field not 
found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, 
xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table] 
23/03/17 20:13:39 ERROR HoodieDeltaStreamer: Got error running delta sync once. 
Shutting down...```


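The stacktrace above is a schema-validation failure: the table ended up with the default precombine field `ts`, which is not among the record's fields. A tiny sketch of that check (illustrative only, not DeltaStreamer code):

```python
def check_precombine(field: str, record_fields: list) -> None:
    # Raise the same kind of error seen in the stacktrace when the
    # configured (or defaulted) precombine field is absent from the schema.
    if field not in record_fields:
        raise ValueError(f"{field} field not found in record. "
                         f"Acceptable fields were: {record_fields}")

fields = ["created_at", "updated_at", "origin_schema", "origin_table"]
check_precombine("updated_at", fields)  # the intended config: passes
try:
    check_precombine("ts", fields)      # the defaulted value: fails
except ValueError as e:
    print(e)
```

This matches the symptom in the report: the configured `updated_at` would validate, while the `ts` that landed in `hoodie.properties` cannot.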



[jira] [Updated] (HUDI-5950) Fix pending instants determination for MDT compaction

2023-03-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5950:
-
Labels: pull-request-available  (was: )

> Fix pending instants determination for MDT compaction
> -
>
> Key: HUDI-5950
> URL: https://issues.apache.org/jira/browse/HUDI-5950
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Currently the logic to deduce pending instants in MDT is as below
>  * a = we get latest completed delta commit from MDT.
>  * Find any inflights in DT timeline *before* \{a}
> and if we don't find any such inflights, we will go ahead and possibly 
> compact the MDT.
> But what if the latest delta commit succeeded in the MDT but failed in the 
> DT? It could potentially result in triggering compaction in the MDT, which 
> should not happen. 
>  
> So, the right fix is 
>  * a = we get latest completed delta commit from MDT.
>  * Find any inflights in DT timeline *before or equals* to \{a}
> This should take care of not triggering compaction in the MDT when there are 
> inflights in the DT that are already committed to the MDT. 
>  



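The "before" vs. "before or equals" distinction in this fix can be sketched in a few lines. Hudi instant times are strings that order lexicographically; the helper below is illustrative, not the actual Hudi API:

```python
# Sketch of the fixed pending-instant check described above.
def can_compact_mdt(latest_mdt_delta_commit: str,
                    dt_inflight_instants: list) -> bool:
    # Fixed logic: block MDT compaction if any DT inflight instant is at
    # or before the latest completed MDT delta commit ("before or equals").
    return not any(t <= latest_mdt_delta_commit for t in dt_inflight_instants)

# A DT inflight equal to the latest MDT delta commit (the partially
# failed commit case) now blocks compaction:
print(can_compact_mdt("20230317", ["20230317"]))  # False
print(can_compact_mdt("20230317", ["20230318"]))  # True
```

Under the old "strictly before" comparison, the first call would have returned True and allowed a compaction over an uncommitted DT instant.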


[GitHub] [hudi] nsivabalan opened a new pull request, #8223: [HUDI-5950] Fixing pending instant deduction to trigger compaction in MDT

2023-03-17 Thread via GitHub


nsivabalan opened a new pull request, #8223:
URL: https://github.com/apache/hudi/pull/8223

   ### Change Logs
   
   Fixing a corner case bug where compaction in MDT could get triggered w/ 
partially failed commit in DT. 
   
   Currently the logic to deduce pending instants in MDT is as below
   
   a = we get latest completed delta commit from MDT.
   Find any inflights in DT timeline **before** {a}
   and if we don't find any such inflights, we will go ahead and possibly 
compact the MDT.
   
   But what if the latest delta commit succeeded in the MDT but failed in the 
DT? It could potentially result in triggering compaction in the MDT, which 
should not happen. 
   
   So, the right fix is 
   
   a = we get latest completed delta commit from MDT.
   Find any inflights in DT timeline **before or equals** to {a}
   This should take care of not triggering compaction in the MDT when there 
are inflights in the DT that are already committed to the MDT. 
   
   ### Impact
   
   Stabilizes metadata table. 
   
   ### Risk level (write none, low medium or high below)
   
   medium
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #8220: [HUDI-8065] - Using Cached Consumer that clears between batch writes

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8220:
URL: https://github.com/apache/hudi/pull/8220#issuecomment-1474334767

   
   ## CI report:
   
   * 586f0e6c2bab9701e4017616dbb8b667387fcedd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15772)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Created] (HUDI-5950) Fix pending instants determination for MDT compaction

2023-03-17 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-5950:
-

 Summary: Fix pending instants determination for MDT compaction
 Key: HUDI-5950
 URL: https://issues.apache.org/jira/browse/HUDI-5950
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


Currently the logic to deduce pending instants in MDT is as below
 * a = we get latest completed delta commit from MDT.
 * Find any inflights in DT timeline *before* \{a}

and if we don't find any such inflights, we will go ahead and possibly 
compact the MDT.

But what if the latest delta commit succeeded in the MDT but failed in the 
DT? It could potentially result in triggering compaction in the MDT, which 
should not happen. 

 

So, the right fix is 
 * a = we get latest completed delta commit from MDT.
 * Find any inflights in DT timeline *before or equals* to \{a}

This should take care of not triggering compaction in the MDT when there are 
inflights in the DT that are already committed to the MDT. 

 





[GitHub] [hudi] parisni opened a new issue, #8222: [SUPPORT] Incremental read with MOR does not work as COW

2023-03-17 Thread via GitHub


parisni opened a new issue, #8222:
URL: https://github.com/apache/hudi/issues/8222

   Tested on 0.12 and 0.13 with Spark 3.2.1.
   
   COW and MOR tables do not return the same result for incremental reads. 
Here is reproducible source code plus the output for both.
   
   For the last commit,
   MOR returns:
   ```
   
+---++--+--+++---++
   
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|uuid| ts|part|
   
+---++--+--+++---++
   |  20230317202718414|20230317202718414...| a|
   foo|20c4c9f3-2cbc-439...|   a|  9| foo|
   
+---++--+--+++---++
   ```
   while COW returns :
   ```
   
+---++--+--+-++---++
   
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|uuid|
 ts|part|
   
+---++--+--+-++---++
   
+---++--+--+-++---++
   ```
   COW is right and MOR is wrong: the precombine value of the last commit is 
lower than that of the existing record, so the record should not be shown.
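   The expected payload semantics can be sketched in plain Python (a simplification of `DefaultHoodieRecordPayload`'s documented precombine behavior; the function and field names are illustrative):

```python
def combine_and_get_update(stored, incoming, precombine_field="ts"):
    """Sketch of DefaultHoodieRecordPayload semantics: the incoming record
    wins only when its precombine value is >= the stored record's value;
    otherwise the stored record is kept (a late, lower-ts update is
    dropped)."""
    if incoming[precombine_field] >= stored[precombine_field]:
        return incoming
    return stored
```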
   
   ```python
   tableName = "test_hudi_pyspark_local"
   basePath = f"/tmp/{tableName}"
   
   df = spark.sql("select 'a' as uuid, 10 as ts, 'foo' as part")
   #mode="COPY_ON_WRITE"
   mode="MERGE_ON_READ"
   hudi_options = {
   "hoodie.table.name": tableName,
   "hoodie.datasource.write.table.type": mode,
   "hoodie.datasource.write.payload.class": 
"org.apache.hudi.common.model.DefaultHoodieRecordPayload",
   "hoodie.datasource.write.recordkey.field": "uuid",
   "hoodie.datasource.write.partitionpath.field": "part",
   "hoodie.datasource.write.table.name": tableName,
   "hoodie.datasource.write.operation": "upsert",
   "hoodie.datasource.write.precombine.field": "ts",
   "hoodie.upsert.shuffle.parallelism": 2,
   "hoodie.insert.shuffle.parallelism": 2,
   "hoodie.datasource.hive_sync.enable": "false",
   "hoodie.datasource.hive_sync.partition_fields": "part",
   "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
   }
   
(df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath))
   spark.read.format("hudi").load(basePath).show()
   
+---++--+--+++---++
   
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|uuid| ts|part|
   
+---++--+--+++---++
   |  20230317202715405|20230317202715405...| a|
   foo|20c4c9f3-2cbc-439...|   a| 10| foo|
   
+---++--+--+++---++
   
   
   df = spark.sql("select 'a' as uuid, 11 as ts, 'foo' as part")
   
(df.write.format("hudi").options(**hudi_options).mode("append").save(basePath))
   spark.read.format("hudi").load(basePath).show()
   
+---++--+--+++---++
   
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|uuid| ts|part|
   
+---++--+--+++---++
   |  20230317202717144|20230317202717144...| a|
   foo|20c4c9f3-2cbc-439...|   a| 11| foo|
   
+---++--+--+++---++
   
   
   df = spark.sql("select 'a' as uuid, 9 as ts, 'foo' as part")
   
(df.write.format("hudi").options(**hudi_options).mode("append").save(basePath))
   spark.read.format("hudi").load(basePath).show()
   
+---++--+--+++---++
   
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|uuid| ts|part|
   
+---++--+--+++---++
   |  20230317202717144|20230317202717144...| a|
   foo|20c4c9f3-2cbc-439...|   a| 11| foo|
   
+---++

[GitHub] [hudi] nikspatel03 opened a new issue, #8221: [SUPPORT] Can we use "Amazon S3 Multi-Region Access Points" with Hudi ?

2023-03-17 Thread via GitHub


nikspatel03 opened a new issue, #8221:
URL: https://github.com/apache/hudi/issues/8221

   Hello Team,
   
   Can we use the Amazon S3 Multi-Region Access Points feature with Hudi as 
part of a disaster recovery solution? Is there any extra configuration 
required to use it? 
   
   I read about this disaster recovery set up in Iceberg documentation: 
   Ref: https://iceberg.apache.org/docs/latest/aws/#s3-access-points
   
   Thanks,
   Nikul


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8219: [HUDI-5949] Check the write operation configured by user for better troubleshooting

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8219:
URL: https://github.com/apache/hudi/pull/8219#issuecomment-1474254403

   
   ## CI report:
   
   * 64fab4bc69f94c44aeba7065adcaea5d24cf9973 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15771)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #8191: Unable to execute HTTP request | connection timeout issues

2023-03-17 Thread via GitHub


nsivabalan commented on issue #8191:
URL: https://github.com/apache/hudi/issues/8191#issuecomment-1474218516

   Got it, thanks. Will watch out for fixes around connection leaks that went 
into 0.13.0 and will pick those for 0.12.3.
   Thanks for confirming. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8220: [HUDI-8065] - Using Cached Consumer that clears between batch writes

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8220:
URL: https://github.com/apache/hudi/pull/8220#issuecomment-1474210693

   
   ## CI report:
   
   * 586f0e6c2bab9701e4017616dbb8b667387fcedd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15772)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #8178: Duplicate data in MOR table Hudi

2023-03-17 Thread via GitHub


nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1474205280

   Probably, here is what you can do:
   1. Query the table to find all duplicates. 
   2. Store the dupes in some staging location (maybe df.write.parquet). 
   3. Issue deletes for these records against Hudi. 
   4. For the same batch, de-duplicate to pick one version of each record and 
ingest it into Hudi using upsert. 
   
   If anything crashes in between, you still have the staging data. This is 
just to ensure that if your process crashes after deleting from the Hudi 
table, you do not lose track of the records, because a snapshot query is not 
going to return them. 
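   Step 4's de-duplication can be sketched in plain Python (field names are illustrative; in practice this would be a DataFrame groupBy/window over the staged duplicates):

```python
def dedupe(records, key_field="uuid", precombine_field="ts"):
    """Keep one version per record key, choosing the highest precombine
    value, before upserting the batch back into the table (step 4 above)."""
    best = {}
    for rec in records:
        key = rec[key_field]
        if key not in best or rec[precombine_field] > best[key][precombine_field]:
            best[key] = rec
    return list(best.values())
```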
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #8178: Duplicate data in MOR table Hudi

2023-03-17 Thread via GitHub


nsivabalan commented on issue #8178:
URL: https://github.com/apache/hudi/issues/8178#issuecomment-1474202023

   btw, `repair deduplicate` does not work for MOR tables :( 
   so you have to write some code in the application layer to fix the 
duplicates, unfortunately. Sorry about that. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8220: [HUDI-8065] - Using Cached Consumer that clears between batch writes

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8220:
URL: https://github.com/apache/hudi/pull/8220#issuecomment-1474195058

   
   ## CI report:
   
   * 586f0e6c2bab9701e4017616dbb8b667387fcedd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #7839: [BUG] the deleted data reappeared after clustering on the table

2023-03-17 Thread via GitHub


nsivabalan commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1474193835

   @MihawkZoro : did you try inline clustering? Did that also result in 
deletes re-appearing? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danielfordfc commented on issue #8065: [SUPPORT] Deltastreamer AvroKafka Schema Evolution transiently failing in --continuous mode

2023-03-17 Thread via GitHub


danielfordfc commented on issue #8065:
URL: https://github.com/apache/hudi/issues/8065#issuecomment-1474170908

   So I knew very little about the code itself and have never actually used 
Java before, but I've begun working out the code hierarchy for the 
deltastreamer and commenting parts of the code to explain it to my 
colleagues, and I've started having a stab at a cached consumer that clears 
its cache when the source is instantiated.
   
   Currently in a `draft`, but I will learn and iterate slowly here. If 
you're bored please feel free to poke fun at me 😆 
   
   https://github.com/apache/hudi/pull/8220/files
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danielfordfc opened a new pull request, #8220: [HUDI-8065] - Using Cached Consumer that clears between batch writes

2023-03-17 Thread via GitHub


danielfordfc opened a new pull request, #8220:
URL: https://github.com/apache/hudi/pull/8220

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8218: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8218:
URL: https://github.com/apache/hudi/pull/8218#issuecomment-1474103127

   
   ## CI report:
   
   * b2db2061014b8f7802e73876c7f96230333c2022 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15770)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8219: [HUDI-5949] Check the write operation configured by user for better troubleshooting

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8219:
URL: https://github.com/apache/hudi/pull/8219#issuecomment-1474037956

   
   ## CI report:
   
   * 64fab4bc69f94c44aeba7065adcaea5d24cf9973 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15771)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8219: [HUDI-5949] Check the write operation configured by user for better troubleshooting

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8219:
URL: https://github.com/apache/hudi/pull/8219#issuecomment-1474027492

   
   ## CI report:
   
   * 64fab4bc69f94c44aeba7065adcaea5d24cf9973 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5949) Check the write operation configured by user for better troubleshooting

2023-03-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5949:
-
Labels: pull-request-available  (was: )

> Check the write operation configured by user for better troubleshooting
> ---
>
> Key: HUDI-5949
> URL: https://issues.apache.org/jira/browse/HUDI-5949
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs
>Reporter: Wechar
>Priority: Blocker
>  Labels: pull-request-available
>
>  *Background:*
> We find that Spark-Hudi insert data will return a *HoodieException: (Part -) 
> field not found in record. Acceptable fields were :[uuid, name, price]*
> {code:bash}
>   ..
>   at 
> org.apache.hudi.index.simple.HoodieSimpleIndex.fetchRecordLocationsForAffectedPartitions(HoodieSimpleIndex.java:142)
>   at 
> org.apache.hudi.index.simple.HoodieSimpleIndex.tagLocationInternal(HoodieSimpleIndex.java:113)
>   at 
> org.apache.hudi.index.simple.HoodieSimpleIndex.tagLocation(HoodieSimpleIndex.java:91)
>   at 
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:51)
>   at 
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:34)
>   at 
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
>   ... 52 more
> Caused by: org.apache.hudi.exception.HoodieException: (Part -) field not 
> found in record. Acceptable fields were :[uuid, name, price]
>   at 
> org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:530)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$11(HoodieSparkSqlWriter.scala:305)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1509)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit 
> time 20230317222153522
>   at 
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)
> {code}
> {*}Steps to Reproduce{*}:
> {code:sql}
> -- 1. create a table without preCombineKey
> CREATE TABLE default.test_hudi_default (
>   uuid int,
>   name string,
>   price double
> ) USING hudi;
> -- 2. config write operation to upsert
> set hoodie.datasource.write.operation=upsert;
> -- 3. insert data and exception occurs
> insert into default.test_hudi_default select 1, 'name1', 1.1;
> {code}
> *Root Cause:*
> Hudi does not support upsert for table without preCombineKey, but this 
> exception message may confuse the users.
> *Improvement:*
> We can check the user configured write operation and provide a more specific 
> exception message, it will help user understand what's wrong immediately. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] wecharyu opened a new pull request, #8219: [HUDI-5949] Check the write operation configured by user for better troubleshooting

2023-03-17 Thread via GitHub


wecharyu opened a new pull request, #8219:
URL: https://github.com/apache/hudi/pull/8219

   ### Change Logs
   Add an insert operation config check for better troubleshooting, 
especially for the case of upserting to a table without a preCombineKey.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5949) Check the write operation configured by user for better troubleshooting

2023-03-17 Thread Wechar (Jira)
Wechar created HUDI-5949:


 Summary: Check the write operation configured by user for better 
troubleshooting
 Key: HUDI-5949
 URL: https://issues.apache.org/jira/browse/HUDI-5949
 Project: Apache Hudi
  Issue Type: Improvement
  Components: configs
Reporter: Wechar


 *Background:*

We found that inserting data with Spark-Hudi can fail with a *HoodieException: (Part -) 
field not found in record. Acceptable fields were :[uuid, name, price]*
{code:bash}
  ..
at 
org.apache.hudi.index.simple.HoodieSimpleIndex.fetchRecordLocationsForAffectedPartitions(HoodieSimpleIndex.java:142)
at 
org.apache.hudi.index.simple.HoodieSimpleIndex.tagLocationInternal(HoodieSimpleIndex.java:113)
at 
org.apache.hudi.index.simple.HoodieSimpleIndex.tagLocation(HoodieSimpleIndex.java:91)
at 
org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:51)
at 
org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:34)
at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
... 52 more
Caused by: org.apache.hudi.exception.HoodieException: (Part -) field not found 
in record. Acceptable fields were :[uuid, name, price]
at 
org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:530)
at 
org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$11(HoodieSparkSqlWriter.scala:305)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1509)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit 
time 20230317222153522
at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)

{code}
{*}Steps to Reproduce{*}:
{code:sql}
-- 1. create a table without preCombineKey
CREATE TABLE default.test_hudi_default (
  uuid int,
  name string,
  price double
) USING hudi;

-- 2. config write operation to upsert
set hoodie.datasource.write.operation=upsert;

-- 3. insert data and exception occurs
insert into default.test_hudi_default select 1, 'name1', 1.1;
{code}

*Root Cause:*
Hudi does not support upsert for a table without a preCombineKey, but this 
exception message may confuse users.

*Improvement:*
We can check the user-configured write operation and provide a more specific 
exception message; it will help users understand what's wrong immediately. 
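The proposed fail-fast check can be sketched in plain Python (the function name and message are hypothetical; the real validation would live in the Spark write path):

```python
def validate_write_operation(operation, precombine_field):
    """Hypothetical pre-write validation illustrating the improvement:
    reject 'upsert' up front, with an actionable message, when the table
    has no preCombine field configured."""
    if operation == "upsert" and not precombine_field:
        raise ValueError(
            "Write operation 'upsert' requires a preCombine field; set "
            "hoodie.datasource.write.precombine.field or use 'insert'.")
```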



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] soumilshah1995 commented on issue #8207: [SUPPORT] Hudi 0.13 Consistent Hashing Issue for MOR Tables

2023-03-17 Thread via GitHub


soumilshah1995 commented on issue #8207:
URL: https://github.com/apache/hudi/issues/8207#issuecomment-1473939179

   Can you share code snippets? I am not sure I understand;
   I already have the UPSERT method.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8218: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8218:
URL: https://github.com/apache/hudi/pull/8218#issuecomment-1473885990

   
   ## CI report:
   
   * b2db2061014b8f7802e73876c7f96230333c2022 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15770)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8218: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8218:
URL: https://github.com/apache/hudi/pull/8218#issuecomment-1473874774

   
   ## CI report:
   
   * b2db2061014b8f7802e73876c7f96230333c2022 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #8218: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


danny0405 commented on code in PR #8218:
URL: https://github.com/apache/hudi/pull/8218#discussion_r1140250519


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionEvaluators.java:
##
@@ -255,9 +255,12 @@ public static NotEqualTo getInstance() {
 
 @Override
 protected boolean eval(@NotNull Object val, ColumnStats columnStats, 
LogicalType type) {
-  // because the bounds are not necessarily a min or max value, this 
cannot be answered using them.
-  // notEq(col, X) with (X, Y) doesn't guarantee that X is a value in col.
-  return true;

Review Comment:
Suggest keeping this comment and adding a specific explanation for the 
constant-column case
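The constant-column idea under discussion can be sketched in plain Python (a simplification of the Java evaluator, for illustration only):

```python
def not_equal_to_may_match(val, col_min, col_max, null_count):
    """Sketch: for notEq(col, X), general min/max bounds cannot rule a
    match out, so return True. But when the stats describe a constant,
    non-null column (min == max, nullCount == 0), every value equals
    col_min, and the predicate can be answered exactly."""
    if col_min == col_max and null_count == 0:
        return col_min != val
    return True  # bounds alone cannot prove there is no match
```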



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 closed pull request #8101: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


beyond1920 closed pull request #8101: [HUDI-5879] Extends evaluators to support 
evaluate based on column values
URL: https://github.com/apache/hudi/pull/8101


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 commented on a diff in pull request #8101: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


beyond1920 commented on code in PR #8101:
URL: https://github.com/apache/hudi/pull/8101#discussion_r1140231739


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionEvaluators.java:
##
@@ -156,12 +156,20 @@ public static Evaluator fromExpression(CallExpression 
expr) {
   public interface Evaluator extends Serializable {
 
 /**
- * Decides whether it's possible to match based on the column stats.
+ * Evaluates whether it's possible to match based on the column stats.
  *
  * @param columnStatsMap column statistics
- * @return
+ * @return false if it's not possible to match, true otherwise.
  */
 boolean eval(Map columnStatsMap);
+
+/**
+ * Evaluates whether it matches based on the column values.
+ *
+ * @param columnValues column values
+ * @return true if it matches, false otherwise.
+ */
+boolean eval(Object[] columnValues);
   }

Review Comment:
   Open a new [PR-8218](https://github.com/apache/hudi/pull/8218)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 commented on a diff in pull request #8101: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


beyond1920 commented on code in PR #8101:
URL: https://github.com/apache/hudi/pull/8101#discussion_r1140231739


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionEvaluators.java:
##
@@ -156,12 +156,20 @@ public static Evaluator fromExpression(CallExpression 
expr) {
   public interface Evaluator extends Serializable {
 
 /**
- * Decides whether it's possible to match based on the column stats.
+ * Evaluates whether it's possible to match based on the column stats.
  *
  * @param columnStatsMap column statistics
- * @return
+ * @return false if it's not possible to match, true otherwise.
  */
 boolean eval(Map columnStatsMap);
+
+/**
+ * Evaluates whether it matches based on the column values.
+ *
+ * @param columnValues column values
+ * @return true if it matches, false otherwise.
+ */
+boolean eval(Object[] columnValues);
   }

Review Comment:
   Open a new [PR](https://github.com/apache/hudi/pull/8218)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 opened a new pull request, #8218: [HUDI-5879] Extends evaluators to support evaluate based on column values

2023-03-17 Thread via GitHub


beyond1920 opened a new pull request, #8218:
URL: https://github.com/apache/hudi/pull/8218

   ### Change Logs
   This PR extends the evaluators to support evaluation based on column 
values. A column value can be wrapped in a `ColumnStats` object whose min 
equals its max and whose nullCnt is 0.
   
   ### Impact
   
   NA
   
   ### Risk level (write none, low medium or high below)
   
   NA
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] haripriyarhp opened a new issue, #8217: [SUPPORT] Async compaction & ingestion performance

2023-03-17 Thread via GitHub


haripriyarhp opened a new issue, #8217:
URL: https://github.com/apache/hudi/issues/8217

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I have a Spark Structured Streaming job writing to a MoR table in S3. Based 
on the docs at 
https://hudi.apache.org/docs/next/compaction#spark-structured-streaming , I 
have set the properties for async compaction. As per my understanding, 
compaction runs asynchronously so that ingestion also takes place in parallel. 
But what I observed is that Spark assigns all the resources to the compaction 
job and continues with the ingestion only after it has finished, even though 
both jobs are running in parallel. Is there something like 
"--delta-sync-scheduling-weight", "--compact-scheduling-weight", 
"--delta-sync-scheduling-minshare", and "--compact-scheduling-minshare" for 
Spark Structured Streaming, so that ingestion and compaction run in parallel 
with proper resource allocation?
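
   For reference, DeltaStreamer's scheduling flags map onto Spark's generic 
FAIR scheduler pools, which Structured Streaming can also use. A hypothetical 
`fairscheduler.xml` (pool names and weights below are illustrative assumptions, 
not Hudi-defined settings):

   ```xml
   <?xml version="1.0"?>
   <!-- Hypothetical fairscheduler.xml: give ingestion a larger share than
        compaction. Pool names and weights are illustrative only. -->
   <allocations>
     <pool name="hoodie-deltasync">
       <schedulingMode>FAIR</schedulingMode>
       <weight>3</weight>
       <minShare>2</minShare>
     </pool>
     <pool name="hoodie-compact">
       <schedulingMode>FAIR</schedulingMode>
       <weight>1</weight>
       <minShare>1</minShare>
     </pool>
   </allocations>
   ```

   It would be enabled with `--conf spark.scheduler.mode=FAIR --conf 
spark.scheduler.allocation.file=/path/to/fairscheduler.xml`. Whether Hudi's 
async compactor in Structured Streaming submits its jobs under a separate pool 
is version-dependent, so treat this as a starting point rather than a confirmed 
fix.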
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   I expect compaction and ingestion to happen in parallel.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.1.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : on spark operator on k8s
   
   Here is a screenshot of the micro batches for the Spark Structured 
Streaming job. Normally each batch takes 4-6 mins, but the batches immediately 
following compaction take noticeably longer. 
   
![async_compaction](https://user-images.githubusercontent.com/109664817/225903090-f2a8aaaf-aa34-49d6-aa59-c7127849832c.PNG)
   
   Time taken for a normal batch = 206
   
![sensor_morbloom_124_details](https://user-images.githubusercontent.com/109664817/225904984-b13864e8-ddb3-4bd9-8f14-7330d05956f6.PNG)
   
   Batch=207 
   Here you can see the last job 2961 is HoodieCompactionPlanGenerator which is 
adding 8.6 mins overhead to this batch.
   
![sensor_morbloom_125_details](https://user-images.githubusercontent.com/109664817/225903294-18ee13ad-bafc-4f57-a820-cf7f9eb6237a.PNG)
   
   Batch=208
   Here you can see it starts with job 2963, but jobs 2964-2972 are for 
compaction, and it continues with 2973 only after the compaction jobs are 
finished, even though compaction runs on a different thread (see the picture 
below). The time taken for stages like "Load base files", "Building workload 
profile", and "Getting small files" has drastically increased. Refer to batch 
206 for the normal times. 
   
![sensor_morbloom_126_details](https://user-images.githubusercontent.com/109664817/225903283-db13427d-3338-44e0-93a2-e614451b6dd7.PNG)
   
   compaction jobs (2964-2972)
   
![sensor_morbloom_async_compaction](https://user-images.githubusercontent.com/109664817/225909669-6a356796-4a5e-459c-a001-e20f30caeac2.PNG)
   
   Is there something that can be done to improve this performance? 
   
   **Additional context**
   
   using hudi-spark3.1-bundle_2.12-0.13.0.jar along with hadoop-aws_3.1.2.jar 
and aws-java-sdk-bundle_1.11.271.jar. 
   
   job configs
   
"hoodie.table.name" -> tableName,
 "path" -> "s3a://path/Hudi/".concat(tableName),
 "hoodie.datasource.write.table.name" -> tableName,
 "hoodie.datasource.write.table.type" -> MERGE_ON_READ,
 "hoodie.datasource.write.operation" -> "upsert",
 "hoodie.datasource.write.recordkey.field" -> "col5,col6,col7",
 "hoodie.datasource.write.partitionpath.field" -> "col1,col2,col3,col4",
 "hoodie.datasource.write.keygenerator.class" -> 
"org.apache.hudi.keygen.ComplexKeyGenerator",
 "hoodie.datasource.write.hive_style_partitioning" -> "true",
 //Cleaning options
 "hoodie.clean.automatic" -> "true",
 "hoodie.clean.max.commits" -> "3",
 //"hoodie.clean.async" -> "true",
 //hive_sync_options
 "hoodie.datasource.hive_sync.partition_fields" -> 
"col1,col2,col3,col4",
 "hoodie.datasource.hive_sync.database" -> dbName,
 "hoodie.datasource.hive_sync.table" -> tableName,
 "hoodie.datasource.hive_sync.enable" -> "true",
 "hoodie.datasource.hive_sync.mode" -> "hms",
 "hoodie.datasource.hive_sync.partition_extractor_class" -> 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
 "hoodie.upsert.shuffle.parallelism" -> "200",
 "hoodie.insert.shuffle.parallelism" -> "200",
 "hoodie.datasource.compaction.async.enable" -> true, 
 "hoodie.compact.inline.max.delta.commits" -> "10",
 "hoodie.index.type" -> "BLOOM"
  
 

[GitHub] [hudi] flashJd commented on pull request #6643: [HUDI-4823]Add read_optimize spark_session config to use in spark-sql

2023-03-17 Thread via GitHub


flashJd commented on PR #6643:
URL: https://github.com/apache/hudi/pull/6643#issuecomment-1473771328

   > @flashJd : Can you look at the failing testcase
   
   I'll look at it soon





[GitHub] [hudi] hudi-bot commented on pull request #8214: [HUDI-5948] Apply maven CI friendly version

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8214:
URL: https://github.com/apache/hudi/pull/8214#issuecomment-1473636267

   
   ## CI report:
   
   * 08b4ead8bfb527d465670fa6c6a0439077753238 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15769)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hbgstc123 commented on a diff in pull request #3769: [HUDI-2005] Fixing partition path creation in AbstractTableFileSystemView

2023-03-17 Thread via GitHub


hbgstc123 commented on code in PR #3769:
URL: https://github.com/apache/hudi/pull/3769#discussion_r1140045032


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -317,7 +315,13 @@ private void ensurePartitionLoadedCorrectly(String partition) {
* @throws IOException
*/
   protected FileStatus[] listPartition(Path partitionPath) throws IOException {
-return metaClient.getFs().listStatus(partitionPath);
+// Create the path if it does not exist already
+if (!metaClient.getFs().exists(partitionPath)) {
+  metaClient.getFs().mkdirs(partitionPath);

Review Comment:
   @nsivabalan hi, may I ask why we need to create the dir when it does not 
exist? This behavior causes a problem after dropping a partition: the clean 
action will delete the dir, but the dropped partition may be listed again, 
which results in an empty dir. Can we just `return new FileStatus[0];` here 
without creating a new dir?
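
   The suggested alternative, sketched with plain `java.nio` stand-ins (Hudi's 
`FileSystem`/`FileStatus` types are replaced by `Files`/`Path` so the example 
is self-contained; this is not the actual Hudi code):

   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class ListOrEmpty {
     // Return an empty listing for a missing partition directory instead of
     // creating it -- the analogue of `return new FileStatus[0];` above.
     static Path[] listPartition(Path partitionPath) throws IOException {
       if (!Files.exists(partitionPath)) {
         return new Path[0]; // no mkdirs: a dropped partition stays dropped
       }
       try (var entries = Files.list(partitionPath)) {
         return entries.toArray(Path[]::new);
       }
     }

     public static void main(String[] args) throws IOException {
       System.out.println(listPartition(Path.of("no-such-partition")).length); // prints "0"
     }
   }
   ```

   The trade-off is that callers relying on the partition path existing after a 
listing would need to handle the empty result themselves.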






[GitHub] [hudi] gamblewin opened a new issue, #8216: [SUPPORT]Compile Error of hudi-0.13.0 "org.apache.hadoop.hive.common.type can not be converted to java.sql.Date"

2023-03-17 Thread via GitHub


gamblewin opened a new issue, #8216:
URL: https://github.com/apache/hudi/issues/8216

   When compiling the source code of hudi-0.13.0, the build reports a compile error:
   "HiveAvroSerializer.java[302, 93]: org.apache.hadoop.hive.common.type can 
not be converted to java.sql.Date"
   
   * Hudi version :0.13.0
   * Hive version :3.1.3
   * Hadoop version :3.0.0
   * Running on Docker? (yes/no) :no





[GitHub] [hudi] hudi-bot commented on pull request #8214: [HUDI-5948] Apply maven CI friendly version

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8214:
URL: https://github.com/apache/hudi/pull/8214#issuecomment-1473578468

   
   ## CI report:
   
   * 08b4ead8bfb527d465670fa6c6a0439077753238 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15769)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] LiJie20190102 opened a new issue, #8215: [SUPPORT]

2023-03-17 Thread via GitHub


LiJie20190102 opened a new issue, #8215:
URL: https://github.com/apache/hudi/issues/8215

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I use HoodieDeltaStreamer to continuously receive data from kafka and 
synchronize it to hive, but when I use the spark-shell to query table data, I 
find that the table data has not changed.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Prepare Kafka's topic and continuously produce data
   2. Use hudi-utilities-bundle_2.12-0.13.0.jar to continuously receive data 
from Kafka and synchronize it to Hive
   3.  Using spark-shell queries: spark.sql("select * from 
test_aa").show(10,false)
   
   **Expected behavior**
   In the spark-shell, each time "select * from test_aa" is executed, the 
latest data in the table should be returned
   
   
   **Environment Description**
   
   * Hudi version : 2.12-0.13.0
   
   * Spark version : 3.2.3
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.4
   
   * Storage (HDFS/S3/GCS..) : hdfs
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   





[GitHub] [hudi] hudi-bot commented on pull request #8214: [HUDI-5948] Apply maven CI friendly version

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8214:
URL: https://github.com/apache/hudi/pull/8214#issuecomment-1473554736

   
   ## CI report:
   
   * 08b4ead8bfb527d465670fa6c6a0439077753238 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-5948) Apply maven CI friendly version

2023-03-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5948:
-
Labels: pull-request-available  (was: )

> Apply maven CI friendly version
> ---
>
> Key: HUDI-5948
> URL: https://issues.apache.org/jira/browse/HUDI-5948
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>
> Apply maven CI friendly version to simplify the version management.
> After applying it, all module versions can be modified with a single code change.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] YannByron opened a new pull request, #8214: [HUDI-5948] Apply maven CI friendly version

2023-03-17 Thread via GitHub


YannByron opened a new pull request, #8214:
URL: https://github.com/apache/hudi/pull/8214

   ### Change Logs
   
   Apply maven CI friendly version to simplify the version management.
   After applying it, all module versions can be modified with a single code 
change, and Maven 3.5.0+ is required.
   
   See the details: https://maven.apache.org/maven-ci-friendly.html
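
   For context, the pattern from the linked Maven guide looks roughly like 
this (the version value below is illustrative, not the actual Hudi pom):

   ```xml
   <!-- Parent pom.xml: the single ${revision} property replaces the
        hard-coded version in every module. -->
   <project>
     <modelVersion>4.0.0</modelVersion>
     <groupId>org.apache.hudi</groupId>
     <artifactId>hudi</artifactId>
     <version>${revision}</version>
     <packaging>pom</packaging>
     <properties>
       <revision>1.0.0-SNAPSHOT</revision>
     </properties>
   </project>

   <!-- Child module: references the parent via ${revision} and inherits it. -->
   <parent>
     <groupId>org.apache.hudi</groupId>
     <artifactId>hudi</artifactId>
     <version>${revision}</version>
   </parent>
   ```

   A release build can then override the version on the command line, e.g. 
`mvn -Drevision=1.0.0 clean install`; the `flatten-maven-plugin` is typically 
added so the installed poms carry the resolved version instead of the literal 
`${revision}` placeholder.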
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   --
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #8202: [MINOR] Fix typo in ValidateMetadataTableFilesProcedure

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8202:
URL: https://github.com/apache/hudi/pull/8202#issuecomment-1473523636

   
   ## CI report:
   
   * d0f787f7cca4909e4fff3b8690bf6488355e059d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15750)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15768)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Assigned] (HUDI-5948) Apply maven CI friendly version

2023-03-17 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-5948:


Assignee: Yann Byron

> Apply maven CI friendly version
> ---
>
> Key: HUDI-5948
> URL: https://issues.apache.org/jira/browse/HUDI-5948
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>
> Apply maven CI friendly version to simplify the version management.
> After applying it, all module versions can be modified with a single code change.
>  





[jira] [Created] (HUDI-5948) Apply maven CI friendly version

2023-03-17 Thread Yann Byron (Jira)
Yann Byron created HUDI-5948:


 Summary: Apply maven CI friendly version
 Key: HUDI-5948
 URL: https://issues.apache.org/jira/browse/HUDI-5948
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron


Apply maven CI friendly version to simplify the version management.

After applying it, all module versions can be modified with a single code change.
 





[GitHub] [hudi] maheshguptags commented on issue #8195: [SUPPORT] Clustering is not happening on Flink Hudi

2023-03-17 Thread via GitHub


maheshguptags commented on issue #8195:
URL: https://github.com/apache/hudi/issues/8195#issuecomment-1473411812

   @danny0405 with your suggestion it is not working with MOR and COW in 
insert mode. Could you please share the support matrix for compaction and 
clustering?





[GitHub] [hudi] haripriyarhp opened a new issue, #8213: [SUPPORT] Error while setting OCC in spark structured streaming

2023-03-17 Thread via GitHub


haripriyarhp opened a new issue, #8213:
URL: https://github.com/apache/hudi/issues/8213

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I have a spark structured streaming job reading from kafka and writing to 
MoR in S3. I am using hudi-spark3.1-bundle_2.12-0.13.0.jar along with 
aws-java-sdk-bundle_1.11.271.jar & hadoop-aws_3.1.2.jar. Based on the article 
here https://hudi.apache.org/docs/next/concurrency_control , I added the below 
configs to my job 
   "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
   "hoodie.cleaner.policy.failed.writes" -> "LAZY",
   "hoodie.write.lock.provider" -> 
"org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
   "hoodie.write.lock.dynamodb.table" -> "hudi_partitions"
   but then it gave the error 
   ERROR SensorStreaming$: No enum constant 
org.apache.hudi.common.model.WriteConcurrencyMode.optimistic_concurrency_control
   
   full configs are 
 "hoodie.table.name" -> tableName,
 "path" -> "s3a://path/Hudi/".concat(tableName),
 "hoodie.datasource.write.table.name" -> tableName,
 "hoodie.datasource.write.table.type" -> MERGE_ON_READ,
 "hoodie.datasource.write.operation" -> "upsert",
 "hoodie.datasource.write.recordkey.field" -> "col5,col6,col7",
 "hoodie.datasource.write.partitionpath.field" -> "col1,col2,col3,col4",
 "hoodie.datasource.write.keygenerator.class" -> 
"org.apache.hudi.keygen.ComplexKeyGenerator",
 "hoodie.datasource.write.hive_style_partitioning" -> "true",
 //Cleaning options
 "hoodie.clean.automatic" -> "true",
 "hoodie.clean.max.commits" -> "3",
 //"hoodie.clean.async" -> "true",
 //hive_sync_options
 "hoodie.datasource.hive_sync.partition_fields" -> 
"col1,col2,col3,col4",
 "hoodie.datasource.hive_sync.database" -> dbName,
 "hoodie.datasource.hive_sync.table" -> tableName,
 "hoodie.datasource.hive_sync.enable" -> "true",
 "hoodie.datasource.hive_sync.mode" -> "hms",
 "hoodie.datasource.hive_sync.partition_extractor_class" -> 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
 "hoodie.upsert.shuffle.parallelism" -> "200",
 "hoodie.insert.shuffle.parallelism" -> "200",
 "hoodie.datasource.compaction.async.enable" -> true, 
 "hoodie.compact.inline.max.delta.commits" -> "10",
 "hoodie.index.type" -> "BLOOM"
 //"hoodie.metadata.index.bloom.filter.enable" -> "true",
 //"hoodie.metadata.index.column.stats.enable" -> "true",
 //"hoodie.enable.data.skipping" -> "true"
 //"hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
 //"hoodie.cleaner.policy.failed.writes" -> "LAZY",
 //"hoodie.write.lock.provider" -> 
"org.apache.hudi.client.transaction.lock.DynamoDBBasedLockProvider",
 //"hoodie.write.lock.dynamodb.table" -> "hudi_partitions"
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.1.2
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : spark operator on kubernetes
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```ERROR SensorStreaming$: No enum constant 
org.apache.hudi.common.model.WriteConcurrencyMode.optimistic_concurrency_control```
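
   The stacktrace points at a case mismatch: Java's `Enum.valueOf` is 
case-sensitive, so if this code path resolves the config value with `valueOf`, 
the lowercase setting fails while the upper-case form would parse. A 
self-contained sketch of the failure mode (the constant names below are 
inferred from the error text, not verified against the Hudi source):

   ```java
   public class EnumCaseDemo {
     // Stand-in for org.apache.hudi.common.model.WriteConcurrencyMode;
     // constant names are inferred from the reported error message.
     enum WriteConcurrencyMode { SINGLE_WRITER, OPTIMISTIC_CONCURRENCY_CONTROL }

     public static void main(String[] args) {
       try {
         // Enum.valueOf matches constant names exactly, so lowercase fails:
         WriteConcurrencyMode.valueOf("optimistic_concurrency_control");
       } catch (IllegalArgumentException e) {
         System.out.println("lowercase value rejected");
       }
       // Upper-casing the configured value parses cleanly:
       System.out.println(WriteConcurrencyMode.valueOf("OPTIMISTIC_CONCURRENCY_CONTROL"));
     }
   }
   ```

   If this is the cause, configuring `"hoodie.write.concurrency.mode" -> 
"OPTIMISTIC_CONCURRENCY_CONTROL"` may work around it.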
   
   





[GitHub] [hudi] danny0405 commented on issue #8164: [SUPPORT] Flink KryoSerializer cannot be cast to class org.apache.flink.table.runtime.typeutils.AbstractRowDataSerializer

2023-03-17 Thread via GitHub


danny0405 commented on issue #8164:
URL: https://github.com/apache/hudi/issues/8164#issuecomment-1473313206

   
   1. Did you enable checkpointing yet? The Flink sink relies on the 
checkpoint success event for Hudi transaction committing;
   2. Both bulk_insert and append_write use the `BulkInsertWriterHelper` to 
write the parquet files directly; there are no UPSERTs. If `FLINK_STATE` is 
used, things are very different: the `StreamWriteFunction` would kick in;
   3. You can just set up the compression options within the Flink SQL 
options, or the `HoodiePipeline#options` you have used:
   
   e.g.
   
   ```sql
   create table xxx(
   ) with (
 'connector' = 'hudi',
 'hoodie.parquet.compression.codec' = 'gzip'
   );
   ```
   
   ```java
   HoodiePipeline.builder("xxx")
   .option("hoodie.parquet.compression.codec", "gzip")
   ```
   
   The default codec is already gzip, which is probably why you do not 
perceive any difference.





[GitHub] [hudi] hudi-bot commented on pull request #8202: [MINOR] Fix typo in ValidateMetadataTableFilesProcedure

2023-03-17 Thread via GitHub


hudi-bot commented on PR #8202:
URL: https://github.com/apache/hudi/pull/8202#issuecomment-1473297291

   
   ## CI report:
   
   * d0f787f7cca4909e4fff3b8690bf6488355e059d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15750)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15768)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 closed pull request #6917: [HUDI-5005] Flink stream write reuse abort instant will lead to coordinator delete file not right.

2023-03-17 Thread via GitHub


danny0405 closed pull request #6917: [HUDI-5005] Flink stream write reuse abort 
instant will lead to coordinator delete file not right.
URL: https://github.com/apache/hudi/pull/6917




