[jira] [Commented] (HUDI-4632) Remove the force active property for flink1.14 profile

2022-08-18 Thread Sagar Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581675#comment-17581675
 ] 

Sagar Sumit commented on HUDI-4632:
---

Fix under review: https://github.com/apache/hudi/pull/6437

> Remove the force active property for flink1.14 profile
> --
>
> Key: HUDI-4632
> URL: https://issues.apache.org/jira/browse/HUDI-4632
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.1
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HUDI-4632) Remove the force active property for flink1.14 profile

2022-08-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reopened HUDI-4632:
---

Reopening due to [https://github.com/apache/hudi/issues/6422].

The build fails for the default profile after removing the force activation
property for the flink1.14 profile.

> Remove the force active property for flink1.14 profile
> --
>
> Key: HUDI-4632
> URL: https://issues.apache.org/jira/browse/HUDI-4632
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.1
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6443: [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set

2022-08-18 Thread GitBox


hudi-bot commented on PR #6443:
URL: https://github.com/apache/hudi/pull/6443#issuecomment-1220256737

   
   ## CI report:
   
   * 8e5db16fd43d95d376bd58c11568bd3c70b47acf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10817)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table

2022-08-18 Thread GitBox


hudi-bot commented on PR #6419:
URL: https://github.com/apache/hudi/pull/6419#issuecomment-1220256551

   
   ## CI report:
   
   * 8b8e05269c88a057840b743cc60f9163b2348db2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10815)
 
   * 2f1e53d384281d4729df2367c32b8f22d7e67c87 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10823)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table

2022-08-18 Thread GitBox


hudi-bot commented on PR #6419:
URL: https://github.com/apache/hudi/pull/6419#issuecomment-1220251702

   
   ## CI report:
   
   * 8b8e05269c88a057840b743cc60f9163b2348db2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10815)
 
   * 2f1e53d384281d4729df2367c32b8f22d7e67c87 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6446: [HUDI-4645] Flink batch incremental changelog query

2022-08-18 Thread GitBox


hudi-bot commented on PR #6446:
URL: https://github.com/apache/hudi/pull/6446#issuecomment-1220249418

   
   ## CI report:
   
   * 579b8e20441d54b2eb672218f8e7dbd1d0a1ebaf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10820)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6442: [HUDI-4449] Support DataSourceV2 Read for Spark3.2

2022-08-18 Thread GitBox


hudi-bot commented on PR #6442:
URL: https://github.com/apache/hudi/pull/6442#issuecomment-1220249387

   
   ## CI report:
   
   * ecc3e06795ac1f4c6439abb9fa53b666d2f5a8eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10816)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5875: [SUPPORT] Hoodie Delta streamer Job with Kafka Source fetching the same offset again and again Commiting the same offset again and again

2022-08-18 Thread GitBox


nsivabalan commented on issue #5875:
URL: https://github.com/apache/hudi/issues/5875#issuecomment-1220238494

   As I pointed out in the other GitHub issue, if the Kafka checkpoint type is 
timestamp, we don't support auto-resetting to LATEST yet.
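
   For context, a hedged sketch of the relevant DeltaStreamer source properties (key names as I understand Hudi's Kafka source; verify them against your version):

   ```java
   import java.util.Properties;

   public class KafkaCheckpointPropsSketch {
     public static Properties build() {
       Properties props = new Properties();
       // "timestamp" checkpoints are interpreted as a point in time, not an
       // offset; per the comment above, auto-resetting to LATEST is not
       // applied for this checkpoint type.
       props.setProperty("hoodie.deltastreamer.source.kafka.checkpoint.type", "timestamp");
       // Kafka's own reset policy, which does apply to plain offset checkpoints.
       props.setProperty("auto.offset.reset", "latest");
       return props;
     }
   }
   ```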


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5822: Hudi Clustering not working

2022-08-18 Thread GitBox


nsivabalan commented on issue #5822:
URL: https://github.com/apache/hudi/issues/5822#issuecomment-1220237957

   Thanks! Curious to know what was missing or how you resolved it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan closed issue #5822: Hudi Clustering not working

2022-08-18 Thread GitBox


nsivabalan closed issue #5822: Hudi Clustering not working
URL: https://github.com/apache/hudi/issues/5822


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6431: Shutdown CloudWatch reporter when query completes

2022-08-18 Thread GitBox


hudi-bot commented on PR #6431:
URL: https://github.com/apache/hudi/pull/6431#issuecomment-1220222643

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 435e3afbae2f62db1e661ad46aabb50df421049c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10819)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6446: [HUDI-4645] Flink batch incremental changelog query

2022-08-18 Thread GitBox


hudi-bot commented on PR #6446:
URL: https://github.com/apache/hudi/pull/6446#issuecomment-1220222675

   
   ## CI report:
   
   * 579b8e20441d54b2eb672218f8e7dbd1d0a1ebaf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4645) Flink batch incremental changelog query

2022-08-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4645:
-
Labels: pull-request-available  (was: )

> Flink batch incremental changelog query 
> 
>
> Key: HUDI-4645
> URL: https://issues.apache.org/jira/browse/HUDI-4645
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: yonghua jian
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] flashJd opened a new pull request, #6446: [HUDI-4645] Flink batch incremental changelog query

2022-08-18 Thread GitBox


flashJd opened a new pull request, #6446:
URL: https://github.com/apache/hudi/pull/6446

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6431: Shutdown CloudWatch reporter when query completes

2022-08-18 Thread GitBox


hudi-bot commented on PR #6431:
URL: https://github.com/apache/hudi/pull/6431#issuecomment-1220221073

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 435e3afbae2f62db1e661ad46aabb50df421049c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4645) Flink batch incremental changelog query

2022-08-18 Thread yonghua jian (Jira)
yonghua jian created HUDI-4645:
--

 Summary: Flink batch incremental changelog query 
 Key: HUDI-4645
 URL: https://issues.apache.org/jira/browse/HUDI-4645
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: yonghua jian






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6445: [HUDI-4644] Change default flink profile to 1.15.x

2022-08-18 Thread GitBox


hudi-bot commented on PR #6445:
URL: https://github.com/apache/hudi/pull/6445#issuecomment-1220221113

   
   ## CI report:
   
   * 4251b68ce29a78a88f5355fb106bf5ca9444fec1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10818)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] junyuc25 commented on pull request #6431: Shutdown CloudWatch reporter when query completes

2022-08-18 Thread GitBox


junyuc25 commented on PR #6431:
URL: https://github.com/apache/hudi/pull/6431#issuecomment-1220219273

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6445: [HUDI-4644] Change default flink profile to 1.15.x

2022-08-18 Thread GitBox


hudi-bot commented on PR #6445:
URL: https://github.com/apache/hudi/pull/6445#issuecomment-1220219205

   
   ## CI report:
   
   * 4251b68ce29a78a88f5355fb106bf5ca9444fec1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6443: [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set

2022-08-18 Thread GitBox


hudi-bot commented on PR #6443:
URL: https://github.com/apache/hudi/pull/6443#issuecomment-1220217068

   
   ## CI report:
   
   * 8e5db16fd43d95d376bd58c11568bd3c70b47acf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10817)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6431: Shutdown CloudWatch reporter when query completes

2022-08-18 Thread GitBox


hudi-bot commented on PR #6431:
URL: https://github.com/apache/hudi/pull/6431#issuecomment-1220217036

   
   ## CI report:
   
   * 0e37748dafc19e6764a2cad903d8938ac3266961 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10814)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table

2022-08-18 Thread GitBox


hudi-bot commented on PR #6419:
URL: https://github.com/apache/hudi/pull/6419#issuecomment-1220217014

   
   ## CI report:
   
   * 8b8e05269c88a057840b743cc60f9163b2348db2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10815)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amit-ranjan-de commented on issue #5916: [SUPPORT] `show fsview latest` throwing IllegalStateException...pending compactions for merge_on_read table

2022-08-18 Thread GitBox


amit-ranjan-de commented on issue #5916:
URL: https://github.com/apache/hudi/issues/5916#issuecomment-1220212653

   Hi @minihippo, in my case I don't have the `20220615161228.compaction` or 
`20220615172826.compaction` files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #6437: [HUDI-4632] Map Flink artifact ids correctly as per scala dependence

2022-08-18 Thread GitBox


danny0405 commented on PR #6437:
URL: https://github.com/apache/hudi/pull/6437#issuecomment-1220208442

   Thanks, I have filed a fix in https://github.com/apache/hudi/pull/6445/files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4644) Change default flink profile to 1.15.x

2022-08-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4644:
-
Labels: pull-request-available  (was: )

> Change default flink profile to 1.15.x
> --
>
> Key: HUDI-4644
> URL: https://issues.apache.org/jira/browse/HUDI-4644
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 opened a new pull request, #6445: [HUDI-4644] Change default flink profile to 1.15.x

2022-08-18 Thread GitBox


danny0405 opened a new pull request, #6445:
URL: https://github.com/apache/hudi/pull/6445

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4644) Change default flink profile to 1.15.x

2022-08-18 Thread Danny Chen (Jira)
Danny Chen created HUDI-4644:


 Summary: Change default flink profile to 1.15.x
 Key: HUDI-4644
 URL: https://issues.apache.org/jira/browse/HUDI-4644
 Project: Apache Hudi
  Issue Type: Task
  Components: flink
Reporter: Danny Chen






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] novisfff opened a new issue, #6444: [SUPPORT] Timeline Service MarkerDirState thread safe issue

2022-08-18 Thread GitBox


novisfff opened a new issue, #6444:
URL: https://github.com/apache/hudi/issues/6444

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A `getCreateAndMergeMarkers` request may throw a `ConcurrentModificationException`.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ``` 
   2022-08-18 07:34:17,809 ERROR 
org.apache.hudi.timeline.service.RequestHandler  [] - Got runtime 
exception servicing request 
markerdirpath=oss%3A%2F%2Fxxx%2F.hoodie%2F.temp%2F20220818073110311
   java.util.ConcurrentModificationException: null
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1561) 
~[?:1.8.0_302]
at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) 
~[?:1.8.0_302]
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) 
~[?:1.8.0_302]
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) 
~[?:1.8.0_302]
at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 
~[?:1.8.0_302]
at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) 
~[?:1.8.0_302]
at 
org.apache.hudi.timeline.service.handlers.MarkerHandler.getCreateAndMergeMarkers(MarkerHandler.java:129)
 ~[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.timeline.service.RequestHandler.lambda$registerMarkerAPI$25(RequestHandler.java:454)
 ~[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:517)
 ~[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
io.javalin.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:22) 
~[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at io.javalin.Javalin.lambda$addHandler$0(Javalin.java:606) 
~[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:46) 
~[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
io.javalin.core.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:17) 
~[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
io.javalin.core.JavalinServlet$service$1.invoke(JavalinServlet.kt:143) 
[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
io.javalin.core.JavalinServlet$service$2.invoke(JavalinServlet.kt:41) 
[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at io.javalin.core.JavalinServlet.service(JavalinServlet.kt:107) 
[hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
io.javalin.core.util.JettyServerUtil$initialize$httpHandler$1.doHandle(JettyServerUtil.kt:72)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1668)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.server.handler.HandlerList.handle(HandlerList.java:61)
 [hudi-flink1.13-bundle_2.11-0.12.0-SNAPSHOT.jar:0.12.0-SNAPSHOT]
at 
org.apache.hudi.org.apache.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
   ```
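
   For illustration, a minimal single-threaded sketch (not Hudi code; all names are hypothetical) that triggers the same fail-fast behavior as the stack trace above: a `HashMap` is mutated while a stream over its key set is being consumed.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class MarkerMapCmeDemo {
     public static void main(String[] args) {
       Map<String, Boolean> markers = new HashMap<>();
       for (int i = 0; i < 1024; i++) {
         markers.put("marker-" + i, Boolean.TRUE);
       }
       // HashMap's spliterator is fail-fast: mutating the map while the stream
       // traverses the key set throws ConcurrentModificationException, just as
       // a concurrent marker-creation request does in the timeline server.
       markers.keySet().stream().forEach(k -> markers.put(k + "-copy", Boolean.TRUE));
     }
   }
   ```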

[jira] [Closed] (HUDI-4013) Document all manual tests done as part of 0.11 release certification

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-4013.

Resolution: Won't Do

> Document all manual tests done as part of 0.11 release certification 
> -
>
> Key: HUDI-4013
> URL: https://issues.apache.org/jira/browse/HUDI-4013
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4013) Document all manual tests done as part of 0.11 release certification

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4013:
-
Fix Version/s: (was: 0.12.1)

> Document all manual tests done as part of 0.11 release certification 
> -
>
> Key: HUDI-4013
> URL: https://issues.apache.org/jira/browse/HUDI-4013
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4272) Auto detect partitions without PartitionValueExtractor

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-4272.

Resolution: Won't Do

Abandoned due to a change of plan: using config inference instead.

> Auto detect partitions without PartitionValueExtractor
> --
>
> Key: HUDI-4272
> URL: https://issues.apache.org/jira/browse/HUDI-4272
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Priority: Major
>
> Deprecate all PartitionValueExtractor; users do not need to configure it for 
> meta sync.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4013) Document all manual tests done as part of 0.11 release certification

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4013:
-
Issue Type: Task  (was: Improvement)

> Document all manual tests done as part of 0.11 release certification 
> -
>
> Key: HUDI-4013
> URL: https://issues.apache.org/jira/browse/HUDI-4013
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4012) Evaluate support for spark2 and scala12

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4012:
-
Issue Type: Task  (was: Improvement)

> Evaluate support for spark2 and scala12
> ---
>
> Key: HUDI-4012
> URL: https://issues.apache.org/jira/browse/HUDI-4012
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> Evaluate support for spark2 and scala12, and maybe deprecate their usage going
> forward.
>  
> Check stats from Nexus.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] dongkelun commented on issue #6410: [SUPPORT] MergeInto syntax WHEN MATCHED is not optional but must be set.

2022-08-18 Thread GitBox


dongkelun commented on issue #6410:
URL: https://github.com/apache/hudi/issues/6410#issuecomment-1220186095

   I created a PR to try to solve this problem: https://github.com/apache/hudi/pull/6443


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4012) Evaluate support for spark2 and scala12

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4012:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Evaluate support for spark2 and scala12
> ---
>
> Key: HUDI-4012
> URL: https://issues.apache.org/jira/browse/HUDI-4012
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> Evaluate support for spark2 and scala12, and maybe deprecate their usage going
> forward.
>  
> Check stats from Nexus.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4272) Auto detect partitions without PartitionValueExtractor

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4272:
-
Fix Version/s: (was: 0.12.1)

> Auto detect partitions without PartitionValueExtractor
> --
>
> Key: HUDI-4272
> URL: https://issues.apache.org/jira/browse/HUDI-4272
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Priority: Major
>
> Deprecate all PartitionValueExtractor; users do not need to configure it for 
> meta sync.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6443: [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set

2022-08-18 Thread GitBox


hudi-bot commented on PR #6443:
URL: https://github.com/apache/hudi/pull/6443#issuecomment-1220185753

   
   ## CI report:
   
   * 8e5db16fd43d95d376bd58c11568bd3c70b47acf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4486) validate the coming configs and table name when create an external table

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4486:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> validate the coming configs and table name when create an external table
> 
>
> Key: HUDI-4486
> URL: https://issues.apache.org/jira/browse/HUDI-4486
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4457) Make sure IT docker test return code non-zero when failed

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4457:
-
Priority: Critical  (was: Major)

> Make sure IT docker test return code non-zero when failed
> -
>
> Key: HUDI-4457
> URL: https://issues.apache.org/jira/browse/HUDI-4457
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Raymond Xu
>Priority: Critical
> Fix For: 0.12.1
>
>
> An IT test case where the docker command runs and returns exit code 0, but
> the test actually failed. This is misleading for troubleshooting.
> TODO
> 1. verify the behavior
> 2. fix it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3787) Implement redis index

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3787:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Implement redis index
> -
>
> Key: HUDI-3787
> URL: https://issues.apache.org/jira/browse/HUDI-3787
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> * _Redis supports standalone and cluster mode_
>  * _Support Flink and Spark to operate the index_



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4483) Fix checkstyle on scala code and integ-test module

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4483:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Fix checkstyle on scala code and integ-test module
> --
>
> Key: HUDI-4483
> URL: https://issues.apache.org/jira/browse/HUDI-4483
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.13.0
>
>
> Checkstyle does not work on Scala code (see HUDI-4482), nor on the
> integration test module; e.g., GenericRecordFullPayloadGenerator.java still has
> import com.google.common.annotations.VisibleForTesting;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3545) Make HoodieAvroWriteSupport class configurable

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3545:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Make HoodieAvroWriteSupport class configurable
> --
>
> Key: HUDI-3545
> URL: https://issues.apache.org/jira/browse/HUDI-3545
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Major
> Fix For: 0.13.0
>
>
> Make the HoodieAvroWriteSupport class configurable so that it can be
> overridden by custom write support classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4643) MergeInto syntax WHEN MATCHED is optional but must be set

2022-08-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4643:
-
Labels: pull-request-available  (was: )

> MergeInto syntax WHEN MATCHED is optional but must be set
> -
>
> Key: HUDI-4643
> URL: https://issues.apache.org/jira/browse/HUDI-4643
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
>  
> {code:java}
> spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long,
> | dt string
> |) using hudi
> | location '${tmp.getCanonicalPath.replaceAll("", "/")}/$tableName'
> | tblproperties (
> | primaryKey ='id',
> | preCombineField = 'ts'
> | )
> """.stripMargin)
> // Insert data
> spark.sql(s"insert into $tableName select 1, 'a1', 1, 10, '2022-08-18'")
> spark.sql(
> s"""
> | merge into $tableName as t0
> | using (
> | select 1 as id, 'a1' as name, 11 as price, 110 as ts, '2022-08-19' as dt 
> union all
> | select 2 as id, 'a2' as name, 10 as price, 100 as ts, '2022-08-18' as dt
> | ) as s0
> | on t0.id = s0.id
> | when not matched then insert *
> """.stripMargin
> )
> {code}
>  
> {code:java}
> 11493 [Executor task launch worker for task 65] ERROR 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor  - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.AssertionError: assertion 
> failed: hoodie.payload.update.condition.assignments have not set
>     at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:335)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:246)
>  
> {code}
>  
>  
> If hoodie.merge.allow.duplicate.on.inserts is set to true, the result is one
> more record than expected:
> [1,a1,1.0,10,2022-08-18], [1,a1,11.0,110,2022-08-19], 
> [2,a2,10.0,100,2022-08-18]
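
Until the fix lands, a possible workaround, shown as a sketch (it assumes the Hudi Spark SQL extensions are enabled, and the helper and table name are hypothetical), is to state the WHEN MATCHED clause explicitly so the update assignments are generated:

{code:java}
import org.apache.spark.sql.SparkSession;

public class MergeIntoWorkaroundSketch {
  // Mirrors the merge from the report, with an explicit WHEN MATCHED clause.
  public static void run(SparkSession spark, String tableName) {
    spark.sql(
        "merge into " + tableName + " as t0 "
      + "using ( "
      + "  select 1 as id, 'a1' as name, 11 as price, 110 as ts, '2022-08-19' as dt union all "
      + "  select 2 as id, 'a2' as name, 10 as price, 100 as ts, '2022-08-18' as dt "
      + ") as s0 "
      + "on t0.id = s0.id "
      // The explicit clause supplies the update assignments the executor asserts on.
      + "when matched then update set * "
      + "when not matched then insert *");
  }
}
{code}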



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3554) Support Custom Naming of RO and RT tables for Hudi MOR table

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3554:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Support Custom Naming of RO and RT tables for Hudi MOR table
> 
>
> Key: HUDI-3554
> URL: https://issues.apache.org/jira/browse/HUDI-3554
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: Jian Feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, the default Hudi MOR RO and RT tables use the _ro and _rt
> suffixes, which may conflict with some companies' or organizations' naming
> convention rules.
> So, I think it is better to provide a way to define custom names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3555) re-use spark config for parquet timestamp format instead of having our own config

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3555:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> re-use spark config for parquet timestamp format instead of having our own 
> config
> -
>
> Key: HUDI-3555
> URL: https://issues.apache.org/jira/browse/HUDI-3555
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: sivabalan narayanan
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We have two different configs to set the right timestamp format. 
> "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MICROS",
> and spark config
> --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS 
>  
> We should deprecate our own config and just rely on spark's configs. 
>  
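
For illustration, a hedged sketch of the two overlapping settings side by side (the helper is hypothetical; the Hudi key is normally passed as a write option rather than a session config):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimestampConfigSketch {
  public static void write(SparkSession spark, Dataset<Row> df, String basePath) {
    // Spark's own setting, which this ticket proposes to standardize on:
    spark.conf().set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS");
    df.write().format("hudi")
        // Hudi's duplicate setting, which this ticket proposes to deprecate:
        .option("hoodie.parquet.outputtimestamptype", "TIMESTAMP_MICROS")
        .mode("append")
        .save(basePath);
  }
}
{code}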



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] dongkelun opened a new pull request, #6443: [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set

2022-08-18 Thread GitBox


dongkelun opened a new pull request, #6443:
URL: https://github.com/apache/hudi/pull/6443

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4575) Initial Kafka Global Offsets in Hudi Kafka Sink Connector

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4575:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Initial Kafka Global Offsets  in Hudi Kafka Sink Connector
> --
>
> Key: HUDI-4575
> URL: https://issues.apache.org/jira/browse/HUDI-4575
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: kafka-connect
>Reporter: Vishal Agarwal
>Priority: Major
> Fix For: 0.13.0
>
>
> Hi team,
> I am trying to run the Hudi Sink Connector with Kafka Connect. When the
> connector starts, it starts the transaction coordinator, which initializes
> the global committed offsets from the Hudi commit file. On a first-time run,
> there is no commit file, and hence it outputs:
> [2022-08-08 19:58:20,529] INFO Hoodie Extra Metadata from latest commit is
> absent (org.apache.hudi.connect.writers.KafkaConnectTransactionServices:147)
> But if, on that first run, the earliest Kafka offset is not 0, the process
> keeps re-running the commit timelines. Ideally, the global offsets on the
> first run should be set to the earliest Kafka offset.
> As per the current implementation, the participant checks its local offset
> against the coordinator offset and, on a mismatch, resets it to 0. But this
> breaks when it is a fresh run and the globally committed Kafka offset is not 0.
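
A minimal sketch of the proposed initialization (the helper is hypothetical; it only assumes the standard Kafka consumer API): when no Hudi commit metadata exists yet, seed the global offsets from the topic's beginning offsets rather than 0.

{code:java}
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class InitialOffsetsSketch {
  // On a fresh run (no commit file), ask the broker for the earliest
  // available offset per partition instead of defaulting to 0.
  public static Map<TopicPartition, Long> initialOffsets(
      KafkaConsumer<?, ?> consumer, Collection<TopicPartition> partitions) {
    return consumer.beginningOffsets(partitions);
  }
}
{code}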



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] KnightChess commented on issue #6410: [SUPPORT] MergeInto syntax WHEN MATCHED is not optional but must be set.

2022-08-18 Thread GitBox


KnightChess commented on issue #6410:
URL: https://github.com/apache/hudi/issues/6410#issuecomment-1220182156

   Yeah, it looks like `WHEN MATCHED` is required in Hudi's implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3531) Review and shade transitive dependencies in hudi bundle jar

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3531:
-
Priority: Critical  (was: Major)

> Review and shade transitive dependencies in hudi bundle jar
> ---
>
> Key: HUDI-3531
> URL: https://issues.apache.org/jira/browse/HUDI-3531
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: vinoyang
>Priority: Critical
> Fix For: 0.13.0
>
>
> Detailed feedback in 
> https://github.com/apache/hudi/issues/4793#issuecomment-1038016578
> Scope
> - review and adjust the bundling and shaded dependencies
> - test and verify functionalities in different environments and downstream 
> integration (e.g. with Kyuubi)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3531) Review and shade transitive dependencies in hudi bundle jar

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3531:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Review and shade transitive dependencies in hudi bundle jar
> ---
>
> Key: HUDI-3531
> URL: https://issues.apache.org/jira/browse/HUDI-3531
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.13.0
>
>
> Detailed feedback in 
> https://github.com/apache/hudi/issues/4793#issuecomment-1038016578
> Scope
> - review and adjust the bundling and shaded dependencies
> - test and verify functionalities in different environments and downstream 
> integration (e.g. with Kyuubi)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3599) Non-atomic commit could cause streaming read data loss

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3599:
-
Priority: Critical  (was: Major)

> Non-atomic commit could cause streaming read data loss
> -
>
> Key: HUDI-3599
> URL: https://issues.apache.org/jira/browse/HUDI-3599
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Xiaoqiao He
>Priority: Critical
> Fix For: 0.12.1
>
>
> The call hierarchy of the current `commit` implementation is shown below;
> `transitionState` writes the deltacommit file to complete the commit. But
> writing a file is not an atomic operation, on HDFS for instance.
> {code:java}
> HoodieActiveTimeline.transitionState(HoodieInstant, HoodieInstant, 
> Option, boolean)  (org.apache.hudi.common.table.timeline)
>  HoodieActiveTimeline.transitionState(HoodieInstant, HoodieInstant, 
> Option)  (org.apache.hudi.common.table.timeline)
>   HoodieActiveTimeline.saveAsComplete(HoodieInstant, Option)  
> (org.apache.hudi.common.table.timeline)
>BaseHoodieWriteClient.commit(HoodieTable, String, String, 
> HoodieCommitMetadata, List)  (org.apache.hudi.client)
> BaseHoodieWriteClient.commitStats(String, List, 
> Option>, String, Map>)  
> (org.apache.hudi.client)
>  HoodieFlinkWriteClient.commit(String, List, 
> Option>, String, Map>)  
> (org.apache.hudi.client)
>  HoodieJavaWriteClient.commit(String, List, 
> Option>, String, Map>)  
> (org.apache.hudi.client)
> {code}
> As
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline#createImmutableFileInPath
> (below) shows, there are three steps to complete a data write: A. create the
> file, B. write the data, C. close the file handle. If `StreamReadMonitoring`
> traverses this deltacommit file while its content is still null, between steps
> A and B, it will read nothing in that loop. IMO it could lose some commit data
> for streaming reads.
>  
> {code:java}
>   private void createImmutableFileInPath(Path fullPath, Option<byte[]> content) {
> FSDataOutputStream fsout = null;
> try {
>   fsout = metaClient.getFs().create(fullPath, false);
>   if (content.isPresent()) {
> fsout.write(content.get());
>   }
> } catch (IOException e) {
>   throw new HoodieIOException("Failed to create file " + fullPath, e);
> } finally {
>   try {
> if (null != fsout) {
>   fsout.close();
> }
>   } catch (IOException e) {
> throw new HoodieIOException("Failed to close file " + fullPath, e);
>   }
> }
>   }
> {code}
> In order to avoid this corner case, I think we should depend on a `rename`
> operation to complete the commit rather than the create-write-close flow.
> Please correct me if I missed something.
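
A minimal sketch of that rename-based flow (an illustration of the proposal, not Hudi's implementation; it assumes a filesystem where rename is atomic, and the temp-file naming is hypothetical):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicCommitSketch {
  // Write the commit content to a temporary file first, then rename it into
  // place, so a reader never observes a created-but-still-empty instant file.
  public static void commitAtomically(FileSystem fs, Path fullPath, byte[] content)
      throws IOException {
    Path tmp = new Path(fullPath.getParent(), "." + fullPath.getName() + ".tmp");
    try (FSDataOutputStream out = fs.create(tmp, false)) {
      out.write(content);
    }
    if (!fs.rename(tmp, fullPath)) {
      throw new IOException("Failed to rename " + tmp + " to " + fullPath);
    }
  }
}
{code}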



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4643) MergeInto syntax WHEN MATCHED is optional but must be set

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4643:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> MergeInto syntax WHEN MATCHED is optional but must be set
> -
>
> Key: HUDI-4643
> URL: https://issues.apache.org/jira/browse/HUDI-4643
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
> Fix For: 0.13.0
>
>
>  
> {code:java}
> spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long,
> | dt string
> |) using hudi
> | location '${tmp.getCanonicalPath.replaceAll("", "/")}/$tableName'
> | tblproperties (
> | primaryKey ='id',
> | preCombineField = 'ts'
> | )
> """.stripMargin)
> // Insert data
> spark.sql(s"insert into $tableName select 1, 'a1', 1, 10, '2022-08-18'")
> spark.sql(
> s"""
> | merge into $tableName as t0
> | using (
> | select 1 as id, 'a1' as name, 11 as price, 110 as ts, '2022-08-19' as dt 
> union all
> | select 2 as id, 'a2' as name, 10 as price, 100 as ts, '2022-08-18' as dt
> | ) as s0
> | on t0.id = s0.id
> | when not matched then insert *
> """.stripMargin
> )
> {code}
>  
> {code:java}
> 11493 [Executor task launch worker for task 65] ERROR 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor  - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.AssertionError: assertion 
> failed: hoodie.payload.update.condition.assignments have not set
>     at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:335)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:246)
>  
> {code}
>  
>  
> If hoodie.merge.allow.duplicate.on.inserts is set to true, the result is one
> more record than expected:
> [1,a1,1.0,10,2022-08-18], [1,a1,11.0,110,2022-08-19], 
> [2,a2,10.0,100,2022-08-18]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3314) support merge into with no-pk condition

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3314:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> support merge into with no-pk condition
> ---
>
> Key: HUDI-3314
> URL: https://issues.apache.org/jira/browse/HUDI-3314
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-3545) Make HoodieAvroWriteSupport class configurable

2022-08-18 Thread Surya Prasanna Yalla (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surya Prasanna Yalla reassigned HUDI-3545:
--

Assignee: Surya Prasanna Yalla

> Make HoodieAvroWriteSupport class configurable
> --
>
> Key: HUDI-3545
> URL: https://issues.apache.org/jira/browse/HUDI-3545
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Major
> Fix For: 0.12.1
>
>
> Make the HoodieAvroWriteSupport class configurable so that it can be
> overridden by custom write support classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6442: [HUDI-4449] Support DataSourceV2 Read for Spark3.2

2022-08-18 Thread GitBox


hudi-bot commented on PR #6442:
URL: https://github.com/apache/hudi/pull/6442#issuecomment-1220179204

   
   ## CI report:
   
   * ecc3e06795ac1f4c6439abb9fa53b666d2f5a8eb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10816)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] KnightChess commented on issue #6405: [SUPPORT] Using SparkSQL DDL To Create External Error.

2022-08-18 Thread GitBox


KnightChess commented on issue #6405:
URL: https://github.com/apache/hudi/issues/6405#issuecomment-1220179037

   Skip step 1.
   From the `Stacktrace`, it looks like something is off in the config; make sure 
you have set the config `spark.sql.extensions` with the value 
`org.apache.spark.sql.hudi.HoodieSparkSessionExtension`.
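
   For reference, a minimal sketch of a session with that extension enabled (the class and app name are illustrative assumptions; the master is set for a local run):

   ```java
   import org.apache.spark.sql.SparkSession;

   public class HudiSessionSketch {
     public static SparkSession build() {
       return SparkSession.builder()
           .appName("hudi-ddl-sketch")   // hypothetical
           .master("local[*]")           // assumption: local run
           // Required so Hudi DDL/DML statements parse and resolve:
           .config("spark.sql.extensions",
                   "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
           .getOrCreate();
     }
   }
   ```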


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3416) Incremental read using v2 datasource

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3416:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Incremental read using v2 datasource
> 
>
> Key: HUDI-3416
> URL: https://issues.apache.org/jira/browse/HUDI-3416
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: leesf
>Assignee: leesf
>Priority: Major
> Fix For: 0.13.0
>
>
> currently, we still use v1 format for incremental read, and need to use v2 
> format as well.
> see comment: https://github.com/apache/hudi/pull/4611#discussion_r795089099



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3527) Move metrics classes to hudi-metrics module

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3527:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Move metrics classes to hudi-metrics module
> ---
>
> Key: HUDI-3527
> URL: https://issues.apache.org/jira/browse/HUDI-3527
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Raymond Xu
>Priority: Major
>  Labels: hudi-on-call
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3742) Enable parquet enableVectorizedReader for spark incremental read to prevent perf regression

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3742:
-
Priority: Critical  (was: Major)

> Enable parquet enableVectorizedReader for spark incremental read to prevent 
> perf regression
> ---
>
> Key: HUDI-3742
> URL: https://issues.apache.org/jira/browse/HUDI-3742
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Tao Meng
>Assignee: Tao Meng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently we disable the parquet enableVectorizedReader for MOR incremental
> read, and set "spark.sql.parquet.recordLevelFilter.enabled" = "true" to achieve
> the data filtering, which is slow.
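> A minimal sketch of the two Spark settings being contrasted, assuming a
> SparkSession {{spark}} (where Hudi wires them in is up to the fix itself):
> {code:java}
> // The faster path enables the vectorized Parquet reader and drops the
> // record-level filter; MOR incremental read currently does the opposite.
> spark.conf().set("spark.sql.parquet.enableVectorizedReader", "true");
> spark.conf().set("spark.sql.parquet.recordLevelFilter.enabled", "false");
> {code}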



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3657) Unbound the restriction that clean retain commits must be smaller than archive minimum commits

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3657:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Unbound the restriction that clean retain commits must be smaller than 
> archive minimum commits
> --
>
> Key: HUDI-3657
> URL: https://issues.apache.org/jira/browse/HUDI-3657
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.13.0
>
>
> End-to-end streaming processing is more and more popular among Flink users
> now, and in the most typical streaming-ingestion scenario the checkpoint
> interval is within minutes (1 min, 5 mins, ...). Say a user sets the time
> interval to 1 minute; there are then about 60 write commits on the timeline
> per hour:
> {t1, t2, t3, t4 ... t60}
> Now consider the very popular streaming-read scenario. People want to keep the
> history data for a medium retention time (usually 1 day or even 1 week), so
> the user would configure the cleaning retained-commits number as:
> _1 (day) * 24 (hours) * 60 (commits per hour)_ = *1440 commits*
> Considering the current cleaning retain-commits restriction:
> _num_retain_commits < min_archive_commits_num_
> we must keep at least 1440 commits on the active timeline, which means at
> least:
> _1440 * 3 = 4320_
> files on the timeline!!! That puts pressure on the file IO and on the metadata
> scanning (the metadata client). If we do not configure a long enough
> retention, the writer may remove the old files and the reader will encounter
> {{FileNotFoundException}}.
> So we should find a way to lift the restriction that the active timeline
> commits number must be greater than the cleaning retained commits.
> One way I can think of is to remember the last committed cleaning instant and
> only check that when cleaning (suitable for the hours-based cleaning strategy).
> With the num_commits cleaning strategy we may need to scan the archived
> timeline (or the metadata table, if it is enabled?).
> In any case, a solution is eagerly needed now!
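> A worked version of the arithmetic above (a sketch; the factor of 3 assumes one
> requested, one inflight and one completed meta file per commit):
> {code:java}
> int commitsPerHour = 60;                      // 1-minute checkpoint interval
> int retainCommits = 1 * 24 * commitsPerHour;  // keep 1 day of history = 1440
> // current restriction: min_archive_commits_num > num_retain_commits, so the
> // active timeline must hold at least retainCommits instants, i.e. about
> int timelineFiles = retainCommits * 3;        // = 4320 meta files
> {code}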



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3856) Upgrade maven surefire to M5

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3856:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Upgrade maven surefire to M5
> 
>
> Key: HUDI-3856
> URL: https://issues.apache.org/jira/browse/HUDI-3856
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4026) Add support for spark streaming writes to integ test framework

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4026:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Add support for spark streaming writes to integ test framework
> --
>
> Key: HUDI-4026
> URL: https://issues.apache.org/jira/browse/HUDI-4026
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #6437: [HUDI-4632] Map Flink artifact ids correctly as per scala dependence

2022-08-18 Thread GitBox


danny0405 commented on code in PR #6437:
URL: https://github.com/apache/hudi/pull/6437#discussion_r949747182


##
pom.xml:
##
@@ -1823,7 +1823,9 @@
   flink1.15
   
 ${flink1.15.version}
+flink-runtime
 
flink-table-runtime
+
flink-table-planner_2.12

Review Comment:
   Fix the default flink version line 131 ~ line 133:
   
   ```xml
   ${flink1.14.version}
   hudi-flink1.14.x
   1.14
   ```
   should also work ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4248) Upgrade Apache Avro version for hudi-flink

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4248:
-
Component/s: flink

> Upgrade Apache Avro version for hudi-flink
> --
>
> Key: HUDI-4248
> URL: https://issues.apache.org/jira/browse/HUDI-4248
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies, flink
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Major
> Fix For: 0.12.1
>
>
> [CVE-2021-43045|https://github.com/advisories/GHSA-868x-rg4c-cjqg]
> Recommended upgrade version: 1.11.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4317) hudi utilities bundle fails/skips to build w/ azure CI and eventually test fails

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4317.
-
Fix Version/s: 0.12.0
   (was: 0.12.1)
   Resolution: Fixed

> hudi utilities bundle fails/skips to build w/ azure CI and eventually test 
> fails
> 
>
> Key: HUDI-4317
> URL: https://issues.apache.org/jira/browse/HUDI-4317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> {code:java}
> 2022-06-24T08:51:53.7411853Z [INFO] hudi-integ-test-bundle 
> . SKIPPED
> 2022-06-24T08:51:53.7412518Z [INFO] 
> 
> 2022-06-24T08:51:53.7412925Z [INFO] BUILD FAILURE
> 2022-06-24T08:51:53.7413488Z [INFO] 
> 
> 2022-06-24T08:51:53.7413919Z [INFO] Total time:  21:03 min (Wall Clock)
> 2022-06-24T08:51:53.7414450Z [INFO] Finished at: 2022-06-24T08:51:53Z
> 2022-06-24T08:51:53.7415064Z [INFO] 
> 
> 2022-06-24T08:51:53.7477185Z [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project 
> hudi-hadoop-hive-docker: An Ant BuildException has occured: Warning: Could 
> not find file 
> /home/vsts/work/1/s/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar
>  to copy.
> 2022-06-24T08:51:53.7479039Z [ERROR] around Ant part ...<copy file="/home/vsts/work/1/s/docker/hoodie/hadoop/hive_base/../../../../packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar"
>  tofile="target/hoodie-utilities.jar"/>... @ 7:203 in 
> /home/vsts/work/1/s/docker/hoodie/hadoop/hive_base/target/antrun/build-main.xml
> 2022-06-24T08:51:53.7480016Z [ERROR] -> [Help 1]{code}
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9495/logs/26]
>  
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9490/logs/26]
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4289) Flink SQL write to Hudi, with the write operation set to upsert, produces duplicate data in the Hudi partition table.

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4289:
-
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Flink SQL write to Hudi, with the write operation set to upsert, produces 
> duplicate data in the Hudi partition table. 
> 
>
> Key: HUDI-4289
> URL: https://issues.apache.org/jira/browse/HUDI-4289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core, flink, flink-sql
>Reporter: eric
>Priority: Major
> Fix For: 0.13.0
>
>
> [[SUPPORT] -- The UPSERT write operation seems did not WORK ! · Issue #5634 · 
> apache/hudi (github.com)|https://github.com/apache/hudi/issues/5634]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4289) Flink SQL write to Hudi, with the write operation set to upsert, produces duplicate data in the Hudi partition table.

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4289:
-
Component/s: flink-sql

> Flink SQL write to Hudi, with the write operation set to upsert, produces 
> duplicate data in the Hudi partition table. 
> 
>
> Key: HUDI-4289
> URL: https://issues.apache.org/jira/browse/HUDI-4289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core, flink, flink-sql
>Reporter: eric
>Priority: Major
> Fix For: 0.12.1
>
>
> [[SUPPORT] -- The UPSERT write operation seems did not WORK ! · Issue #5634 · 
> apache/hudi (github.com)|https://github.com/apache/hudi/issues/5634]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4317) hudi utilities bundle fails/skips to build w/ azure CI and eventually test fails

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4317:
-
Fix Version/s: 0.12.1

> hudi utilities bundle fails/skips to build w/ azure CI and eventually test 
> fails
> 
>
> Key: HUDI-4317
> URL: https://issues.apache.org/jira/browse/HUDI-4317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> {code:java}
> 2022-06-24T08:51:53.7411853Z [INFO] hudi-integ-test-bundle 
> . SKIPPED
> 2022-06-24T08:51:53.7412518Z [INFO] 
> 
> 2022-06-24T08:51:53.7412925Z [INFO] BUILD FAILURE
> 2022-06-24T08:51:53.7413488Z [INFO] 
> 
> 2022-06-24T08:51:53.7413919Z [INFO] Total time:  21:03 min (Wall Clock)
> 2022-06-24T08:51:53.7414450Z [INFO] Finished at: 2022-06-24T08:51:53Z
> 2022-06-24T08:51:53.7415064Z [INFO] 
> 
> 2022-06-24T08:51:53.7477185Z [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project 
> hudi-hadoop-hive-docker: An Ant BuildException has occured: Warning: Could 
> not find file 
> /home/vsts/work/1/s/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar
>  to copy.
> 2022-06-24T08:51:53.7479039Z [ERROR] around Ant part ...<copy file="/home/vsts/work/1/s/docker/hoodie/hadoop/hive_base/../../../../packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar"
>  tofile="target/hoodie-utilities.jar"/>... @ 7:203 in 
> /home/vsts/work/1/s/docker/hoodie/hadoop/hive_base/target/antrun/build-main.xml
> 2022-06-24T08:51:53.7480016Z [ERROR] -> [Help 1]{code}
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9495/logs/26]
>  
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9490/logs/26]
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4317) hudi utilities bundle fails/skips to build w/ azure CI and eventually test fails

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4317:
-
Priority: Critical  (was: Major)

> hudi utilities bundle fails/skips to build w/ azure CI and eventually test 
> fails
> 
>
> Key: HUDI-4317
> URL: https://issues.apache.org/jira/browse/HUDI-4317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> {code:java}
> 2022-06-24T08:51:53.7411853Z [INFO] hudi-integ-test-bundle 
> . SKIPPED
> 2022-06-24T08:51:53.7412518Z [INFO] 
> 
> 2022-06-24T08:51:53.7412925Z [INFO] BUILD FAILURE
> 2022-06-24T08:51:53.7413488Z [INFO] 
> 
> 2022-06-24T08:51:53.7413919Z [INFO] Total time:  21:03 min (Wall Clock)
> 2022-06-24T08:51:53.7414450Z [INFO] Finished at: 2022-06-24T08:51:53Z
> 2022-06-24T08:51:53.7415064Z [INFO] 
> 
> 2022-06-24T08:51:53.7477185Z [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project 
> hudi-hadoop-hive-docker: An Ant BuildException has occured: Warning: Could 
> not find file 
> /home/vsts/work/1/s/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar
>  to copy.
> 2022-06-24T08:51:53.7479039Z [ERROR] around Ant part ...<copy file="/home/vsts/work/1/s/docker/hoodie/hadoop/hive_base/../../../../packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar"
>  tofile="target/hoodie-utilities.jar"/>... @ 7:203 in 
> /home/vsts/work/1/s/docker/hoodie/hadoop/hive_base/target/antrun/build-main.xml
> 2022-06-24T08:51:53.7480016Z [ERROR] -> [Help 1]{code}
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9495/logs/26]
>  
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9490/logs/26]
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4283) The memory cost in HoodieMergeHandle is increasing as the size of base file

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4283:
--
Priority: Critical  (was: Major)

> The memory cost in HoodieMergeHandle is increasing as the size of base file
> ---
>
> Key: HUDI-4283
> URL: https://issues.apache.org/jira/browse/HUDI-4283
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: archiving, compaction, performance
>Reporter: sherhomhuang
>Assignee: sherhomhuang
>Priority: Critical
> Fix For: 0.12.1
>
>
> Suppose there is a large amount of history data in one dimension table, with
> all files under the base path. There are few changes to the table, but because
> of the large amount of history data it costs a lot of memory in
> _*HoodieMergeHandle*_, since it caches _*writtenRecordKeys*_ in memory.
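> A tiny illustration of the growth pattern described (a hedged sketch, not the
> actual Hudi code; {{recordItr}} is a placeholder record iterator):
> {code:java}
> // The per-handle key cache grows with every record written, so memory scales
> // with the base file size even when only a handful of records changed.
> Set<String> writtenRecordKeys = new HashSet<>();
> while (recordItr.hasNext()) {
>   writtenRecordKeys.add(recordItr.next().getRecordKey());
> }
> {code}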



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4289) Flink SQL write to Hudi, with the write operation set to upsert, produces duplicate data in the Hudi partition table.

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4289:
-
Component/s: flink

> Flink SQL write to Hudi, with the write operation set to upsert, produces 
> duplicate data in the Hudi partition table. 
> 
>
> Key: HUDI-4289
> URL: https://issues.apache.org/jira/browse/HUDI-4289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core, flink
>Reporter: eric
>Priority: Major
> Fix For: 0.12.1
>
>
> [[SUPPORT] -- The UPSERT write operation seems did not WORK ! · Issue #5634 · 
> apache/hudi (github.com)|https://github.com/apache/hudi/issues/5634]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4283) The memory cost in HoodieMergeHandle is increasing as the size of base file

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4283:
--
Fix Version/s: 0.12.1

> The memory cost in HoodieMergeHandle is increasing as the size of base file
> ---
>
> Key: HUDI-4283
> URL: https://issues.apache.org/jira/browse/HUDI-4283
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: archiving, compaction, performance
>Reporter: sherhomhuang
>Assignee: sherhomhuang
>Priority: Major
> Fix For: 0.12.1
>
>
> Suppose there is a large amount of history data in one dimension table, with
> all files under the base path. There are few changes to the table, but because
> of the large amount of history data it costs a lot of memory in
> _*HoodieMergeHandle*_, since it caches _*writtenRecordKeys*_ in memory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4289) Flink SQL write to Hudi, with the write operation set to upsert, produces duplicate data in the Hudi partition table.

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4289:
--
Fix Version/s: 0.12.1

> Flink SQL write to Hudi, with the write operation set to upsert, produces 
> duplicate data in the Hudi partition table. 
> 
>
> Key: HUDI-4289
> URL: https://issues.apache.org/jira/browse/HUDI-4289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: eric
>Priority: Major
> Fix For: 0.12.1
>
>
> [[SUPPORT] -- The UPSERT write operation seems did not WORK ! · Issue #5634 · 
> apache/hudi (github.com)|https://github.com/apache/hudi/issues/5634]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4282) Throws IOException in method HoodieLogFileReader.isBlockCorrupted()

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4282:
--
Priority: Critical  (was: Major)

> Throws IOException in method HoodieLogFileReader.isBlockCorrupted()
> ---
>
> Key: HUDI-4282
> URL: https://issues.apache.org/jira/browse/HUDI-4282
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction
>Reporter: sherhomhuang
>Assignee: sherhomhuang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Not all distributed filesystems throw {_}*EOFException*{_} when seeking past
> the end of a file; {*}chdfs{*}, for example, does not. So it is not suitable
> to use _*EOFException*_ to check whether "{_}*currentPos + blocksize -
> Long.BYTES*{_}" is beyond the file length.
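> A hedged sketch of the check in question (paraphrasing the described logic, not
> the exact Hudi source):
> {code:java}
> // Relying on EOFException to detect a corrupt block breaks on filesystems
> // (e.g. chdfs) that do not throw EOF when seeking past the end of the file.
> try {
>   inputStream.seek(currentPos + blockSize - Long.BYTES);
> } catch (EOFException e) {
>   return true; // treated as a corrupt block; not portable across filesystems
> }
> {code}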



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4326) Hudi spark datasource error after migrate from 0.8 to 0.11

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4326:
-
Priority: Blocker  (was: Major)

> Hudi spark datasource error after migrate from 0.8 to 0.11
> --
>
> Key: HUDI-4326
> URL: https://issues.apache.org/jira/browse/HUDI-4326
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Kyle Zhike Chen
>Assignee: Kyle Zhike Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> After upgrading Hudi from 0.8 to 0.11, using {{spark.table(fullTableName)}} to
> read a Hudi table no longer works; the table has been synced to the Hive
> metastore and Spark is connected to the metastore. The error is:
> org.sparkproject.guava.util.concurrent.UncheckedExecutionException: 
> org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.
> ...
> Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
> at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
> at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> After the table is converted to a Spark data source table, its SerDeInfo
> is missing. I created a pull request.
>  
> related GH issue:
> https://github.com/apache/hudi/issues/5861
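> A minimal sketch of the usual workaround while the metastore entry is broken:
> read by base path instead of by table name, assuming a SparkSession {{spark}}
> ("<table-base-path>" is a placeholder):
> {code:java}
> // Bypass the metastore lookup and load the table by its base path.
> Dataset<Row> df = spark.read().format("hudi").load("<table-base-path>");
> {code}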



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4281) Using hudi to build a large number of tables in spark on hive causes OOM

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-4281:
-

Assignee: sivabalan narayanan

> Using hudi to build a large number of tables in spark on hive causes OOM
> 
>
> Key: HUDI-4281
> URL: https://issues.apache.org/jira/browse/HUDI-4281
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.12.0
> Environment: spark on hive long running service
> spark version: 3.2.1
> hive version: 2.3.x
> hudi version: hudi-spark3.2-bundle_2.12-0.11.0
>Reporter: zhangrenhua
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> On a long-running Spark-on-Hive service, if a large number of tables are
> created, OOM will occur because the IsolatedClientLoader objects cannot be
> released.
> The Hive client should adopt the singleton pattern, which avoids the OOM
> caused by unreleased IsolatedClientLoader objects.
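> A minimal sketch of the singleton pattern suggested above ({{HiveClient}} and
> {{HiveClient.create()}} are placeholders for whatever client type the
> integration actually constructs):
> {code:java}
> public final class HiveClientHolder {
>   private HiveClientHolder() {}
> 
>   private static class Lazy {
>     // Initialized once by the JVM on first access; reusing the instance keeps
>     // additional IsolatedClientLoader objects from piling up.
>     static final HiveClient INSTANCE = HiveClient.create(); // placeholder factory
>   }
> 
>   public static HiveClient get() {
>     return Lazy.INSTANCE;
>   }
> }
> {code}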



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4340) DeltaStreamer bootstrap failed when metrics on caused by DateTimeParseException: Text '00000000000001999' could not be parsed

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4340:


Assignee: Teng Huo

> DeltaStreamer bootstrap failed when metrics on caused by 
> DateTimeParseException: Text '01999' could not be parsed
> -
>
> Key: HUDI-4340
> URL: https://issues.apache.org/jira/browse/HUDI-4340
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer, metrics
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: error-deltastreamer.log
>
>
> Found this bug in the Hudi integration test ITTestHoodieDemo.java.
> HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS is an invalid value, 
> "01", which cannot be parsed by DateTimeFormatter with the format 
> SECS_INSTANT_TIMESTAMP_FORMAT = "MMddHHmmss" in the method 
> HoodieInstantTimeGenerator.parseDateFromInstantTime.
> Error code at 
> org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator.parseDateFromInstantTime(HoodieInstantTimeGenerator.java:96)
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java#L100
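> A standalone sketch that reproduces the failure mode (assuming the full pattern
> is "yyyyMMddHHmmss"; the pattern text above appears truncated):
> {code:java}
> import java.time.LocalDateTime;
> import java.time.format.DateTimeFormatter;
> 
> public class BootstrapInstantParseRepro {
>   public static void main(String[] args) {
>     // A formatter built for real wall-clock instants cannot parse the
>     // synthetic bootstrap instant string from the summary.
>     DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
>     LocalDateTime.parse("00000000000001999", fmt); // throws DateTimeParseException
>   }
> }
> {code}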



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4281) Using hudi to build a large number of tables in spark on hive causes OOM

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4281:
--
Fix Version/s: 0.12.1

> Using hudi to build a large number of tables in spark on hive causes OOM
> 
>
> Key: HUDI-4281
> URL: https://issues.apache.org/jira/browse/HUDI-4281
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.12.0
> Environment: spark on hive long running service
> spark version: 3.2.1
> hive version: 2.3.x
> hudi version: hudi-spark3.2-bundle_2.12-0.11.0
>Reporter: zhangrenhua
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> On a long-running Spark-on-Hive service, if a large number of tables are
> created, OOM will occur because the IsolatedClientLoader objects cannot be
> released.
> The Hive client should adopt the singleton pattern, which avoids the OOM
> caused by unreleased IsolatedClientLoader objects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4282) Throws IOException in method HoodieLogFileReader.isBlockCorrupted()

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4282:
--
Fix Version/s: 0.12.1

> Throws IOException in method HoodieLogFileReader.isBlockCorrupted()
> ---
>
> Key: HUDI-4282
> URL: https://issues.apache.org/jira/browse/HUDI-4282
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction
>Reporter: sherhomhuang
>Assignee: sherhomhuang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Not all distributed filesystems throw {_}*EOFException*{_} when seeking past
> the end of a file; {*}chdfs{*}, for example, does not. So it is not suitable
> to use _*EOFException*_ to check whether "{_}*currentPos + blocksize -
> Long.BYTES*{_}" is beyond the file length.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4326) Hudi spark datasource error after migrate from 0.8 to 0.11

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4326:
-
Fix Version/s: 0.12.1

> Hudi spark datasource error after migrate from 0.8 to 0.11
> --
>
> Key: HUDI-4326
> URL: https://issues.apache.org/jira/browse/HUDI-4326
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Kyle Zhike Chen
>Assignee: Kyle Zhike Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> After upgrading Hudi from 0.8 to 0.11, using {{spark.table(fullTableName)}} to
> read a Hudi table no longer works; the table has been synced to the Hive
> metastore and Spark is connected to the metastore. The error is:
> org.sparkproject.guava.util.concurrent.UncheckedExecutionException: 
> org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.
> ...
> Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
> at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
> at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> After the table is converted to a Spark data source table, its SerDeInfo
> is missing. I created a pull request.
>  
> related GH issue:
> https://github.com/apache/hudi/issues/5861



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4340) DeltaStreamer bootstrap failed when metrics on caused by DateTimeParseException: Text '00000000000001999' could not be parsed

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4340:
-
Priority: Blocker  (was: Major)

> DeltaStreamer bootstrap failed when metrics on caused by 
> DateTimeParseException: Text '01999' could not be parsed
> -
>
> Key: HUDI-4340
> URL: https://issues.apache.org/jira/browse/HUDI-4340
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer, metrics
>Reporter: Teng Huo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: error-deltastreamer.log
>
>
> Found this bug in the Hudi integration test ITTestHoodieDemo.java.
> HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS is an invalid value, 
> "01", which cannot be parsed by DateTimeFormatter with the format 
> SECS_INSTANT_TIMESTAMP_FORMAT = "MMddHHmmss" in the method 
> HoodieInstantTimeGenerator.parseDateFromInstantTime.
> Error code at 
> org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator.parseDateFromInstantTime(HoodieInstantTimeGenerator.java:96)
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java#L100



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4281) Using hudi to build a large number of tables in spark on hive causes OOM

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4281:
--
Priority: Critical  (was: Major)

> Using hudi to build a large number of tables in spark on hive causes OOM
> 
>
> Key: HUDI-4281
> URL: https://issues.apache.org/jira/browse/HUDI-4281
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.12.0
> Environment: spark on hive long running service
> spark version: 3.2.1
> hive version: 2.3.x
> hudi version: hudi-spark3.2-bundle_2.12-0.11.0
>Reporter: zhangrenhua
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> On a long-running Spark-on-Hive service, if a large number of tables are
> created, OOM will occur because the IsolatedClientLoader objects cannot be
> released.
> The Hive client should adopt the singleton pattern, which avoids the OOM
> caused by unreleased IsolatedClientLoader objects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4362) Spark: Support dynamic partition filtering in 3.2

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4362:
-
Fix Version/s: 0.13.0

> Spark: Support dynamic partition filtering in 3.2
> -
>
> Key: HUDI-4362
> URL: https://issues.apache.org/jira/browse/HUDI-4362
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: chenliang
>Assignee: chenliang
>Priority: Major
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4341) HoodieHFileReader is not compatible with Hadoop 3

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4341:
-
Priority: Blocker  (was: Major)

> HoodieHFileReader is not compatible with Hadoop 3
> -
>
> Key: HUDI-4341
> URL: https://issues.apache.org/jira/browse/HUDI-4341
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: spark
> Fix For: 0.12.1
>
>
> [https://github.com/apache/hudi/issues/5765]
> Spark SQL throws "java.lang.NoSuchMethodError: 
> org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" after 
> a while.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4342) Improve handling of 5xx in timeline server

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4342:
-
Priority: Critical  (was: Major)

> Improve handling of 5xx in timeline server
> --
>
> Key: HUDI-4342
> URL: https://issues.apache.org/jira/browse/HUDI-4342
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: timeline-server
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.12.1
>
>
> https://github.com/apache/hudi/issues/5780



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4242) Follow up on getAllPartitionPaths perf enhancement

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4242:
--
Priority: Critical  (was: Major)

> Follow up on getAllPartitionPaths perf enhancement
> --
>
> Key: HUDI-4242
> URL: https://issues.apache.org/jira/browse/HUDI-4242
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> GetAllPartitionPaths had some perf degradation from 0.9.0 to 0.10.0, and hence
> we had reverted the change for now. But the change as such was good, so we
> want to follow up and see whether we can fix/enhance the new code. The old
> code does not leverage the Spark engine to parallelize across different
> folders, so there could be scope for improvement; but from the perf numbers it
> is not straightforward. Hence this follow-up ticket.
>  
> Excerpt from the findings, for one of my test tables in S3, with an EMR
> cluster (10k partitions):
>  # With 0.11.0:
> 147 secs.
>  # With this patch as is (where the engine context is not used for the 2nd
> phase):
> 5.7 secs.
>  # Latest master + adding engineContext for the 2nd phase:
> 16 secs.
>  # I also tried completely rewriting the DAG:
> 12 secs.
>   while (!pathsToList.isEmpty()) {
>     // TODO: Get the parallelism from HoodieWriteConfig
>     int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
>     // List all directories in parallel
>     List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
>       FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
>       return Arrays.stream(fileSystem.listStatus(path));
>     }, listingParallelism);
>     pathsToList.clear();
>     // if the current directory contains partition metadata, add it to the result;
>     // if it does not, add it to the queue
>     int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size());
>     List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
>       FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get());
>       if (fileStatus.isDirectory()) {
>         if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) {
>           return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath())), Option.empty());
>         } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
>           return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));
>         }
>       } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
>         String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
>         return Pair.of(Option.of(partitionName), Option.empty());
>       }
>       return Pair.of(Option.empty(), Option.empty());
>     }, fileListingParallelism);
>     partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()).map(entry -> entry.getKey().get())
>         .collect(Collectors.toList()));
>     pathsToList.addAll(result.stream().filter(entry -> entry.getValue().isPresent()).map(entry -> entry.getValue().get())
>         .collect(Collectors.toList()));
>   }
> So, based on the above findings, I will go with what we have in this patch in
> its current state, address Raymond's and Alexey's feedback alone, and unblock
> 0.11.1.
>  
> Ref patch: https://github.com/apache/hudi/pull/5829



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4242) Follow up on getAllPartitionPaths perf enhancement

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4242:
--
Fix Version/s: 0.12.1

> Follow up on getAllPartitionPaths perf enhancement
> --
>
> Key: HUDI-4242
> URL: https://issues.apache.org/jira/browse/HUDI-4242
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> GetAllPartitionPaths had some perf degradation from 0.9.0 to 0.10.0, and hence
> we had reverted the change for now. But the change as such was good, so we
> want to follow up and see whether we can fix/enhance the new code. The old
> code does not leverage the Spark engine to parallelize across different
> folders, so there could be scope for improvement; but from the perf numbers it
> is not straightforward. Hence this follow-up ticket.
>  
> Excerpt from the findings, for one of my test tables in S3, with an EMR
> cluster (10k partitions):
>  # With 0.11.0:
> 147 secs.
>  # With this patch as is (where the engine context is not used for the 2nd
> phase):
> 5.7 secs.
>  # Latest master + adding engineContext for the 2nd phase:
> 16 secs.
>  # I also tried completely rewriting the DAG:
> 12 secs.
>   while (!pathsToList.isEmpty()) {
>     // TODO: Get the parallelism from HoodieWriteConfig
>     int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
>     // List all directories in parallel
>     List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
>       FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
>       return Arrays.stream(fileSystem.listStatus(path));
>     }, listingParallelism);
>     pathsToList.clear();
>     // if the current directory contains partition metadata, add it to the result;
>     // if it does not, add it to the queue
>     int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size());
>     List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
>       FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get());
>       if (fileStatus.isDirectory()) {
>         if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) {
>           return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath())), Option.empty());
>         } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
>           return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));
>         }
>       } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
>         String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
>         return Pair.of(Option.of(partitionName), Option.empty());
>       }
>       return Pair.of(Option.empty(), Option.empty());
>     }, fileListingParallelism);
>     partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()).map(entry -> entry.getKey().get())
>         .collect(Collectors.toList()));
>     pathsToList.addAll(result.stream().filter(entry -> entry.getValue().isPresent()).map(entry -> entry.getValue().get())
>         .collect(Collectors.toList()));
>   }
> So, based on the above findings, I will go with what we have in this patch in
> its current state, address Raymond's and Alexey's feedback alone, and unblock
> 0.11.1.
>  
> Ref patch: https://github.com/apache/hudi/pull/5829



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4363) Support Clustering row writer to improve performance

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4363:
-
Priority: Critical  (was: Major)

> Support Clustering row writer to improve performance
> 
>
> Key: HUDI-4363
> URL: https://issues.apache.org/jira/browse/HUDI-4363
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance, writer-core
>Reporter: Hui An
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-07-05 at 17.25.13.png
>
>
> 1. Integrate clustering with the datasource read and write API. In this way:
> - clustering can use the Dataset API
> - the read and write operations are unified, so if the read/write logic is
> improved (e.g. vectorized read), clustering benefits as well
> 2. Use {{hoodie.datasource.read.paths}} to pass the paths for each clustering
> operation (see the sketch after this description)
> 3. Introduce {{HoodieInternalWriteStatusCoordinator}} to persist the
> {{InternalWriteStatus}} of a clustering action, as we cannot get it when using
> the Spark datasource.
> 4. Add new configs to control this behavior.
> h4. Test performance
> A test table with 21 columns and 710716 rows; raw data size 929g (in Spark
> memory), 38.3g after compression.
> Executor memory: 50g, 20 instances, with global_sort enabled.
> Without clustering as row: 32 mins, 12 sec
> Using clustering as row: 9 mins, 51 sec
> The performance improvement can also be seen in the tests
> {{TestHoodieSparkMergeOnReadTableClustering}} and
> {{testLayoutOptimizationFunctional}}.
>  !Screen Shot 2022-07-05 at 17.25.13.png! 
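> A minimal sketch of the read side in point 2, passing one clustering
> operation's file paths through the datasource option, assuming a SparkSession
> {{spark}} ({{clusteringFilePaths}} is a placeholder list of base-file paths):
> {code:java}
> // Read exactly the files of one clustering operation as rows.
> Dataset<Row> rows = spark.read().format("hudi")
>     .option("hoodie.datasource.read.paths", String.join(",", clusteringFilePaths))
>     .load();
> {code}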



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4340) DeltaStreamer bootstrap failed when metrics on caused by DateTimeParseException: Text '00000000000001999' could not be parsed

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4340:
-
Fix Version/s: 0.12.1

> DeltaStreamer bootstrap failed when metrics on caused by 
> DateTimeParseException: Text '01999' could not be parsed
> -
>
> Key: HUDI-4340
> URL: https://issues.apache.org/jira/browse/HUDI-4340
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer, metrics
>Reporter: Teng Huo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: error-deltastreamer.log
>
>
> Found this bug in Hudi integrate test ITTestHoodieDemo.java
> HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS is a invalid value, 
> "01", which can not be parsed by DateTimeFormatter with format 
> SECS_INSTANT_TIMESTAMP_FORMAT = "MMddHHmmss" in method 
> HoodieInstantTimeGenerator.parseDateFromInstantTime
> Error code at 
> org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator.parseDateFromInstantTime(HoodieInstantTimeGenerator.java:96)
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java#L100



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4256) Bulk insert of a large dataset with S3 fails w/ timeline server based markers

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4256:
--
Fix Version/s: 0.12.1

> Bulk insert of a large dataset with S3 fails w/ timeline server based markers
> -
>
> Key: HUDI-4256
> URL: https://issues.apache.org/jira/browse/HUDI-4256
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.12.1
>
>
> When timeline-server-based markers are used for a very large table with the
> bulk_insert row writer (1TB or more), we run into a null pointer
> exception.
>  
> {code:java}
> 2022-06-14 13:27:13,135 WARN hudi.DataSourceUtils: Small Decimal Type found 
> in current schema, auto set the value of 
> hoodie.parquet.writelegacyformat.enabled to true
> 2022-06-14 13:27:13,562 WARN metadata.HoodieBackedTableMetadata: Metadata 
> table was not found at path 
> s3a://datasets-abcde/hudi/web_sales/.hoodie/metadata
> 2022-06-14 13:27:13,972 WARN metadata.HoodieBackedTableMetadata: Metadata 
> table was not found at path 
> s3a://datasets-abcde/hudi/web_sales/.hoodie/metadata
> Exception in thread "pool-37-thread-6" java.lang.NullPointerException200) / 
> 200]
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.flushMarkersToFile(MarkerDirState.java:323)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.processMarkerCreationRequests(MarkerDirState.java:219)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.BatchedMarkerCreationRunnable.run(BatchedMarkerCreationRunnable.java:46)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Exception in thread "pool-37-thread-8" java.lang.NullPointerException
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.flushMarkersToFile(MarkerDirState.java:323)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.processMarkerCreationRequests(MarkerDirState.java:219)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.BatchedMarkerCreationRunnable.run(BatchedMarkerCreationRunnable.java:46)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> 2022-06-14 13:31:30,483 WARN impl.BulkDeleteRetryHandler: Bulk delete 
> operation interrupted: delete: com.amazonaws.SdkClientException: Failed to 
> parse XML document with handler class 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$DeleteObjectsHandler:
>  Failed to parse XML document with handler class 
> com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$DeleteObjectsHandler
> Exception in thread "pool-37-thread-22" java.lang.NullPointerException
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.flushMarkersToFile(MarkerDirState.java:323)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.processMarkerCreationRequests(MarkerDirState.java:219)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.BatchedMarkerCreationRunnable.run(BatchedMarkerCreationRunnable.java:46)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Exception in thread "pool-37-thread-10" java.lang.NullPointerException
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.flushMarkersToFile(MarkerDirState.java:323)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.processMarkerCreationRequests(MarkerDirState.java:219)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.BatchedMarkerCreationRunnable.run(BatchedMarkerCreationRunnable.java:46)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Exception in thread "pool-37-thread-14" java.lang.NullPointerException
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.flushMarkersToFile(MarkerDirState.java:323)
>   at 
> org.apache.hudi.timeline.service.handlers.marker.MarkerDirState.processMarkerCreationRequests(MarkerDirState.java:219)
>   at 
> 

[jira] [Updated] (HUDI-4272) Auto detect partitions without PartitionValueExtractor

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4272:
--
Fix Version/s: 0.12.1

> Auto detect partitions without PartitionValueExtractor
> --
>
> Key: HUDI-4272
> URL: https://issues.apache.org/jira/browse/HUDI-4272
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> Deprecate all PartitionValueExtractor; users do not need to configure it for 
> meta sync.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4363) Support Clustering row writer to improve performance

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4363:


Assignee: Hui An

> Support Clustering row writer to improve performance
> 
>
> Key: HUDI-4363
> URL: https://issues.apache.org/jira/browse/HUDI-4363
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance, writer-core
>Reporter: Hui An
>Assignee: Hui An
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-07-05 at 17.25.13.png
>
>
> 1. Integrate clustering with the datasource read and write API. In this way:
> - clustering can use the Dataset API
> - the read and write operations are unified, so if the read/write logic is
> improved (e.g. vectorized read), clustering benefits as well
> 2. Use {{hoodie.datasource.read.paths}} to pass the paths for each clustering
> operation
> 3. Introduce {{HoodieInternalWriteStatusCoordinator}} to persist the
> {{InternalWriteStatus}} of a clustering action, as we cannot get it when using
> the Spark datasource.
> 4. Add new configs to control this behavior.
> h4. Test performance
> A test table with 21 columns and 710716 rows; raw data size 929g (in Spark
> memory), 38.3g after compression.
> Executor memory: 50g, 20 instances, with global_sort enabled.
> Without clustering as row: 32 mins, 12 sec
> Using clustering as row: 9 mins, 51 sec
> The performance improvement can also be seen in the tests
> {{TestHoodieSparkMergeOnReadTableClustering}} and
> {{testLayoutOptimizationFunctional}}.
>  !Screen Shot 2022-07-05 at 17.25.13.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4242) Follow up on getAllPartitionPaths perf enhancement

2022-08-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581601#comment-17581601
 ] 

sivabalan narayanan commented on HUDI-4242:
---

Mostly this is taken care of already, but we need to verify once and close.

> Follow up on getAllPartitionPaths perf enhancement
> --
>
> Key: HUDI-4242
> URL: https://issues.apache.org/jira/browse/HUDI-4242
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> GetAllPartitionPaths had some perf degradation from 0.9.0 to 0.10.0 and hence 
> we had reverted the change for now. But the change as such was good. So, we 
> want to follow up to see if we can fix/enhance the new code. Old code does 
> not leverage the spark engine to parallelize across diff folders. So, there 
> could be scope for improvement. but from the perf nos, its not straight 
> forward. So creating a follow up ticket.
>  
> excerpt from the findings. 
> For one of my test tables in S3, with EMR cluster (10k partitions) 
>  # With 0.11.0:
> 147 secs.
>  # With this patch as is (where engine context is not used for 2nd phase)
> 5.7 secs.
>  # Latest master + adding engineContext for 2nd phase:
> 16 secs.
>  # I also tried completely rewriting the dag.
> 12 secs.
>   while (!pathsToList.isEmpty()) {
>     // TODO: Get the parallelism from HoodieWriteConfig
>     int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
>     // List all directories at the current level in parallel
>     List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
>       FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
>       return Arrays.stream(fileSystem.listStatus(path));
>     }, listingParallelism);
>     pathsToList.clear();
>     // if the current directory contains partition metadata, add it to the result;
>     // if it does not, add it to the queue for the next round
>     int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size());
>     List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> {
>       FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get());
>       if (fileStatus.isDirectory()) {
>         if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) {
>           return Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath())), Option.empty());
>         } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) {
>           return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));
>         }
>       } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) {
>         String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent());
>         return Pair.of(Option.of(partitionName), Option.empty());
>       }
>       return Pair.of(Option.empty(), Option.empty());
>     }, fileListingParallelism);
>     partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()).map(entry -> entry.getKey().get())
>         .collect(Collectors.toList()));
>     pathsToList.addAll(result.stream().filter(entry -> entry.getValue().isPresent()).map(entry -> entry.getValue().get())
>         .collect(Collectors.toList()));
>   }
> So, based on the above findings, I will go with what we have in this patch
> in its current state, address Raymond's and Alexey's feedback alone, and
> unblock 0.11.1.
>  
> Ref patch: https://github.com/apache/hudi/pull/5829
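
For context, the loop above is a breadth-first traversal: each round lists one
directory level in parallel through the engine context, then splits the
children into partitions (those carrying partition metadata) and directories
to descend into. A minimal, self-contained sketch of the same idea using a
plain Java thread pool instead of Spark; all class and marker names below are
illustrative, not Hudi APIs:

{code:java}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPartitionLister {

  // Directories containing this marker file count as partitions, mirroring
  // Hudi's .hoodie_partition_metadata convention (the marker name is made up here).
  private static final String PARTITION_MARKER = ".partition_metadata";

  public static List<Path> listPartitions(Path basePath, int parallelism)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Path> partitions = new ArrayList<>();
      List<Path> pathsToList = new ArrayList<>(Collections.singletonList(basePath));
      while (!pathsToList.isEmpty()) {
        // Phase 1: list every directory at the current level in parallel.
        List<Future<List<Path>>> futures = new ArrayList<>();
        for (Path dir : pathsToList) {
          futures.add(pool.submit(() -> listChildren(dir)));
        }
        // Phase 2: classify children as partitions or directories to descend into.
        pathsToList = new ArrayList<>();
        for (Future<List<Path>> future : futures) {
          for (Path child : future.get()) {
            if (!Files.isDirectory(child)) {
              continue;
            }
            if (Files.exists(child.resolve(PARTITION_MARKER))) {
              partitions.add(basePath.relativize(child)); // found a partition
            } else {
              pathsToList.add(child); // queue for the next round
            }
          }
        }
      }
      return partitions;
    } finally {
      pool.shutdown();
    }
  }

  private static List<Path> listChildren(Path dir) throws IOException {
    List<Path> children = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      stream.forEach(children::add);
    }
    return children;
  }
}
{code}

On object stores the parallel listing wins because per-directory listing
latency, not CPU, dominates; the thread pool here plays the role that the
Spark engine context plays in the snippet quoted above.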



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4226) CLI unable to rollback complete instants

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4226:
--
Priority: Critical  (was: Major)

> CLI unable to rollback complete instants
> 
>
> Key: HUDI-4226
> URL: https://issues.apache.org/jira/browse/HUDI-4226
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli, table-service
>Reporter: Raymond Xu
>Priority: Critical
> Fix For: 0.12.1
>
>
> To reproduce: write 3 commits to a COW table and execute the CLI to roll
> back the last 2 commits: `commit rollback --commit xxx`
> CLI IT is currently disabled for this: 
> org.apache.hudi.cli.integ.ITTestCommitsCommand#testRollbackCommit
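
For a programmatic variant of the reproduction, a hedged sketch against the
Spark write client (the disabled IT exercises the CLI path itself; the client
setup and record batches below are assumed placeholders):

{code:java}
// Assumes a configured SparkRDDWriteClient `client` over a COW table,
// and three prepared JavaRDD<HoodieRecord> batches.
String c1 = client.startCommit();
client.commit(c1, client.upsert(batch1, c1));
String c2 = client.startCommit();
client.commit(c2, client.upsert(batch2, c2));
String c3 = client.startCommit();
client.commit(c3, client.upsert(batch3, c3));

// Roll back the last two commits, newest first; each call is the
// programmatic equivalent of `commit rollback --commit <instant>` in hudi-cli.
client.rollback(c3);
client.rollback(c2);
{code}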



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4226) CLI unable to rollback complete instants

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4226:
--
Fix Version/s: 0.12.1

> CLI unable to rollback complete instants
> 
>
> Key: HUDI-4226
> URL: https://issues.apache.org/jira/browse/HUDI-4226
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli, table-service
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> To reproduce: write 3 commits to a COW table and execute the CLI to roll
> back the last 2 commits: `commit rollback --commit xxx`
> CLI IT is currently disabled for this: 
> org.apache.hudi.cli.integ.ITTestCommitsCommand#testRollbackCommit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4201) Add tooling to delete empty non-completed instants from timeline

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4201:
--
Priority: Critical  (was: Major)

> Add tooling to delete empty non-completed instants from timeline
> 
>
> Key: HUDI-4201
> URL: https://issues.apache.org/jira/browse/HUDI-4201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: archiving
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> If there are empty instants in the timeline, older versions of Hudi can run
> into issues. We have put in a fix [here|https://github.com/apache/hudi/pull/5261]
> for it, but we would like to provide users on older versions with a tool that
> assists in deleting such empty instants in case they are not completed. 
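
One possible shape for such a tool, sketched over the Hadoop FileSystem API.
The class name and the size/suffix-based check are illustrative rather than
the eventual Hudi implementation, and it must only be run against instants
known to be stale (no concurrent writer on the table):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmptyInstantCleaner {
  // Deletes zero-byte, non-completed instant files ('.inflight'/'.requested')
  // from the timeline folder. Always dry-run first.
  public static void clean(String basePath, boolean dryRun) throws Exception {
    Path timeline = new Path(basePath, ".hoodie");
    FileSystem fs = timeline.getFileSystem(new Configuration());
    for (FileStatus status : fs.listStatus(timeline)) {
      String name = status.getPath().getName();
      boolean nonCompleted = name.endsWith(".inflight") || name.endsWith(".requested");
      if (status.isFile() && nonCompleted && status.getLen() == 0) {
        System.out.println((dryRun ? "Would delete: " : "Deleting: ") + status.getPath());
        if (!dryRun) {
          fs.delete(status.getPath(), false);
        }
      }
    }
  }
}
{code}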



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4363) Support Clustering row writer to improve performance

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4363:
-
Fix Version/s: 0.13.0

> Support Clustering row writer to improve performance
> 
>
> Key: HUDI-4363
> URL: https://issues.apache.org/jira/browse/HUDI-4363
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance, writer-core
>Reporter: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-07-05 at 17.25.13.png
>
>
> 1. Integrate clustering with the datasource read and write API. This way we
> - enable clustering to use the Dataset API, and
> - unify the read and write operations, so that any improvement in the
> read/write logic (such as vectorized reads) also benefits clustering.
> 2. Use {{hoodie.datasource.read.paths}} to pass the input paths for each
> clustering operation (see the sketch after the description below).
> 3. Introduce {{HoodieInternalWriteStatusCoordinator}} to persist the
> {{InternalWriteStatus}} of a clustering action, since we cannot obtain it
> when going through the Spark datasource.
> 4. Add new configs to control this behavior.
> h4. Test performance
> A test table has 21 columns and 710716 rows; raw data size is 929g (in Spark
> memory), 38.3g after compression.
> Executor memory: 50g, 20 instances, with global_sort enabled.
> Without clustering as row: 32 mins, 12 sec
> Using clustering as row: 9 mins, 51 sec
> We can also see the performance improvement in the tests
> {{TestHoodieSparkMergeOnReadTableClustering}} and
> {{testLayoutOptimizationFunctional}}.
>  !Screen Shot 2022-07-05 at 17.25.13.png! 
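
To make point 2 of the description concrete: {{hoodie.datasource.read.paths}}
is an existing datasource read option, so the row-based clustering path can
load exactly the file groups of one clustering operation through the regular
reader. A hedged sketch (the SparkSession {{spark}}, the
{{clusteringInputPaths}} list, and the sort columns are assumed; the write
side is omitted):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read only the base files belonging to one clustering operation, as rows.
Dataset<Row> rows = spark.read().format("hudi")
    .option("hoodie.datasource.read.paths", String.join(",", clusteringInputPaths))
    .load();

// Re-layout the data, e.g. a global sort on the configured sort columns,
// before writing out the replacement file groups.
Dataset<Row> reordered = rows.sort("sortCol1", "sortCol2");
{code}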



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4225) Support multi writer with Java write client

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4225:
--
Fix Version/s: 0.13.0

> Support multi writer with Java write client
> ---
>
> Key: HUDI-4225
> URL: https://issues.apache.org/jira/browse/HUDI-4225
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> We don't have multi-writer support with the Java client. We need to add
> support similar to the Spark client (a configuration sketch follows below).
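
For reference, the Spark client gets multi-writer through optimistic
concurrency control plus an external lock provider, and a Java-client
implementation would presumably expose the same configuration surface. A
hedged sketch of that wiring (the config keys are existing Hudi options;
{{basePath}} and {{hadoopConf}} are assumed, and the client setup is
abbreviated):

{code:java}
Properties props = new Properties();
props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
props.setProperty("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
props.setProperty("hoodie.write.lock.zookeeper.url", "zk1");
props.setProperty("hoodie.write.lock.zookeeper.port", "2181");
props.setProperty("hoodie.write.lock.zookeeper.lock_key", "my_table");
props.setProperty("hoodie.write.lock.zookeeper.base_path", "/hudi/locks");

HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withProperties(props)
    .build();

// Each concurrent writer builds its own client; the lock provider serializes
// the conflicting portions of the two commits.
HoodieJavaWriteClient<HoodieAvroPayload> client =
    new HoodieJavaWriteClient<>(new HoodieJavaEngineContext(hadoopConf), config);
{code}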



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4369) Hudi Kafka Connect Sink writing to GCS bucket

2022-08-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4369:
-
Fix Version/s: 0.12.1

> Hudi Kafka Connect Sink writing to GCS bucket
> -
>
> Key: HUDI-4369
> URL: https://issues.apache.org/jira/browse/HUDI-4369
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: kafka-connect
>Reporter: Vishal Agarwal
>Priority: Critical
> Fix For: 0.12.1
>
>
> Hi team,
> I am trying to use the Hudi sink connector with Kafka Connect to write to a
> GCS bucket, but I am getting an error regarding the "gs" file scheme. I have
> added all GCS-related properties in core-site.xml and the corresponding
> gcs-connector jar in the plugin path, but I am still facing the issue.
> The issue was already reported for S3 in
> https://issues.apache.org/jira/browse/HUDI-3610, but I am unable to find a
> resolution there.
> Happy to discuss this!
> Thanks
> *Stack trace:*
> %d [%thread] %-5level %logger - %msg%n 
> org.apache.hudi.exception.HoodieException: Fatal error instantiating Hudi Write Provider 
>  at org.apache.hudi.connect.writers.KafkaConnectWriterProvider.<init>(KafkaConnectWriterProvider.java:103) ~[connectors-uber.jar:?]
>  at org.apache.hudi.connect.transaction.ConnectTransactionParticipant.<init>(ConnectTransactionParticipant.java:65) ~[connectors-uber.jar:?]
>  at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:198) [connectors-uber.jar:?]
>  at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151) [connectors-uber.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:587) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:67) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:652) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) [kafka-clients-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:444) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:317) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) [connect-runtime-2.4.1.jar:?]
>  at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) [connect-runtime-2.4.1.jar:?]
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_331]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_331]
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_331]
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_331]
>  at java.lang.Thread.run(Thread.java:750) [?:1.8.0_331]
> Caused by: org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
>  at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:109) ~[connectors-uber.jar:?]
>  at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:100) ~[connectors-uber.jar:?]
>  at org.apache.hudi.client.BaseHoodieClient.<init>(BaseHoodieClient.java:69) ~[connectors-uber.jar:?]
>  at org.apache.hudi.client.BaseHoodieWriteClient.<init>(BaseHoodieWriteClient.java:175) ~[connectors-uber.jar:?]
>  at 
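
For what it's worth, the {{Failed to get instance of org.apache.hadoop.fs.FileSystem}}
cause usually means Hadoop cannot resolve the {{gs}} scheme from the
configuration and classpath visible to the sink task. A hedged sanity check
using the standard GCS-connector settings (the bucket name is a placeholder);
if this fails in the same JVM/classloader that runs the task, the connector
jar or config is not actually visible there:

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration(); // picks up core-site.xml if it is on the classpath
// Standard GCS connector wiring; these must be visible to the classloader
// that actually runs the Hudi sink task, not just the Connect worker.
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf);
System.out.println("Resolved 'gs' to: " + fs.getClass().getName());
{code}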

[jira] [Updated] (HUDI-4185) Evaluate alternatives to using "hoodie.properties" as state store for Metadata Table

2022-08-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4185:
--
Fix Version/s: 0.12.1
   (was: 0.13.0)

> Evaluate alternatives to using "hoodie.properties" as state store for 
> Metadata Table
> 
>
> Key: HUDI-4185
> URL: https://issues.apache.org/jira/browse/HUDI-4185
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.12.1
>
>
> Currently the Metadata Table uses the "hoodie.properties" file as a state
> store, adding properties that reflect the state of the metadata table being
> indexed.
> This is creating some issues (for example, HUDI-4138) with respect to the
> "hoodie.properties" lifecycle, as most of the existing code assumes that the
> file is (mostly) immutable.
> We should re-evaluate our usage of "hoodie.properties" as a state store,
> given that it has ripple effects on the existing components.
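
The immutability assumption matters because readers open the file without any
coordination, so a mutable use has to at least swap contents atomically. A
generic illustration of the write-temp-then-rename pattern (plain java.nio,
not Hudi's actual table-config code; note that rename is not atomic on plain
object stores such as S3):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

public class AtomicPropertiesUpdate {
  // Readers see either the old or the new file, never a half-written one.
  public static void atomicUpdate(Path propsFile, Properties updated) throws IOException {
    Path tmp = propsFile.resolveSibling(propsFile.getFileName() + ".tmp");
    try (OutputStream out = Files.newOutputStream(tmp)) {
      updated.store(out, "updated table state");
    }
    Files.move(tmp, propsFile, StandardCopyOption.ATOMIC_MOVE,
        StandardCopyOption.REPLACE_EXISTING);
  }
}
{code}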



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   4   >