[hudi] branch master updated: [MINOR] Update DOAP with 0.12.0 Release (#6413)

2022-08-16 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 042241fa2c [MINOR] Update DOAP with 0.12.0 Release (#6413)
042241fa2c is described below

commit 042241fa2ca5c90161dc6e062485eef4e0981962
Author: Sagar Sumit 
AuthorDate: Wed Aug 17 11:26:19 2022 +0530

[MINOR] Update DOAP with 0.12.0 Release (#6413)
---
 doap_HUDI.rdf | 5 +
 1 file changed, 5 insertions(+)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 7b784ec549..e153fb3d4c 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -96,6 +96,11 @@
         <created>2022-06-18</created>
         <revision>0.11.1</revision>
       </Version>
+      <Version>
+        <name>Apache Hudi 0.12.0</name>
+        <created>2022-08-16</created>
+        <revision>0.12.0</revision>
+      </Version>
     </release>
   </Project>
 </rdf:RDF>



[GitHub] [hudi] codope merged pull request #6413: [MINOR] Update DOAP with 0.12.0 Release

2022-08-16 Thread GitBox


codope merged PR #6413:
URL: https://github.com/apache/hudi/pull/6413


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on pull request #6413: [MINOR] Update DOAP with 0.12.0 Release

2022-08-16 Thread GitBox


codope commented on PR #6413:
URL: https://github.com/apache/hudi/pull/6413#issuecomment-1217493647

   CI failure is not caused by this patch.





[GitHub] [hudi] codope opened a new pull request, #6417: [HUDI-4565] Release note for version 0.12.0

2022-08-16 Thread GitBox


codope opened a new pull request, #6417:
URL: https://github.com/apache/hudi/pull/6417

   ### Change Logs
   
   - Release highlights in `website/releases/release-0.12.0.md`
   - Updated `website/releases/download.md`
   - Updated `docusaurus.config.js`
   
   Screenshot: https://user-images.githubusercontent.com/16440354/185044087-53426eef-3936-4700-9dd1-f67d258f7430.png
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] shubham-bungee commented on issue #6389: [SUPPORT] HELP :: Using TWO FIELDS to precombine :: 'hoodie.datasource.write.precombine.field': "column1,column2"

2022-08-16 Thread GitBox


shubham-bungee commented on issue #6389:
URL: https://github.com/apache/hudi/issues/6389#issuecomment-1217486370

   > Unfortunately, there is no out of the box solution to use two fields as 
preCombine for now.
   
   Thanks a lot for the reply.
   We are a startup planning to move to Hudi, so you might see a few more support 
tickets coming your way. 
   Your help would be great in building our new architecture. 
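
A possible workaround, sketched below as an illustration only (the Spark Java API is assumed; `column1`, `column2`, the record key `id`, and all paths are hypothetical), is to derive a single ordering column from the two fields before writing and point the precombine config at that derived column:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TwoFieldPrecombineWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("two-field-precombine").getOrCreate();

    // Hypothetical input containing column1 and column2.
    Dataset<Row> df = spark.read().parquet("/tmp/input");

    // Derive one ordering column from the two fields. Zero-padding or another
    // encoding may be needed so that lexical order matches the intended order.
    Dataset<Row> withOrdering = df.withColumn(
        "ordering_key", concat_ws("#", col("column1"), col("column2")));

    withOrdering.write().format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.recordkey.field", "id")
        // Single-field precombine pointing at the derived column.
        .option("hoodie.datasource.write.precombine.field", "ordering_key")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/my_table");
  }
}
```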





[GitHub] [hudi] hudi-bot commented on pull request #6413: [MINOR] Update DOAP with 0.12.0 Release

2022-08-16 Thread GitBox


hudi-bot commented on PR #6413:
URL: https://github.com/apache/hudi/pull/6413#issuecomment-1217469882

   
   ## CI report:
   
   * 98d233b95b8653fa681b2c24aa900c7a86adddf3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10785)
 
   
   
   Bot commands:
   @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-3407) Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer scenario

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3407:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer 
> scenario
> 
>
> Key: HUDI-3407
> URL: https://issues.apache.org/jira/browse/HUDI-3407
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently there's no guard-rail that would prevent Restore from running 
> concurrently with Writes in a Multi-Writer scenario, which might lead to the 
> table getting into an inconsistent state.
>  
> One approach could be letting Restore acquire the write lock for 
> the whole duration of its operation.
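
A minimal illustration of the proposed guard-rail, assuming a hypothetical lock wrapper rather than Hudi's actual lock provider API: Restore holds the table-level write lock for its entire duration, so concurrent writers block instead of committing against a table that is being rewound.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch only; names do not correspond to the real Hudi classes.
public class GuardedRestore {
  private final ReentrantLock tableWriteLock = new ReentrantLock();

  public void restoreToInstant(String instantTime, Runnable restoreAction) {
    tableWriteLock.lock();        // concurrent writers block here
    try {
      restoreAction.run();        // e.g. writeClient.restoreToInstant(instantTime)
    } finally {
      tableWriteLock.unlock();    // held for the whole duration of the restore
    }
  }
}
{code}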



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3585) Docs for (consistent) hashing index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3585:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Docs for (consistent) hashing index
> ---
>
> Key: HUDI-3585
> URL: https://issues.apache.org/jira/browse/HUDI-3585
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: docs
>Reporter: Yuwei Xiao
>Priority: Major
> Fix For: 0.12.1
>
>
> User documents related to the (consistent) hashing index will contain the 
> following content:
> - configs to enable the bucket index and tuning parameters (see the config 
> sketch below)
> - use cases and demos
> - limitations and restrictions
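
For the first item above, a config sketch of what enabling the bucket index might look like via Spark datasource write options; the exact keys and the CONSISTENT_HASHING engine value are assumptions to be verified while writing the docs, and the table/field names are placeholders.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class BucketIndexWriteSketch {
  public static void write(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "bucket_index_table")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        // Bucket index configs this ticket's docs would cover (to be verified).
        .option("hoodie.index.type", "BUCKET")
        .option("hoodie.index.bucket.engine", "CONSISTENT_HASHING")
        .option("hoodie.bucket.index.num.buckets", "256")
        .option("hoodie.bucket.index.hash.field", "id")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}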



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-10) Auto tune bulk insert parallelism #555

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-10?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-10:

Fix Version/s: 0.12.1
   (was: 0.12.0)

> Auto tune bulk insert parallelism #555
> --
>
> Key: HUDI-10
> URL: https://issues.apache.org/jira/browse/HUDI-10
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Minor
> Fix For: 0.12.1
>
>
> https://github.com/uber/hudi/issues/555



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-686:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, performance
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3965) Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for SparkSQLCLIDriver

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3965:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for 
> SparkSQLCLIDriver
> -
>
> Key: HUDI-3965
> URL: https://issues.apache.org/jira/browse/HUDI-3965
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> Spark SQL DML, when launched w/ the spark2 and scala 12 profile, fails with 
> ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver for both 0.10.0 and 
> 0.11.0. 
>  
> {code:java}
> java.lang.ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:816)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Failed to load main class 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
> // launch command
> ./bin/spark-sql --jars /home/hadoop/hudi-spark2.4-bundle_2.12-0.11.0-rc3.jar 
> --packages org.apache.spark:spark-avro_2.12:2.4.8 --conf 
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1053) Make ComplexKeyGenerator also support non partitioned Hudi dataset

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1053:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make ComplexKeyGenerator also support non partitioned Hudi dataset
> --
>
> Key: HUDI-1053
> URL: https://issues.apache.org/jira/browse/HUDI-1053
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync, storage-management, writer-core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently, when using ComplexKeyGenerator, a `default` partition is assumed. 
> Recently there has been interest in supporting non-partitioned Hudi datasets 
> that use ComplexKeyGenerator. This GitHub issue has context - 
> https://github.com/apache/hudi/issues/1747
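
As an illustration of the desired behaviour (not how it works today), a hedged sketch of writing a non-partitioned dataset with ComplexKeyGenerator; the column names and the empty partition-path value are assumptions for this example.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class NonPartitionedComplexKeySketch {
  public static void write(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "non_partitioned_table")
        // Composite record key handled by ComplexKeyGenerator.
        .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.ComplexKeyGenerator")
        .option("hoodie.datasource.write.recordkey.field", "id1,id2")
        // Desired behaviour per this ticket: an empty partition path field should
        // produce a non-partitioned table instead of a `default` partition.
        .option("hoodie.datasource.write.partitionpath.field", "")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}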



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-13) Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive cluster #553

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-13?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-13:

Fix Version/s: 0.12.1
   (was: 0.12.0)

> Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive 
> cluster #553
> 
>
> Key: HUDI-13
> URL: https://issues.apache.org/jira/browse/HUDI-13
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, hive, Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.12.1
>
>
> https://github.com/uber/hudi/issues/553



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-736) Simplify ReflectionUtils#getTopLevelClasses

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-736:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Simplify ReflectionUtils#getTopLevelClasses 
> 
>
> Key: HUDI-736
> URL: https://issues.apache.org/jira/browse/HUDI-736
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: new-to-hudi
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1061) Hudi CLI savepoint command fail because of spark conf loading issue

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1061:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Hudi CLI savepoint command fail because of spark conf loading issue
> ---
>
> Key: HUDI-1061
> URL: https://issues.apache.org/jira/browse/HUDI-1061
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Wenning Ding
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> h3. Reproduce
> Open hudi-cli.sh and run these two commands:
> {code:java}
> connect --path s3://wenningd-emr-dev/hudi/tables/events/hudi_null01
> savepoint create --commit 2019115109
> {code}
> You would see this error:
> {code:java}
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630)
>  at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97)
>  at org.apache.spark.SparkContext.(SparkContext.scala:523) at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) 
> at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:85) 
> at 
> org.apache.hudi.cli.commands.SavepointsCommand.savepoint(SavepointsCommand.java:79)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>  at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>  at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) 
> at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) at 
> java.lang.Thread.run(Thread.java:748){code}
> Although {{spark-defaults.conf}} configures {{spark.eventLog.dir  
>  hdfs:///var/log/spark/apps}}, hudi-cli still uses 
> {{file:/tmp/spark-events}} as the event log dir, which means the SparkContext 
> does not load the configs from {{spark-defaults.conf}}.
> We should make the initJavaSparkConf method able to read configs from the 
> Spark config file.
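
A rough sketch of the suggested fix, with simplifying assumptions (the real change would likely reuse Spark's own defaults-loading utilities; the class and method names here are illustrative): merge spark-defaults.conf into the SparkConf before hudi-cli creates its SparkContext.

{code:java}
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.spark.SparkConf;

public class SparkDefaultsLoader {
  // Illustrative only: fold spark-defaults.conf into the SparkConf used by
  // hudi-cli so settings such as spark.eventLog.dir are honoured.
  public static SparkConf withSparkDefaults(SparkConf conf) throws IOException {
    String sparkHome = System.getenv("SPARK_HOME");
    if (sparkHome == null) {
      return conf;
    }
    Properties props = new Properties();
    try (FileInputStream in =
             new FileInputStream(sparkHome + "/conf/spark-defaults.conf")) {
      props.load(in);  // "key value" lines parse as key/value pairs
    }
    props.forEach((k, v) -> conf.setIfMissing(k.toString(), v.toString().trim()));
    return conf;
  }
}
{code}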



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1101) Decouple Hive dependencies from hudi-spark

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1101:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Decouple Hive dependencies from hudi-spark
> --
>
> Key: HUDI-1101
> URL: https://issues.apache.org/jira/browse/HUDI-1101
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Yanjia Gary Li
>Priority: Major
> Fix For: 0.12.1
>
>
> We have the Hive sync tool in both the hudi-spark and hudi-utilities modules. 
> This might cause dependency conflicts when the user doesn't use Hive at all. 
> We could move all the hive-sync-related methods to the hudi-hive-sync module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1117:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, meta-sync
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> The JSONException class comes from 
> https://mvnrepository.com/artifact/org.json/json. There is a licensing issue, 
> and hence it is not part of the Hudi bundle packages. The underlying issue is 
> due to Hive 1.x vs 2.x (see 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> The Spark Hive integration still brings in Hive 1.x jars, which depend on 
> org.json. I believe this was provided in the user's environment, and hence we 
> have not seen folks complaining about this issue.
> Even though this is not a Hudi issue per se, let me check a jar with a 
> compatible license: https://mvnrepository.com/artifact/com.tdunning/json/1.8 
> and, if it works, we will add it to the 0.6 bundles after discussing with the 
> community. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1145) Debug if Insert operation calls upsert in case of small file handling path.

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1145:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Debug if Insert operation calls upsert in case of small file handling path.
> ---
>
> Key: HUDI-1145
> URL: https://issues.apache.org/jira/browse/HUDI-1145
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.12.1
>
>
> INSERT operations may be triggering UPSERT internally in the merging process 
> when dealing with small files. This surfaced out of a Slack thread. We need to 
> confirm whether this is indeed happening. If yes, this needs to be fixed such 
> that the merge handle should not call upsert and should instead let conflicting 
> records into the file if it is an INSERT operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1245) Make debugging Integ tests easier

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1245:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make debugging Integ tests easier
> -
>
> Key: HUDI-1245
> URL: https://issues.apache.org/jira/browse/HUDI-1245
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dev-experience, Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> Debugging integ-tests is hard and not easy for folks to 
> investigate. This effort tracks the work for the same.
>  
> Also, publish a guide for debugging such integ-tests.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3961) Encounter NoClassDefFoundError when using Spark 3.1 bundle and utilities slim bundle

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3961:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Encounter NoClassDefFoundError when using Spark 3.1 bundle and utilities slim 
> bundle
> 
>
> Key: HUDI-3961
> URL: https://issues.apache.org/jira/browse/HUDI-3961
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When running deltastreamer with both Spark 3.1 and utilities slim bundle 
> (compiled with Spark 3.2 profile), the following exception is thrown:
> {code:java}
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.1.3-bin-hadoop3.2
> export 
> HUDI_SPARK_BUNDLE_JAR=/Users/ethan/Work/lib/hudi_releases/0.11.0-rc3/hudi-spark3.1-bundle_2.12-0.11.0-rc3.jar
> export 
> HUDI_UTILITIES_SLIM_JAR=/Users/ethan/Work/lib/hudi_releases/0.11.0-rc3/hudi-utilities-slim-bundle_2.12-0.11.0-rc3.jar
> /Users/ethan/Work/lib/spark-3.1.3-bin-hadoop3.2/bin/spark-submit \
>       --master local[4] \
>       --driver-memory 4g --executor-memory 2g --num-executors 4 
> --executor-cores 1 \
>       --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>       --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>       --conf spark.sql.catalogImplementation=hive \
>       --conf spark.driver.maxResultSize=1g \
>       --conf spark.speculation=true \
>       --conf spark.speculation.multiplier=1.0 \
>       --conf spark.speculation.quantile=0.5 \
>       --conf spark.ui.port=6680 \
>       --conf spark.eventLog.enabled=true \
>       --conf spark.eventLog.dir=/Users/ethan/Work/data/hudi/spark-logs \
>       --packages org.apache.spark:spark-avro_2.12:3.1.3 \
>       --jars 
> /Users/ethan/Work/repo/hudi-benchmarks/target/hudi-benchmarks-0.1-SNAPSHOT.jar,$HUDI_SPARK_BUNDLE_JAR
>  \
>       --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>       $HUDI_UTILITIES_SLIM_JAR \
>       --props $TEST_ROOT_DIR/ds_mor.properties \
>       --source-class BenchmarkDataSource \
>       --source-ordering-field ts \
>       --target-base-path $TEST_ROOT_DIR/test_table \
>       --target-table test_table \
>       --table-type MERGE_ON_READ \
>       --op UPSERT \
>       --continuous{code}
>  
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieException: 
> java.lang.NoClassDefFoundError: org/apache/avro/AvroMissingFieldException
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:191)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:186)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:549)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.NoClassDefFoundError: org/apache/avro/AvroMissingFieldException
>     at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at 
> org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:189)
>     ... 15 more
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/avro/AvroMissingFieldException
>     at 
> org.apache.hudi.avro.model.HoodieCleanerPlan.newBuilder(HoodieCleanerPlan.java:246)
>     at 
> 

[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-992:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> For hive-style partitioned source data, partition columns synced with Hive 
> will always have String type
> ---
>
> Key: HUDI-992
> URL: https://issues.apache.org/jira/browse/HUDI-992
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap, meta-sync
>Affects Versions: 0.9.0
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently the bootstrap implementation is not able to handle partition columns 
> correctly when the source data has *hive-style partitioning*, as is also 
> mentioned in https://jira.apache.org/jira/browse/HUDI-915
> The schema inferred while performing bootstrap and stored in the commit 
> metadata does not have the partition column schema (in the case of 
> hive-partitioned data). As a result, during hive-sync, when Hudi tries to 
> determine the type of the partition column from that schema, it does not find 
> it and assumes the default data type *string*.
> Here is where the partition column schema is determined for hive-sync:
> [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417]
>  
> Thus, no matter what the data type of the partition column is in the source 
> data (at least what Spark infers it as from the path), it will always be 
> synced as string.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-488) Refactor Source classes in hudi-utilities

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-488:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Refactor Source classes in hudi-utilities 
> --
>
> Key: HUDI-488
> URL: https://issues.apache.org/jira/browse/HUDI-488
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.12.1
>
>
> There is copy-and-paste code in some of the Source classes due to the 
> current class inheritance structure. Refactoring this part should make it 
> easier and more efficient to create new sources and formats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1036:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
> ---
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Opening this Jira based on the GitHub issue reported here - 
> [https://github.com/apache/hudi/issues/1735]. When hive.input.format = 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, it is not able to 
> create a HoodieRealtimeFileSplit for querying the _rt table. Please see the 
> GitHub issue for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1590) Support async clustering w/ test suite job

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1590:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support async clustering w/ test suite job
> --
>
> Key: HUDI-1590
> URL: https://issues.apache.org/jira/browse/HUDI-1590
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.1
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> As of now, we only have inline clustering support w/ the hoodie test suite 
> job. We need to add support for async clustering. 
> This might be tricky since the regular writes should not overstep w/ 
> clustering; if they do, the pipeline will fail. So, data generation has to go 
> hand in hand w/ the clustering configs. For e.g., if clustering gets triggered 
> every 4 commits, data generation should switch partitions every 4 batches of 
> input. That way there won't be any overstepping and the pipeline can run for 
> as many iterations as needed. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3864) Avoid fetching all files for all partitions on the read/query path for flink

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3864:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Avoid fetching all files for all partitions on the read/query path for flink
> 
>
> Key: HUDI-3864
> URL: https://issues.apache.org/jira/browse/HUDI-3864
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Major
> Fix For: 0.12.1
>
>
> Fetching all files across all partitions should be avoided in the hot path, 
> especially on the query side. We should only fetch files for the partitions 
> of interest. 
> I inspected HoodieFileIndex for Spark and things look to be OK. We only load 
> files for the partitions involved in the query. 
>  
> {code:java}
> public BaseHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List<Path> queryPaths, 
> {code}
> queryPaths in the above argument contains only the partitions involved in the 
> split. 
> Later, when we load the files, we load only the matched partitions. 
> {code:java}
> private Map<PartitionPath, FileStatus[]> loadPartitionPathFiles() {
>   // List files in all partition paths
>   List<PartitionPath> pathToFetch = new ArrayList<>();
>   Map<PartitionPath, FileStatus[]> cachedPartitionToFiles = new HashMap<>();
>   // Fetch from the FileStatusCache
>   List<PartitionPath> partitionPaths = getAllQueryPartitionPaths();
>   partitionPaths.forEach(partitionPath -> {
>     Option<FileStatus[]> filesInPartition = 
> fileStatusCache.get(partitionPath.fullPartitionPath(basePath));
>     if (filesInPartition.isPresent()) {
>       cachedPartitionToFiles.put(partitionPath, filesInPartition.get());
>     } else {
>       pathToFetch.add(partitionPath);
>     }
>   });
>   Map<PartitionPath, FileStatus[]> fetchedPartitionToFiles;
>   if (pathToFetch.isEmpty()) {
>     fetchedPartitionToFiles = Collections.emptyMap();
>   } else {
>     Map<String, PartitionPath> fullPartitionPathsMapToFetch = 
> pathToFetch.stream()
>         .collect(Collectors.toMap(
>             partitionPath -> 
> partitionPath.fullPartitionPath(basePath).toString(),
>             Function.identity())
>         );
>     fetchedPartitionToFiles =
>         FSUtils.getFilesInPartitions(
>             engineContext,
>             metadataConfig,
>             basePath,
>             fullPartitionPathsMapToFetch.keySet().toArray(new String[0]),
>             fileSystemStorageConfig.getSpillableDir())
>         .entrySet()
>         .stream()
>         .collect(Collectors.toMap(e -> 
> fullPartitionPathsMapToFetch.get(e.getKey()), e -> e.getValue()));
>   }
>   // Update the fileStatusCache
>   fetchedPartitionToFiles.forEach((partitionPath, filesInPartition) -> {
>     fileStatusCache.put(partitionPath.fullPartitionPath(basePath), 
> filesInPartition);
>   });
>   return CollectionUtils.combine(cachedPartitionToFiles, 
> fetchedPartitionToFiles);
> } {code}
>  
> I also inspected Flink, and maybe we are loading all files across all 
> partitions. 
>  
> IncrementalInputSplits 
> [L180|https://github.com/apache/hudi/blob/d16740976e3aa89f2d934b0f1c48208dfe40bc5f/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java#L180]
> fileStatuses = fileIndex.getFilesInPartitions();
>  
> HoodieTableSource 
> [L298|https://github.com/apache/hudi/blob/d16740976e3aa89f2d934b0f1c48208dfe40bc5f/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java#L298]
> FileStatus[] fileStatuses = fileIndex.getFilesInPartitions();
>  
> I do see that we pass in the required partition paths in both places, but I 
> will leave it to the Flink experts to inspect the code once and close out the 
> ticket if no action is required. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1885) Support Delete/Update Non-Pk Table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1885:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support Delete/Update Non-Pk Table
> --
>
> Key: HUDI-1885
> URL: https://issues.apache.org/jira/browse/HUDI-1885
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, spark-sql
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.12.1
>
>
> Allow deleting/updating a non-pk table.
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi;
> delete from h0 where id = 10;
> update h0 set price = 10 where id = 12;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-747) Implement Rollback like API in HoodieWriteClient which can revert all actions

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-747:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Implement Rollback like API in HoodieWriteClient which can revert all actions 
> --
>
> Key: HUDI-747
> URL: https://issues.apache.org/jira/browse/HUDI-747
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.1
>
>
> Related to HUDI-716 and PR-1432
> The PR addresses the specific issue of deleting orphaned inflight/requested 
> clean actions created by older versions of Hudi. 
> Currently, rollback only reverts commit and delta-commit 
> operations. We can introduce a new API which will consistently revert all 
> pending actions: clean, compact, commit and delta-commit. Currently, we don't 
> roll back clean; instead, we expect future clean operations to finish up 
> pending cleans first. By having this new API (rollbackPendingActions), we can 
> let users consistently revert any actions if they want.
>  
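
Illustrative only, a possible shape for the proposed API; the interface and method names below are assumptions, not the actual HoodieWriteClient surface.

{code:java}
import java.util.List;

public interface PendingActionRollback {

  /**
   * Revert every pending (requested/inflight) action on the table regardless of
   * type: clean, compact, commit and delta-commit.
   *
   * @return instant times of the actions that were rolled back
   */
  List<String> rollbackPendingActions();
}
{code}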



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2157) Spark write the bucket index table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2157:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Spark write the bucket index table
> --
>
> Key: HUDI-2157
> URL: https://issues.apache.org/jira/browse/HUDI-2157
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1179) Add Row tests to all key generator test classes

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1179:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add Row tests to all key generator test classes
> ---
>
> Key: HUDI-1179
> URL: https://issues.apache.org/jira/browse/HUDI-1179
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1207) Add kafka implementation of write commit callback to Spark datasources

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1207:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add kafka implementation of write commit callback to Spark datasources
> --
>
> Key: HUDI-1207
> URL: https://issues.apache.org/jira/browse/HUDI-1207
> Project: Apache Hudi
>  Issue Type: Task
>Affects Versions: 0.9.0
>Reporter: wangxianghu#1
>Assignee: Trevorzhang
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2003:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Vinay
>Priority: Minor
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Context: 
> Submitted a Spark job to read 3-4B ORC records and write them out in Hudi 
> format. Creating the following table with all the runs that I carried out 
> based on different options:
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB
> COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB
> BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it feels that the compression ratio is off. 
>  
>  
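
For reference, a hedged sketch of the write options the runs above vary, including the static compression-ratio estimate that this ticket proposes to auto-compute; the table name, key fields and the 0.1 ratio are placeholder assumptions.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class FileSizingSketch {
  public static void write(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "sized_table")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        // Target max file size in bytes (PARQUET_FILE_MAX_BYTES above): 1 GB.
        .option("hoodie.parquet.max.file.size", String.valueOf(1024L * 1024 * 1024))
        // Static input-to-parquet compression ratio guess that users tune by
        // hand today; this ticket proposes computing it automatically.
        .option("hoodie.parquet.compression.ratio", "0.1")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}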



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1645) Add unit test to verify clean and rollback instants are archived correctly

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1645:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add unit test to verify clean and rollback instants are archived correctly
> --
>
> Key: HUDI-1645
> URL: https://issues.apache.org/jira/browse/HUDI-1645
> Project: Apache Hudi
>  Issue Type: Test
>  Components: table-service
>Affects Versions: 0.9.0
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.12.1
>
>
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiveLog.java
> The tests don't seem to cover clean/rollback instants. Add those instants and 
> make sure they are archived correctly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1748) Read operation will possibility fail on mor table rt view when a write operations is concurrency running

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1748:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Read operation will possibility fail on mor table rt view when a write 
> operations is concurrency running
> 
>
> Key: HUDI-1748
> URL: https://issues.apache.org/jira/browse/HUDI-1748
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: lrz
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, query-eng, 
> user-support-issues
> Fix For: 0.12.1
>
>
> During a read operation, a new base file may be produced by a concurrent write 
> operation; the read will then possibly hit an NPE during getSplit. Here 
> is the exception stack:
> !https://wa.vision.huawei.com/vision-file-storage/api/file/download/upload-v2/2021/2/15/qwx352829/7bacca8042104499b0991d50b4bc3f2a/image.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1574:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Trim existing unit tests to finish in much shorter amount of time
> -
>
> Key: HUDI-1574
> URL: https://issues.apache.org/jira/browse/HUDI-1574
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Minor
> Fix For: 0.12.1
>
>
> spark-client-tests
> 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable
> 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata
> 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> 158.361 s - in org.apache.hudi.index.TestHoodieIndex
> 156.196 s - in org.apache.hudi.table.TestCleaner
> 132.369 s - in 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction
> 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> 45.794 s - in org.apache.hudi.client.TestHoodieReadClient
> 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex
> 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution
> 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
> grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* 
> | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less
> hudi-utilities
> 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> 204.653 s - in 
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource
> 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource
> 26.189 s - in 
> org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector
> Other Tests
> 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat
> 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex
> 22.046 s - in 
> org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1275) Incremental TImeline Syncing causes compaction to fail with FileNotFound exception

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1275:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Incremental TImeline Syncing causes compaction to fail with FileNotFound 
> exception
> --
>
> Key: HUDI-1275
> URL: https://issues.apache.org/jira/browse/HUDI-1275
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> Context: [https://github.com/apache/hudi/issues/2020]
>  
>  
> {{20/08/25 07:17:13 WARN TaskSetManager: Lost task 3.0 in stage 41.0 (TID 
> 2540, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): 
> org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No 
> such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
> at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
> at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
> at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
> at 
> org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
> ... 26 more}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1247) Add jmh based benchmarking to hudi

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1247:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add jmh based benchmarking to hudi
> --
>
> Key: HUDI-1247
> URL: https://issues.apache.org/jira/browse/HUDI-1247
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing, tests-ci
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: performance
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1616) Abstract out one off operations within dag

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1616:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Abstract out one off operations within dag
> --
>
> Key: HUDI-1616
> URL: https://issues.apache.org/jira/browse/HUDI-1616
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Priority: Minor
> Fix For: 0.12.1
>
>
> In the existing test suite, we have a config called "execute_itr_count". When 
> this is set to N for a particular node, out of 50-odd iterations, this node 
> will be executed only on the Nth iteration. 
> Use-case: 
> we wish to execute the clustering node on the 10th iteration, but the entire 
> dag needs to be executed for 25 iterations. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1329) Support async compaction in spark DF write()

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1329:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support async compaction in spark DF write()
> 
>
> Key: HUDI-1329
> URL: https://issues.apache.org/jira/browse/HUDI-1329
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction, spark, table-service
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.1
>
>
> spark.write().format("hudi").option(operation, "run_compact") to run 
> compaction
>  
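
Based only on the one-liner above, a sketch of what the proposed usage might look like from the Java API; the "run_compact" operation value does not exist yet (that is the feature being requested), so this is illustrative, with placeholder table name and path.

{code:java}
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class RunCompactionViaDataFrameWrite {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().appName("run-compaction").getOrCreate();

    // Proposed (not yet supported): an empty write whose only purpose is to
    // trigger compaction on an existing MOR table.
    spark.emptyDataFrame().write().format("hudi")
        .option("hoodie.datasource.write.operation", "run_compact")  // hypothetical value
        .option("hoodie.table.name", "mor_table")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/mor_table");
  }
}
{code}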



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1380) Async cleaning does not work with Timeline Server

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1380:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Async cleaning does not work with Timeline Server
> -
>
> Key: HUDI-1380
> URL: https://issues.apache.org/jira/browse/HUDI-1380
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, table-service, timeline-server
>Reporter: Nishith Agarwal
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1212) GDPR: Support deletions of records on all versions of Hudi dataset

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1212:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> GDPR: Support deletions of records on  all versions of Hudi dataset
> ---
>
> Key: HUDI-1212
> URL: https://issues.apache.org/jira/browse/HUDI-1212
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.1
>
>
> Incremental pull should also stop returning the records on the historical 
> dataset when we delete them from the latest snapshot.
>  
> Context from Mailing list email :
>  
> Hello,
> I am Siva's colleague and I am working on the problem below as well.
> I would like to describe what we are trying to achieve with Hudi as well as 
> our current way of working and our GDPR and "Right To Be Forgotten " 
> compliance policies.
> Our requirements :
> - We wish to apply a strict interpretation of the RTBF.  In other words, when 
> we remove a person's data, it should be throughout the historical data and 
> not just the latest snapshot.
> - We wish to use Hudi to reduce our storage requirements using upserts and 
> don't want to have duplicates between commits.
> - We wish to retain history for persons who have not requested to be 
> forgotten and therefore we do not want to delete commit files from the 
> history as some have proposed.
> We have tried a couple of solutions, but so far without success :
> - replay the data omitting the data of the persons who have requested to be 
> forgotten.  We wanted to manipulate the commit times to rebuild the history.
> We found that we couldn't manipulate the commit times and retain the history.
> - replay the data omitting the data of the persons who have requested to be 
> forgotten, but writing to a date-based partition folder using the 
> "partitionpath" parameter.
> We found that commits using upserts between the partitionpath folders, do not 
> ignore data that is unchanged between 2 commit dates as when using the 
> default commit file system, so we will not save on our storage or speed up 
> our  processing using this technique.
> So basically we would like to find a way to apply a strict RTBF, GDPR, 
> maintain history and time-travel (large history) and save storage space using 
> Hudi.
> Can anyone see a way to achieve this?
> Kind Regards,
> David Rosalia
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2154) Add index key field into HoodieKey

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2154:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add index key field into HoodieKey
> --
>
> Key: HUDI-2154
> URL: https://issues.apache.org/jira/browse/HUDI-2154
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1823) Hive/Presto Integration with ORC

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1823:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Hive/Presto Integration with ORC
> 
>
> Key: HUDI-1823
> URL: https://issues.apache.org/jira/browse/HUDI-1823
> Project: Apache Hudi
>  Issue Type: Task
>  Components: storage-management
>Reporter: Teresa Kang
>Priority: Major
> Fix For: 0.12.1
>
>
> Implement HoodieOrcInputFormat to support ORC with spark/presto query engines.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2612) No need to define primary key for flink insert operation

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2612:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> No need to define primary key for flink insert operation
> 
>
> Key: HUDI-2612
> URL: https://issues.apache.org/jira/browse/HUDI-2612
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.12.1
>
>
> There is one exception: the MOR table may still need the pk to generate 
> {{HoodieKey}} for #preCombine and compaction merge.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2156) Cluster the table with bucket index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2156:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Cluster the table with bucket index
> ---
>
> Key: HUDI-2156
> URL: https://issues.apache.org/jira/browse/HUDI-2156
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1556) Add App Id and App name to HoodieDeltaStreamerMetrics

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1556:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add App Id and App name to HoodieDeltaStreamerMetrics
> -
>
> Key: HUDI-1556
> URL: https://issues.apache.org/jira/browse/HUDI-1556
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metrics
>Affects Versions: 0.9.0
>Reporter: wangxianghu#1
>Priority: Major
> Fix For: 0.12.1
>
>
> We need something unique to relate metric data to the Spark job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1271) Add utility scripts to perform Restores

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1271:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add utility scripts to perform Restores
> ---
>
> Key: HUDI-1271
> URL: https://issues.apache.org/jira/browse/HUDI-1271
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cli, Utilities
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.12.1
>
>
> We need to expose commands for performing restores.
> We have similar scripts for the cleaner: 
> org.apache.hudi.utilities.HoodieCleaner
> We need to add something similar for restores.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1698) Multiwriting for Flink / Java

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1698:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink, writer-core
>Reporter: Nishith Agarwal
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1779:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Fail to bootstrap/upsert a table which contains timestamp column
> 
>
> Key: HUDI-1779
> URL: https://issues.apache.org/jira/browse/HUDI-1779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, spark
>Reporter: lrz
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png
>
>
> Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file 
> which contains a timestamp column, it fails because of these issues:
> 1) At bootstrap, if the original parquet file was written by a Spark 
> application, Spark will by default save the timestamp as int96 (see 
> spark.sql.parquet.int96AsTimestamp), and bootstrap will fail because 
> Hudi cannot read the Int96 type yet. (This issue can be solved by upgrading 
> parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; please check 
> https://github.com/apache/parquet-mr/pull/831/files) 
> 2) After bootstrap, upserts will fail because we use the hoodie schema to 
> read the original parquet file. The schemas do not match because the hoodie schema 
> treats the timestamp as long while the original file stores it as Int96. 
> 3) After bootstrap, a partial update of a parquet file will fail, because 
> we copy the old record and save it with the hoodie schema (we miss a 
> convertFixedToLong operation like Spark does).
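A minimal sketch of the workaround mentioned in point 1, assuming parquet >= 1.12.0 is on the classpath and an existing `spark` session (the spark-submit form is an equivalent assumption, not taken from this ticket):

{code:scala}
// Ask parquet-avro to surface INT96 timestamps as fixed bytes instead of failing.
spark.sparkContext.hadoopConfiguration
  .set("parquet.avro.readInt96AsFixed", "true")

// Equivalently, at submit time:
//   --conf spark.hadoop.parquet.avro.readInt96AsFixed=true
{code}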



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1440) Allow option to override schema when doing spark.write

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1440:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Allow option to override schema when doing spark.write
> --
>
> Key: HUDI-1440
> URL: https://issues.apache.org/jira/browse/HUDI-1440
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Need ability to pass schema and use it to create RDD when creating input 
> batch from data-frame. 
>  
> df.write.format("hudi").option("hudi.avro.schema", "")..
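A sketch of how the requested option might look from the user side; the option key below follows the ticket text and is not an existing Hudi config, and `df`, the table name, and the path are hypothetical:

{code:scala}
// Hypothetical usage of the proposed option: pass an explicit Avro schema string
// so the input RDD/batch is created with it instead of the inferred DataFrame schema.
val avroSchema: String =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"string"},
    |  {"name":"ts","type":"long"},
    |  {"name":"amount","type":["null","double"],"default":null}
    |]}""".stripMargin

df.write.format("hudi")
  .option("hudi.avro.schema", avroSchema)
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("s3a://datalake/hudi/events")
{code}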



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2188) Improve test for the insert_overwrite and insert_overwrite_table in hoodieDeltaStreamer

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2188:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Improve test for the insert_overwrite and insert_overwrite_table in 
> hoodieDeltaStreamer
> ---
>
> Key: HUDI-2188
> URL: https://issues.apache.org/jira/browse/HUDI-2188
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Samrat Deb
>Assignee: Samrat Deb
>Priority: Major
> Fix For: 0.12.1
>
>
> InsertOverwrite overwrites only the partitions matching the incoming records. 
> We need to add a test that verifies insert_overwrite does not overwrite 
> mismatched partitions. 
> Reference: https://github.com/apache/hudi/pull/3184/files#r670993094



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2946) Upgrade maven plugin to make Hudi be compatible with higher Java versions

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2946:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Upgrade maven plugin to make Hudi be compatible with higher Java versions
> -
>
> Key: HUDI-2946
> URL: https://issues.apache.org/jira/browse/HUDI-2946
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Wenning Ding
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> I saw several issues while building Hudi w/ Java 11:
>  
> {{[ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar (default) on project 
> hudi-common: Execution default of goal 
> org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar failed: An API 
> incompatibility was encountered while executing 
> org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar: 
> java.lang.ExceptionInInitializerError: null[ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-shade-plugin:3.1.1:shade (default) on project 
> hudi-hadoop-mr-bundle: Error creating shaded jar: Problem shading JAR 
> /workspace/workspace/rchertar.bigtop.hudi-rpm-mainline-6.x-0.9.0/build/hudi/rpm/BUILD/hudi-0.9.0-amzn-1-SNAPSHOT/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.9.0-amzn-1-SNAPSHOT.jar
>  entry org/apache/hudi/hadoop/bundle/Main.class: 
> java.lang.IllegalArgumentException -> [Help 1]}}
>  
> We need to upgrade maven plugin versions to make Hudi compatible with Java 
> 11.
> Also upgrade dockerfile-maven-plugin to latest versions to support Java 11 
> [https://github.com/spotify/dockerfile-maven/pull/230]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2657) Make inlining configurable based on diff use-case.

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2657:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make inlining configurable based on diff use-case. 
> ---
>
> Key: HUDI-2657
> URL: https://issues.apache.org/jira/browse/HUDI-2657
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Prashant Wason
>Priority: Major
> Fix For: 0.12.1
>
>
> Make inlining configurable based on diff use-case.
> Files partition, column_stats and bloom might need inlining, but record-level 
> index may not need inline reading. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2260) Fix hardcoding of SimpleKeyGen for default KeyGenProp for virtual key configs

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2260:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Fix hardcoding of SimpleKeyGen for default KeyGenProp for virtual key configs
> -
>
> Key: HUDI-2260
> URL: https://issues.apache.org/jira/browse/HUDI-2260
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> Fix hardcoding of SimpleKeyGen for default KeyGenProp for virtual key configs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3000) [UMBRELLA] Consistent Hashing Index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3000:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> [UMBRELLA] Consistent Hashing Index
> ---
>
> Key: HUDI-3000
> URL: https://issues.apache.org/jira/browse/HUDI-3000
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: index
>Reporter: Yuwei Xiao
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2884) Allow loading external configs while querying Hudi tables with Spark

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2884:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Allow loading external configs while querying Hudi tables with Spark
> 
>
> Key: HUDI-2884
> URL: https://issues.apache.org/jira/browse/HUDI-2884
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, when doing Hudi queries w/ Spark, the external configurations are not 
> loaded. Say customers enabled metadata listing in their global config file; they 
> would then actually be querying w/o the metadata feature enabled. This CR fixes 
> this issue and allows loading global configs during the Hudi reading phase.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2638) Rewrite tests around Hudi index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2638:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Rewrite tests around Hudi index
> ---
>
> Key: HUDI-2638
> URL: https://issues.apache.org/jira/browse/HUDI-2638
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> There is duplicate code between `TestFlinkHoodieBloomIndex` and 
> `TestHoodieBloomIndex`, among other test classes. We should do one pass to 
> clean the test code once the refactoring is done.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2928:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Evaluate rebasing Hudi's default compression from Gzip to Zstd
> --
>
> Key: HUDI-2928
> URL: https://issues.apache.org/jira/browse/HUDI-2928
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance, storage-management
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot 
> 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png
>
>
> Currently, having Gzip as a default we prioritize Compression/Storage cost at 
> the expense of
>  * Compute (on the {+}write-path{+}): about *30%* of Compute burned during 
> bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) 
>  * Compute (on the {+}read-path{+}), as well as queries Latencies: queries 
> scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put 
> is *3-4x* less than Snappy, Zstd, 
> [EX|https://stackoverflow.com/a/56410326/3520840])
> P.S Spark switched its default compression algorithm to Snappy [a while 
> ago|https://github.com/apache/spark/pull/12256].
>  
> *EDIT*
> We should actually evaluate putting in 
> [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
>  instead of Snappy. It has compression ratios comparable to Gzip, while 
> bringing in much better performance:
> !image-2021-12-03-13-13-02-892.png!
> [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
>  
>  
>  
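For experimentation, a per-writer override is already possible without changing the default; a minimal sketch, assuming an existing DataFrame `df` and a hypothetical table name and path, and noting that whether "zstd" is usable depends on the parquet version and native codecs available on the cluster:

{code:scala}
// Try zstd (or snappy/gzip) for a single writer without touching Hudi defaults.
df.write.format("hudi")
  .option("hoodie.table.name", "amazon_reviews")
  .option("hoodie.parquet.compression.codec", "zstd")
  .mode("append")
  .save("s3a://datalake/hudi/amazon_reviews")
{code}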



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2762:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive
>Reporter: Rajesh Mahindra
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert-only log files. Both _ro and _rt tables are returning 0 
> rows. However, HMS does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2932) Debug CI failure for TestHoodieBackedMetadata#testCleaningArchivingAndCompaction

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2932:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Debug CI failure for 
> TestHoodieBackedMetadata#testCleaningArchivingAndCompaction
> 
>
> Key: HUDI-2932
> URL: https://issues.apache.org/jira/browse/HUDI-2932
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Major
> Fix For: 0.12.1
>
>
> TestHoodieBackedMetadata#testCleaningArchivingAndCompaction is flaky in CI, 
> whereas it consistently passes locally. Need to debug and find the root 
> cause. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2954) Code cleanup: HFileDataBock - using integer keys is never used

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2954:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Code cleanup: HFileDataBock - using integer keys is never used 
> ---
>
> Key: HUDI-2954
> URL: https://issues.apache.org/jira/browse/HUDI-2954
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Minor
> Fix For: 0.12.1
>
>
>  
> The KeyField can never be empty for HFile. Given that, there is really no need for 
> falling back to sequential integer keys in the 
> HFileDataBlock::serializeRecords() code path.
>  
> {noformat}
> // Build the record key
> final Field schemaKeyField = 
> records.get(0).getSchema().getField(this.keyField);
> if (schemaKeyField == null) {
>   // Missing key metadata field. Use an integer sequence key instead.
>   useIntegerKey = true;
>   keySize = (int) Math.ceil(Math.log(records.size())) + 1;
> }
> while (itr.hasNext()) {
>   IndexedRecord record = itr.next();
>   String recordKey;
>   if (useIntegerKey) {
> recordKey = String.format("%" + keySize + "s", key++);
>   } else {
> recordKey = record.get(schemaKeyField.pos()).toString();
>   }
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2237) Add virtual key support for ORC file format

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2237:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add virtual key support for ORC file format
> ---
>
> Key: HUDI-2237
> URL: https://issues.apache.org/jira/browse/HUDI-2237
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: storage-management, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2754) Performance improvement for IncrementalRelation

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2754:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Performance improvement for IncrementalRelation
> ---
>
> Key: HUDI-2754
> URL: https://issues.apache.org/jira/browse/HUDI-2754
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: incremental-query, performance
>Reporter: Jintao
>Assignee: Jintao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When HoodieIncrSource is used to fetch updates from another Hudi table, 
> the IncrementalRelation will be used to read the data. But it has a 
> performance issue because column pruning and predicate pushdown don't 
> happen. As a result, Hudi reads too much unnecessary data.
> By enabling the column pruning and predicate pushdown, the data to read is 
> reduced dramatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2504) Add configuration to make HoodieBootstrap support ignoring file suffix

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2504:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add configuration to make HoodieBootstrap support ignoring file suffix
> --
>
> Key: HUDI-2504
> URL: https://issues.apache.org/jira/browse/HUDI-2504
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: bootstrap
>Reporter: liujinhui
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2991) Add rename partition for spark sql

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2991:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add rename partition for spark sql
> --
>
> Key: HUDI-2991
> URL: https://issues.apache.org/jira/browse/HUDI-2991
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> – RENAME partition
> ||{{partition}}||
> |{{age=10}}|
> |{{age=11}}|
> |{{age=12}}|
>  {{ALTER TABLE default.StudentInfo PARTITION (age='10') RENAME TO PARTITION 
> (age='15');}}
> ||{{partition}}||
> |{{age=11}}|
> |{{age=12}}|
> |{{age=15}}|
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2988) Add Event time configuration: latency adjustment

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2988:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add Event time configuration: latency adjustment
> 
>
> Key: HUDI-2988
> URL: https://issues.apache.org/jira/browse/HUDI-2988
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Minor
> Fix For: 0.12.1
>
>
> hoodie.payload.event.time.adjust.seconds
> defaults to 0
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3049) Use flink table name as default synced hive table name

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3049:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Use flink table name as default synced hive table name
> --
>
> Key: HUDI-3049
> URL: https://issues.apache.org/jira/browse/HUDI-3049
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2860) Make timeline server work with concurrent/async table service

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2860:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make timeline server work with concurrent/async table service
> -
>
> Key: HUDI-2860
> URL: https://issues.apache.org/jira/browse/HUDI-2860
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> Make timeline server work with multiple concurrent writers. 
> As of now, if an executor is lagging wrt timeline server (timeline server 
> refreshes its state for every call if timeline has moved), we throw an 
> exception and the executor falls back to the secondary, which will list the file 
> system. 
>  
> Related ticket: https://issues.apache.org/jira/browse/HUDI-2761
>  
> We want to revisit this code and see how can we make timeline server work 
> with multi-writer scenario. 
>  
> Few points to consider:
> 1. Executors should try to call getLatestBaseFilesOnOrBefore() instead of 
> getLatestBaseFiles(). Not all calls have to be fixed; the ones doing conflict 
> resolution might always have to get the latest snapshot. 
> 2. Fix async services to use a separate write client in the deltastreamer flow.
> 3. Revisit every call from the executor and set the "REFRESH" param only when it 
> matters.
> 4. Sharing the embedded timeline server. 
> 5. Check for any holes: when C100 and C101 start concurrently and C101 
> finishes early, if C100 calls getLatestBaseFileOnOrBefore(), do we return 
> base files from C101? 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2808) Supports deduplication for streaming write

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2808:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Supports deduplication for streaming write
> --
>
> Key: HUDI-2808
> URL: https://issues.apache.org/jira/browse/HUDI-2808
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: WangMinChao
>Assignee: WangMinChao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, the flink changelog stream writes to a MOR table, which can be 
> deduplicated during batch reading, but it will not be deduplicated during 
> stream reading. However, many users hope that stream reading can also achieve 
> the upsert capability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2940) Support sync database and table created by Flink catalog to hive

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2940:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support sync database and table created by Flink catalog to hive
> 
>
> Key: HUDI-2940
> URL: https://issues.apache.org/jira/browse/HUDI-2940
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: dalongliu
>Priority: Major
> Fix For: 0.12.1
>
>
> As the title says, we should support syncing databases and tables created by the 
> Flink catalog to hive; this will help users analyze tables conveniently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2786) Failed to connect to namenode in Docker Demo on Apple M1 chip

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2786:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Failed to connect to namenode in Docker Demo on Apple M1 chip
> -
>
> Key: HUDI-2786
> URL: https://issues.apache.org/jira/browse/HUDI-2786
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, dev-experience
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.12.1
>
>
> {code:java}
> > ./setup_demo.sh 
> [+] Running 1/0
>  ⠿ compose  Warning: No resource found to remove  0.0s
> [+] Running 15/15
>  ⠿ namenode Pulled  1.4s
>  ⠿ kafka Pulled  1.3s
>  ⠿ presto-worker-1 Pulled  1.3s
>  ⠿ historyserver Pulled  1.4s
>  ⠿ adhoc-2 Pulled  1.3s
>  ⠿ adhoc-1 Pulled  1.4s
>  ⠿ graphite Pulled  1.3s
>  ⠿ sparkmaster Pulled  1.3s
>  ⠿ hive-metastore-postgresql Pulled  1.3s
>  ⠿ presto-coordinator-1 Pulled  1.3s
>  ⠿ spark-worker-1 Pulled  1.4s
>  ⠿ hiveserver Pulled  1.3s
>  ⠿ hivemetastore Pulled  1.4s
>  ⠿ zookeeper Pulled  1.3s
>  ⠿ datanode1 Pulled  1.3s
> [+] Running 16/16
>  ⠿ Network compose_default  Created  0.0s
>  ⠿ Container hive-metastore-postgresql  Started  1.1s
>  ⠿ Container kafkabroker  Started  1.1s
>  ⠿ Container zookeeper  Started

[jira] [Updated] (HUDI-3017) Infer FlinkStreamer options like table source

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3017:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Infer FlinkStreamer options like table source
> -
>
> Key: HUDI-3017
> URL: https://issues.apache.org/jira/browse/HUDI-3017
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Assignee: singh.zhang
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3304) support partial update on mor table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3304:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> support partial update on mor table 
> 
>
> Key: HUDI-3304
> URL: https://issues.apache.org/jira/browse/HUDI-3304
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Jian Feng
>Assignee: Jian Feng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: image2022-1-13_0-33-5.png
>
>
> h2. current status
>  * OverwriteNonDefaultsWithLatestAvroPayload implements partial update 
> behavior in the combineAndGetUpdateValue method
>  * Spark SQL also has a 'Merge into' syntax supporting partial update via 
> ExpressionPayload,
>  * both OverwriteNonDefaultsWithLatestAvroPayload and ExpressionPayload can 
> not handle partial update in the preCombine method, so they can only support 
> partial update with COW tables
> h2. solution
> Make the preCombine function also support partial update (need to pass the schema 
> as a parameter)
> !image2022-1-13_0-33-5.png|width=832,height=516!
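For reference, a sketch of how partial updates are typically requested today on a COW table via the payload class (assuming an existing DataFrame `df`; table name, path, and fields are hypothetical); the point of this ticket is that the same does not work for MOR because preCombine cannot do the per-field merge:

{code:scala}
// COW partial update: columns left at default/unset in the incoming batch keep
// their old values because combineAndGetUpdateValue merges field by field.
df.write.format("hudi")
  .option("hoodie.table.name", "users")
  .option("hoodie.datasource.write.recordkey.field", "user_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload")
  .mode("append")
  .save("s3a://datalake/hudi/users")
{code}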



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2867) Make HoodiePartitionPath optional

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2867:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make HoodiePartitionPath optional
> -
>
> Key: HUDI-2867
> URL: https://issues.apache.org/jira/browse/HUDI-2867
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> We should make the partition path optional and support this end to end for all 
> operations. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3335:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>  Labels: hudi-on-call, user-support-issues
> Fix For: 0.12.1
>
>
> Have a COW table with metadata enabled. Loading from Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but the query taking very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(:46)
>   at $anonfun$res3$1$adapted(:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> **
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> 

[jira] [Updated] (HUDI-3121) Spark datasource with bucket index unit test reuse

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3121:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Spark datasource with bucket index unit test reuse
> --
>
> Key: HUDI-3121
> URL: https://issues.apache.org/jira/browse/HUDI-3121
> Project: Apache Hudi
>  Issue Type: Test
>  Components: index, tests-ci
>Reporter: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>
> let `TestMORDataSourceWithBucket` reuse existing unit test by parameterizing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2544) Use standard builder pattern to refactor ConfigProperty

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2544:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Use standard builder pattern to refactor ConfigProperty
> ---
>
> Key: HUDI-2544
> URL: https://issues.apache.org/jira/browse/HUDI-2544
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, configs
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Minor
> Fix For: 0.12.1
>
>
> I notice that we currently define a ConfigProperty object with a non-standard 
> builder pattern. Only the `defaultValue` and `noDefaultValue` methods are 
> executed in `PropertyBuilder`.
>  
> And calling the `withAlternatives`, `sinceVersion`, `deprecatedAfter`, or 
> `withInferFunction` methods will create another ConfigProperty object, even 
> though it will be garbage-collected by the JVM later.
>  
> So, is it necessary to minor-refactor this?
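For illustration, a minimal sketch (not Hudi's actual API; all names below are made up) of the standard builder shape the ticket suggests, where every `withX` call mutates and returns the same builder and the object is only created once in `build()`:

{code:scala}
// Generic fluent-builder sketch: no intermediate ConfigProp objects are allocated.
final case class ConfigProp(key: String, defaultValue: Option[String],
                            alternatives: List[String], sinceVersion: Option[String])

final class ConfigPropBuilder(key: String) {
  private var default: Option[String] = None
  private var alternatives: List[String] = Nil
  private var since: Option[String] = None

  def defaultValue(v: String): ConfigPropBuilder = { default = Some(v); this }
  def withAlternatives(alts: String*): ConfigPropBuilder = { alternatives = alts.toList; this }
  def sinceVersion(v: String): ConfigPropBuilder = { since = Some(v); this }
  def build(): ConfigProp = ConfigProp(key, default, alternatives, since)
}

val prop = new ConfigPropBuilder("hoodie.example.key")
  .defaultValue("true")
  .withAlternatives("hoodie.example.key.old")
  .sinceVersion("0.12.0")
  .build()
{code}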



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3381) Rebase `HoodieMergeHandle` to operate on `HoodieRecord`

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3381:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Rebase `HoodieMergeHandle` to operate on `HoodieRecord`
> ---
>
> Key: HUDI-3381
> URL: https://issues.apache.org/jira/browse/HUDI-3381
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> From RFC-46:
> `HoodieWriteHandle`s will be  
>    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro 
> conversion)
>    2. Using Combining API engine to merge records (when necessary) 
>    3. Passes `HoodieRecord` as is to `FileWriter



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2768) Enable async timeline server by default

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2768:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Enable async timeline server by default
> ---
>
> Key: HUDI-2768
> URL: https://issues.apache.org/jira/browse/HUDI-2768
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: timeline-server, writer-core
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.12.1
>
>
> Enable async timeline server by default.
>  
> [https://github.com/apache/hudi/pull/3949]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3309) Integrate quickstart examples into integration tests

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3309:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Integrate quickstart examples into integration tests
> 
>
> Key: HUDI-3309
> URL: https://issues.apache.org/jira/browse/HUDI-3309
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, tests-ci
>Reporter: Raymond Xu
>Priority: Minor
> Fix For: 0.12.1
>
>
> - create integration test suite for quickstart examples
> - make the code examples on website pages generated from the code



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3351) Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3351:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Rebase Record combining semantic into `HoodieRecordCombiningEngine`
> ---
>
> Key: HUDI-3351
> URL: https://issues.apache.org/jira/browse/HUDI-3351
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> From RFC-46:
> Extract Record Combining (Merge) API from `HoodieRecordPayload` into a 
> standalone, stateless component – `HoodieRecordCombiningEngine`.
> Such component will be
> 1. Abstracted as stateless object providing API to combine records (according 
> to predefined semantics) for engines (Spark, Flink) of interest
> 2. Plug-in point for user-defined combination semantics



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3385) Implement Spark-specific `FileReader`s

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3385:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Implement Spark-specific `FileReader`s
> --
>
> Key: HUDI-3385
> URL: https://issues.apache.org/jira/browse/HUDI-3385
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> To fully avoid using any intermediate representation (Avro), we will have 
> to also implement engine-specific `FileReader`s
>  
> Initially, we will focus on Spark with other engines to follow



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3265) Implement a custom serializer for the WriteStatus

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3265:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Implement a custom serializer for the WriteStatus
> -
>
> Key: HUDI-3265
> URL: https://issues.apache.org/jira/browse/HUDI-3265
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: sivabalan narayanan
>Assignee: Gary Li
>Priority: Major
>  Labels: sev:normal
> Fix For: 0.12.1
>
>
> When the structure of WriteStatus changes and we restart the Flink job 
> with the new version, the job will fail to recover.
> *To Reproduce*
> Steps to reproduce the behavior:
>  # Start a flink job.
>  # Changed the WriteStatus and restart
>  # The job can't recover.
> We need to implement a custom serializer for the WriteStatus.
>  
> Ref issue: [https://github.com/apache/hudi/issues/4032]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3410) Revisit Record-reading Abstractions

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3410:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Revisit Record-reading Abstractions
> ---
>
> Key: HUDI-3410
> URL: https://issues.apache.org/jira/browse/HUDI-3410
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently, while the logic of combining all Delta Log files (into a set 
> of delta-records) is commonly unified across all query engines, actually 
> merging them with the base-files is not. 
> We need to revisit that to and make sure: 
>  * Record merging logic is shared across all Query engines
>  * There's no duplication of merging logic (currently merging log-files and 
> base-files are completely isolated)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3317) Partition specific pointed lookup/reading strategy for metadata table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3317:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Partition specific pointed lookup/reading strategy for metadata table
> -
>
> Key: HUDI-3317
> URL: https://issues.apache.org/jira/browse/HUDI-3317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, writer-core
>Reporter: Manoj Govindassamy
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.12.1
>
>
> Today inline reading can only be turned on for the entire metadata table, 
> meaning all partitions either have this feature enabled or not. But for smaller 
> partitions like "files", inlining is not preferable as it turns off external 
> spillable map caching of records, whereas for other partitions like 
> bloom_filters, inline reading is preferred. We need a partition-specific inline 
> reading strategy for the metadata table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3167) Update RFC27 with the design for the new HoodieIndex type based on metadata indices

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3167:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Update RFC27 with the design for the new HoodieIndex type based on metadata 
> indices 
> 
>
> Key: HUDI-3167
> URL: https://issues.apache.org/jira/browse/HUDI-3167
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, metadata
>Reporter: Manoj Govindassamy
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3189) Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3189:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Fallback to full table scan with incremental query when files are cleaned up 
> or archived for MOR table
> -
>
> Key: HUDI-3189
> URL: https://issues.apache.org/jira/browse/HUDI-3189
> Project: Apache Hudi
>  Issue Type: Task
>  Components: incremental-query
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available, sev:high
> Fix For: 0.12.1
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> [https://github.com/apache/hudi/pull/3946]
> was added to fall back to full table scan with incremental query on 2 
> occasions:
>  # files are cleaned up, but active timeline still returns the commits.
>  # commits are archived. 
>  
> There are two follow ups from the original PR. 
> a. fs.isExists() call should be routed to metadata table. 
> b. Add similar support to MOR table. 
>  
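For context, a sketch of the incremental query path this fallback affects, assuming an existing `spark` session (the base path and begin instant time below are hypothetical):

{code:scala}
// Incremental query: if files for some of the requested commits were already
// cleaned, or the commits were archived, the fallback from #3946 switches to a
// full table scan filtered on _hoodie_commit_time.
val incDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20220101000000")
  .load("s3a://datalake/hudi/events")
{code}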



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3342) MOR Delta Block Rollbacks not applied if Lazy Block reading is disabled

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3342:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> MOR Delta Block Rollbacks not applied if Lazy Block reading is disabled
> ---
>
> Key: HUDI-3342
> URL: https://issues.apache.org/jira/browse/HUDI-3342
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Forward Xu
>Priority: Critical
> Fix For: 0.12.1
>
>
> While working on HUDI-3322, I've spotted the following contraption:
> When we are rolling back Delta Commits, we add corresponding 
> {{ROLLBACK_PREVIOUS_BLOCK}} Command Block at the back of the "queue". When we 
> restore, we issue a sequence of Rollbacks, which means that stack if such 
> Rollback Blocks could be of size > 1.
> However, when reading that MOR table if the reader does not specify 
> `readBlocksLazily=true`, we'd be merging Blocks eagerly (when instants 
> increment) therefore essentially rendering such Rollback Blocks useless since 
> they can't "unmerge" previously merged records, resurrecting the data that 
> was supposed to be rolled back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3249) Performance Improvements

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3249:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Performance Improvements
> 
>
> Key: HUDI-3249
> URL: https://issues.apache.org/jira/browse/HUDI-3249
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.12.1, 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3241) Support Analyze table in Spark SQL

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3241:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support Analyze table in Spark SQL
> --
>
> Key: HUDI-3241
> URL: https://issues.apache.org/jira/browse/HUDI-3241
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html
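A sketch of the target user experience, assuming the standard Spark SQL syntax from the linked doc would simply work against a Hudi table (table and column names are hypothetical):

{code:scala}
// Desired end state: standard ANALYZE TABLE statements working on Hudi tables.
spark.sql("ANALYZE TABLE hudi_trips COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE hudi_trips COMPUTE STATISTICS FOR COLUMNS driver, fare")
{code}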



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3067) "Table already exists" error with multiple writers and dynamodb

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3067:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> "Table already exists" error with multiple writers and dynamodb
> ---
>
> Key: HUDI-3067
> URL: https://issues.apache.org/jira/browse/HUDI-3067
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Nikita Sheremet
>Assignee: Wenning Ding
>Priority: Critical
> Fix For: 0.12.1
>
>
> How to reproduce:
>  # Set up multi-writer concurrency control 
> [https://hudi.apache.org/docs/concurrency_control/] for dynamodb (do not 
> forget to set _hoodie.write.lock.dynamodb.region_ and 
> {_}hoodie.write.lock.dynamodb.billing_mode{_}). Do not create any dynamodb 
> table.
>  # Run multiple writers to the table
> (Tested on AWS EMR, so the multiple writers are EMR steps.)
> Expected result: all steps completed.
> Actual result: some steps failed with exception 
> {code:java}
> Caused by: com.amazonaws.services.dynamodbv2.model.ResourceInUseException: 
> Table already exists: truedata_detections (Service: AmazonDynamoDBv2; Status 
> Code: 400; Error Code: ResourceInUseException; Request ID:; Proxy: null)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1403)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1372)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
>   at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
>   at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6214)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6181)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeCreateTable(AmazonDynamoDBClient.java:1160)
>   at 
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.createTable(AmazonDynamoDBClient.java:1124)
>   at 
> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.createLockTableInDynamoDB(DynamoDBBasedLockProvider.java:188)
>   at 
> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:99)
>   at 
> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:77)
>   ... 54 more
> 21/12/19 13:42:06 INFO Yar {code}
> This happens because all steps tried to create the table at the same time.
>  
> Suggested solution:
> A catch statement for the _Table already exists_ exception should be added to the 
> DynamoDB table creation code, possibly with a delay and an additional check that 
> the table is present.
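
A minimal sketch of that suggestion, assuming the AWS SDK v1 DynamoDB client already visible in the stack trace above; the class name, retry bounds, and table-status polling below are illustrative only, not the actual Hudi fix:

{code:java}
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.DescribeTableRequest;
import com.amazonaws.services.dynamodbv2.model.ResourceInUseException;

public class LockTableBootstrap {

  // Create the lock table if it does not exist yet; if another writer created it
  // concurrently, swallow the ResourceInUseException and just wait for it to be usable.
  static void createLockTableIfNeeded(AmazonDynamoDB dynamoDb, CreateTableRequest request)
      throws InterruptedException {
    try {
      dynamoDb.createTable(request);
    } catch (ResourceInUseException e) {
      // Another writer won the race; fall through and wait for ACTIVE below.
    }
    waitForTableActive(dynamoDb, request.getTableName());
  }

  // Poll DescribeTable until the table reports ACTIVE (simple bounded retry loop).
  static void waitForTableActive(AmazonDynamoDB dynamoDb, String tableName)
      throws InterruptedException {
    for (int attempt = 0; attempt < 30; attempt++) {
      String status = dynamoDb.describeTable(new DescribeTableRequest().withTableName(tableName))
          .getTable().getTableStatus();
      if ("ACTIVE".equals(status)) {
        return;
      }
      Thread.sleep(2000L);
    }
    throw new IllegalStateException("Lock table " + tableName + " did not become ACTIVE in time");
  }
}
{code}

With this shape, every concurrent writer either creates the table or waits for the writer that did, instead of failing the whole step.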



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3259) Code Refactor: Common prep records commit util for Spark and Flink

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3259:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Code Refactor: Common prep records commit util for Spark and Flink
> --
>
> Key: HUDI-3259
> URL: https://issues.apache.org/jira/browse/HUDI-3259
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata, writer-core
>Reporter: Manoj Govindassamy
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3155) java.lang.NoSuchFieldError for logical timestamp types when run hive sync tool

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3155:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> java.lang.NoSuchFieldError for logical timestamp types when run hive sync tool
> --
>
> Key: HUDI-3155
> URL: https://issues.apache.org/jira/browse/HUDI-3155
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, meta-sync
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.1
>
>
> https://github.com/apache/hudi/issues/4176
> Looks like parquet-column is not part of the bundle



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3391) presto and hive beeline fails to read MOR table w/ 2 or more array fields

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3391:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> presto and hive beeline fails to read MOR table w/ 2 or more array fields
> -
>
> Key: HUDI-3391
> URL: https://issues.apache.org/jira/browse/HUDI-3391
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.12.1
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> We have an issue reported by a user 
> [here|https://github.com/apache/hudi/issues/2657]. Looks like w/ 0.10.0 or 
> later, Spark datasource reads work, but Hive beeline does not. Querying the 
> Hive table via spark.sql works as well. 
> Another related ticket: 
> [https://github.com/apache/hudi/issues/3834#issuecomment-997307677]
>  
> Steps that I tried:
> [https://gist.github.com/nsivabalan/fdb8794104181f93b9268380c7f7f079]
> From beeline, you will encounter the exception below:
> {code:java}
> Failed with exception 
> java.io.IOException:org.apache.hudi.org.apache.avro.SchemaParseException: 
> Can't redefine: array {code}
> All linked tickets state that upgrading parquet to 1.11.0 or greater should 
> work. We need to try it out w/ the latest master and go from there. 
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3214) Optimize auto partition in spark

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3214:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Optimize auto partition in spark
> 
>
> Key: HUDI-3214
> URL: https://issues.apache.org/jira/browse/HUDI-3214
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Recently, if a partition's value has a format like 
> "pt1=/pt2=/pt3=" split by slashes, Hudi will partition 
> automatically. The directory of such a table will have a multi-level partition 
> structure.
> I think this is unpredictable, so this umbrella task was created to optimize auto 
> partitioning and make the behavior more reasonable.
> Also, in Hudi 0.8 the schema will hold `pt1`, `pt2`, `pt3`, but not in 0.9+.
> There are a few sub-tasks:
>  * add a flag to control whether auto-partitioning is enabled, to make the default 
> behavior reasonable.
>  * provide a new key generator designed specifically for this scenario.
>  * solve the bug where the schema differs depending on whether 
> *hoodie.file.index.enable* is enabled in this case.
>  
> Test Codes: 
> {code:java}
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> // needed below for regexp_replace and the $"col" column syntax
> import org.apache.spark.sql.functions._
> import spark.implicits._
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", 
> "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))
> newDf.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(PRECOMBINE_FIELD_OPT_KEY, "ts").
> option(RECORDKEY_FIELD_OPT_KEY, "uuid").
> option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> option(TABLE_NAME, tableName).
> mode(Overwrite).
> save(basePath) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3533) Refactor FileSystemBackedTableMetadata and related classes to support getBloomFilters directly

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3533:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Refactor FileSystemBackedTableMetadata and related classes to support 
> getBloomFilters directly
> --
>
> Key: HUDI-3533
> URL: https://issues.apache.org/jira/browse/HUDI-3533
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, writer-core
>Reporter: Ethan Guo
>Priority: Minor
> Fix For: 0.12.1
>
>
> The API {{getBloomFilters}} is not supported in FileSystemBackedTableMetadata:
> {code:java}
> @Override
>   public Map<Pair<String, String>, ByteBuffer> getBloomFilters(final 
> List<Pair<String, String>> partitionNameFileNameList)
>   throws HoodieMetadataException {
> throw new HoodieMetadataException("Unsupported operation: 
> getBloomFilters!");
>   }{code}
> It's better to support bloom filters without the metadata table from 
> FileSystemBackedTableMetadata as well, to unify the logic and reduce the 
> special-cased handling of bloom filters between metadata-table-backed and 
> file-system-backed metadata.
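
A rough sketch of that direction; the class and the readBloomFilterFromFooter helper below are hypothetical stand-ins (a real implementation would parse the base file's footer, e.g. the parquet key/value metadata), not actual Hudi code:

{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hudi.common.util.collection.Pair;

// Sketch: serve bloom filters straight from base-file footers instead of the metadata table.
public class FileFooterBloomFilterReader {

  public Map<Pair<String, String>, ByteBuffer> getBloomFilters(Configuration conf, Path basePath,
      List<Pair<String, String>> partitionNameFileNameList) {
    Map<Pair<String, String>, ByteBuffer> result = new HashMap<>();
    for (Pair<String, String> partitionAndFile : partitionNameFileNameList) {
      // Resolve basePath/partition/fileName and read its footer.
      Path filePath = new Path(new Path(basePath, partitionAndFile.getLeft()), partitionAndFile.getRight());
      ByteBuffer bloomFilter = readBloomFilterFromFooter(conf, filePath);
      if (bloomFilter != null) {
        result.put(partitionAndFile, bloomFilter);
      }
    }
    return result;
  }

  // Hypothetical helper: would extract the serialized bloom filter from the file footer.
  protected ByteBuffer readBloomFilterFromFooter(Configuration conf, Path filePath) {
    throw new UnsupportedOperationException("footer parsing not shown in this sketch");
  }
}
{code}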



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3314) support merge into with no-pk condition

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3314:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> support merge into with no-pk condition
> ---
>
> Key: HUDI-3314
> URL: https://issues.apache.org/jira/browse/HUDI-3314
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3487) The global index is enabled regardless of changelog

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3487:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> The global index is enabled regardless of changelog
> --
>
> Key: HUDI-3487
> URL: https://issues.apache.org/jira/browse/HUDI-3487
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink, index
>Reporter: waywtdcc
>Assignee: waywtdcc
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3210) [UMBRELLA] A new Presto connector for Hudi

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3210:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> [UMBRELLA] A new Presto connector for Hudi
> --
>
> Key: HUDI-3210
> URL: https://issues.apache.org/jira/browse/HUDI-3210
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: trino-presto
>Reporter: Todd Gao
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.1, 1.0.0
>
>
> This JIRA tracks all the tasks related to building a new Hudi connector in 
> Presto.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3354) Rebase `HoodieRealtimeRecordReader` to return `HoodieRecord`

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3354:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Rebase `HoodieRealtimeRecordReader` to return `HoodieRecord`
> 
>
> Key: HUDI-3354
> URL: https://issues.apache.org/jira/browse/HUDI-3354
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> From RFC-46:
> `HoodieRealtimeRecordReader`s:
> 1. The API will return an opaque `HoodieRecord` instead of the raw Avro payload.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3717) Avoid double-listing w/in BaseHoodieTableFileIndex

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3717:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Avoid double-listing w/in BaseHoodieTableFileIndex
> --
>
> Key: HUDI-3717
> URL: https://issues.apache.org/jira/browse/HUDI-3717
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2022-03-25 at 7.05.09 PM.png, Screen Shot 
> 2022-03-25 at 7.05.43 PM.png, Screen Shot 2022-03-25 at 7.14.20 PM.png
>
>
> Currently, `BaseHoodieTableFileIndex::loadPartitionPathFiles` essentially 
> does file-listing twice: 
>  * once when `getAllQueryPartitionPaths` is invoked
>  * a second time when `getFilesInPartitions` is invoked
>  
> While this will not result in double-listing of the files on the FS (b/c of the 
> `FileStatusCache`, if any), it does lead to the metadata table being queried twice: 
> !Screen Shot 2022-03-25 at 7.14.20 PM.png!
>  
> !Screen Shot 2022-03-25 at 7.05.09 PM.png!
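
A toy illustration of the remedy, not the actual Hudi classes: list each partition path at most once and let both consumers reuse the cached result, so the metadata table (or file system) is only hit a single time per partition.

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;

// Illustration only: cache per-partition listings so repeated callers share one listing pass.
public class SinglePassPartitionLister {

  // Hypothetical listing hook standing in for the metadata-table / file-system call.
  public interface PartitionLister {
    FileStatus[] listFiles(String partitionPath);
  }

  private final Map<String, FileStatus[]> listingCache = new HashMap<>();
  private final PartitionLister lister;

  public SinglePassPartitionLister(PartitionLister lister) {
    this.lister = lister;
  }

  public FileStatus[] getFilesInPartition(String partitionPath) {
    // computeIfAbsent guarantees the underlying listing runs at most once per partition.
    return listingCache.computeIfAbsent(partitionPath, lister::listFiles);
  }
}
{code}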



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   4   5   >