[jira] [Updated] (HUDI-4248) Upgrade Apache Avro version for hudi-flink

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4248:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Upgrade Apache Avro version for hudi-flink
> --
>
> Key: HUDI-4248
> URL: https://issues.apache.org/jira/browse/HUDI-4248
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies, flink
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Major
> Fix For: 0.14.0
>
>
> [CVE-2021-43045|https://github.com/advisories/GHSA-868x-rg4c-cjqg]
> Recommended upgrade version: 1.11.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3519) Make sure every public Hudi Client Method invokes necessary prologue

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3519:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure every public Hudi Client Method invokes necessary prologue
> 
>
> Key: HUDI-3519
> URL: https://issues.apache.org/jira/browse/HUDI-3519
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.14.0
>
>
> Right now, only a handful of operations actually invoke the "prologue" method
> that, for example:
>  # Checks whether the table needs to be upgraded
>  # Bootstraps the MDT (if necessary)
> as well as does some other minor book-keeping. As part of
> [https://github.com/apache/hudi/pull/4739] I had to address that and
> introduced the universal method `initTable` that serves as such a prologue.
> However, while I've injected it into most major public methods of the Hudi
> Client's base class, we need to carefully and holistically review all
> remaining exposed *public* methods and make sure that all _public-facing_
> operations (insert, upsert, commit, delete, rollback, clean, etc.) invoke the
> prologue properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4245) Support nested fields in Column Stats Index

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4245:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support nested fields in Column Stats Index
> ---
>
> Key: HUDI-4245
> URL: https://issues.apache.org/jira/browse/HUDI-4245
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, only root-level fields are supported in the Column Stats Index,
> while there's no reason for us not to be able to support nested fields, given
> that columnar file formats store nested fields as _nested columns,_ i.e. as
> columns named after the field and the struct it belongs to.
>  
> For example, the following schema:
> {code:java}
> c1: StringType
> c2: StructType(Seq(StructField("foo", StringType))){code}
> Would be stored in Parquet as "c1: string", "c2.foo: string", entailing that 
> Parquet actually already collects statistics for all the nested fields and we 
> just need to make sure we're propagating them into Column Stats Index
>  
> Original GH issue:
> [https://github.com/apache/hudi/issues/5804#issuecomment-1152983029]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3674) Remove unnecessary HBase-related dependencies from bundles if there is any

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3674:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Remove unnecessary HBase-related dependencies from bundles if there is any
> --
>
> Key: HUDI-3674
> URL: https://issues.apache.org/jira/browse/HUDI-3674
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> [https://github.com/apache/hudi/pull/5004/files] A follow-up of HUDI-1180. 
> vinothchandar 6 days ago Member
> Is this the absolute minimal set of artifacts needed?
>  
>  alexeykudinkin 6 days ago Contributor
> This need not be taken on as part of this PR, but I actually want to suggest
> going one step further:
> Since we're mostly reliant on HFile and the classes it depends on, can we
> try to filter out the packages whose removal won't break it?
> My hunch is that we can greatly reduce the 16Mb overhead just by cleaning
> up all the stuff that is bolted onto HBase.
> 
>  
>  codope 4 days ago Member
> That's a good idea. In fact, I've tried it out, but it's a very manual,
> time-consuming process to verify, and I gave up after a few failures. We also
> need to keep future upgrades in mind. But I would be very happy to reduce the
> bundle size in any way we can, and we should take another stab at this idea in
> the future.
>  
>  yihua 4 days ago Author Member
> Yeah, that's good to have. The problem, as @codope pointed out, is that such a
> process is time-consuming. For now, what I can say is that the newly added
> artifacts are necessary, since I started with the old pom and incrementally
> added new artifacts as I saw NoClassDefFoundError exceptions, until every test passed.
> One thing we may try later is to add and trim hudi-hbase-shaded by excluding 
> transitives and only depend on hudi-hbase-shaded here.
>  
>  alexeykudinkin 3 days ago Contributor
> Yeah, it's a tedious manual process for sure, but I think we can do it pretty
> fast: we just look at the packages imported by HFile, then at the files those
> imports pull in, and so on. After that we can run the tests to check whether we
> collected everything properly.
> The hypothesis is that this set should be reasonably bounded (why wouldn't
> it be?), so this iteration should be pretty fast.
> Can you please create a task and link it here to follow-up?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2767) Enable timeline server based marker type as default

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2767:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Enable timeline server based marker type as default
> ---
>
> Key: HUDI-2767
> URL: https://issues.apache.org/jira/browse/HUDI-2767
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Enable timeline server based marker type as default
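For reference, a minimal sketch of how a writer can opt into this marker type explicitly today (the options map is illustrative; the config key and value are existing Hudi write settings). Once this becomes the default, the explicit option would no longer be needed:
{code:java}
// Illustrative only: explicitly selecting timeline-server-based markers via write options
// (uses java.util.Map / java.util.HashMap).
Map<String, String> writeOptions = new HashMap<>();
writeOptions.put("hoodie.write.markers.type", "TIMELINE_SERVER_BASED");
{code}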



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3879) Suppress exceptions that are not fatal in HoodieMetadataTableValidator

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3879:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Suppress exceptions that are not fatal in HoodieMetadataTableValidator
> --
>
> Key: HUDI-3879
> URL: https://issues.apache.org/jira/browse/HUDI-3879
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Yue Zhang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> If there is no partition available yet, only print a warning message and 
> continue, without printing the exception.
> {code:java}
> org.apache.hudi.exception.HoodieException: Unable to do hoodie metadata table 
> validation in 
> file:/Users/ethan/Work/scripts/mt_rollout_testing/deploy_c_multi_writer/c5_mor_09mt_011mt/test_table
>     at 
> org.apache.hudi.utilities.HoodieMetadataTableValidator.run(HoodieMetadataTableValidator.java:364)
>     at 
> org.apache.hudi.utilities.HoodieMetadataTableValidator.main(HoodieMetadataTableValidator.java:345)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.IllegalArgumentException: Positive number of partitions 
> required
>     at 
> org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:118)
>     at 
> org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:96)
>     at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>     at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>     at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>     at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362)
>     at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361)
>     at 
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
>     at 
> org.apache.hudi.data.HoodieJavaRDD.collectAsList(HoodieJavaRDD.java:157)
>     at 
> org.apache.hudi.utilities.HoodieMetadataTableValidator.doMetadataTableValidation(HoodieMetadataTableValidator.java:451)
>     at 
> org.apache.hudi.utilities.HoodieMetadataTableValidator.doHoodieMetadataTableValidationOnce(HoodieMetadataTableValidator.java:375)
>     at 
> org.apache.hudi.utilities.HoodieMetadataTableValidator.run(HoodieMetadataTableValidator.java:361)
>     ... 13 more  {code}
> Suppress the TableNotFound exception if Metadata table is not available to 
> read for now:
> {code:java}
> 22/04/11 17:05:57 WARN HoodieMetadataTableValidator: Metadata table is not 
> available to ready for now, 
> org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in 
> path 
> file:/Users/ethan/Work/scripts/mt_rollout_testing/deploy_c_multi_writer/c5_mor_09mt_011mt/test_table/.hoodie/metadata/.hoodie
>     at 
> org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:57)
>     at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:125)
>     at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:79)
>     at 
> 
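The suggested handling amounts to catching the TableNotFoundException around the metadata-table read and logging a warning instead of failing; a minimal sketch under that assumption (the logger and the path variable are assumed names, and this is not the actual validator code):
{code:java}
try {
  HoodieTableMetaClient metadataMetaClient = HoodieTableMetaClient.builder()
      .setConf(hadoopConf)
      .setBasePath(metadataTableBasePath)
      .build();
  // ... continue validation against the metadata table ...
} catch (TableNotFoundException e) {
  // Metadata table not bootstrapped yet: warn and continue without dumping the stack trace.
  LOG.warn("Metadata table is not available to read for now: " + metadataTableBasePath);
}
{code}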

[jira] [Updated] (HUDI-3115) Kafka Connect should not be packaged as a bundle

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3115:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Kafka Connect should not be packaged as a bundle
> 
>
> Key: HUDI-3115
> URL: https://issues.apache.org/jira/browse/HUDI-3115
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies, kafka-connect
>Reporter: cdmikechen
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> Currently, the Kafka Connect module is packaged as a bundle, but in fact most
> Kafka Connect projects do not package all dependencies into one jar.
> I hope that the Maven packaging can be adjusted so that it can be
> easily synchronized to the Confluent Hub in the future.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3531) Review and shade transitive dependencies in hudi bundle jar

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3531:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Review and shade transitive dependencies in hudi bundle jar
> ---
>
> Key: HUDI-3531
> URL: https://issues.apache.org/jira/browse/HUDI-3531
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: vinoyang
>Priority: Critical
> Fix For: 0.14.0
>
>
> Detailed feedback in 
> https://github.com/apache/hudi/issues/4793#issuecomment-1038016578
> Scope
> - review and adjust the bundling and shaded dependencies
> - test and verify functionalities in different environments and downstream 
> integration (e.g. with Kyuubi)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3321) HFileWriter, HFileReader and HFileDataBlock should avoid hardcoded key field name

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3321:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> HFileWriter, HFileReader and HFileDataBlock should avoid hardcoded key field 
> name
> -
>
> Key: HUDI-3321
> URL: https://issues.apache.org/jira/browse/HUDI-3321
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> Today, HFileReader has a hardcoded key field name for the schema. This key
> field is used by HFileWriter and HFileDataBlock for the HFile key deduplication
> feature. When users want to use the HFile format, this hardcoded key field
> name can prevent the use case. We need a way to pass the
> writer/reader/query configs into the HFile storage layer so that the right
> key field name is used.
>  
> Related: HUDI-2763



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3317) Partition specific pointed lookup/reading strategy for metadata table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-3317:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Partition specific pointed lookup/reading strategy for metadata table
> -
>
> Key: HUDI-3317
> URL: https://issues.apache.org/jira/browse/HUDI-3317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, writer-core
>Reporter: Manoj Govindassamy
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.0
>
>
> Today, inline reading can only be turned on for the entire metadata table,
> meaning all partitions either have this feature enabled or not. But for smaller
> partitions like "files", inline reading is not preferable, as it turns off external
> spillable map caching of records, whereas for other partitions like
> bloom_filters, inline reading is preferred. We need a partition-specific inline
> reading strategy for the metadata table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2737) Use earliest instant by default for compaction and clustering job

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2737:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Use earliest instant by default for compaction and clustering job
> -
>
> Key: HUDI-2737
> URL: https://issues.apache.org/jira/browse/HUDI-2737
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, HoodieCompactor (compaction) and HoodieClusteringJob (clustering)
> require a command-line argument with the instant time for async executions. To
> improve the usability of these jobs, by default the jobs can search for the
> earliest instant of the corresponding action to execute, saving the user the
> step of looking up the instant time.
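A rough sketch of the proposed default for the compaction case, assuming the standard timeline API (the clustering job could presumably filter pending replacecommits analogously):
{code:java}
// If no instant time is passed on the command line, fall back to the earliest
// pending compaction instant on the active timeline.
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
    .setConf(hadoopConf)
    .setBasePath(basePath)
    .build();
Option<HoodieInstant> earliest =
    metaClient.getActiveTimeline().filterPendingCompactionTimeline().firstInstant();
if (earliest.isPresent()) {
  String instantTime = earliest.get().getTimestamp();
  // ... schedule/execute compaction for instantTime ...
}
{code}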



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2736) Redundant metadata table initialization by the metadata writer

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2736:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Redundant metadata table initialization by the metadata writer
> --
>
> Key: HUDI-2736
> URL: https://issues.apache.org/jira/browse/HUDI-2736
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
>  
> HoodieBackedTableMetadataWriter has redundant table initialization in the 
> following code paths.
>  # Constructor => initialize() => bootstrap  => initTableMetadata()
>  # Constructor => initTableMetadata()
>  # Flink client => preCommit => initTableMetadata()
> Apart from that, the timeline is also refreshed inside initTableMetadata().
> Before removing the redundant calls, we need to verify whether they were added
> in order to pick up the latest timeline.
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2458) Relax compaction in metadata being fenced based on inflight requests in data table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2458:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Relax compaction in metadata being fenced based on inflight requests in data 
> table
> --
>
> Key: HUDI-2458
> URL: https://issues.apache.org/jira/browse/HUDI-2458
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Relax compaction in metadata being fenced based on inflight requests in the data
> table.
> Compaction in metadata is triggered only if there are no inflight requests in the
> data table. This might cause a liveness problem since, for very large
> deployments, we could have either compaction or clustering always in
> progress. So, we should see how we can relax this constraint.
>  
> Proposal to remove this dependency:
> With recent addition of spurious deletes config, we can actually get away 
> with this. 
> As of now, we have 3 inter linked nuances.
>  - Compaction in metadata may not kick in if there are any inflight
> operations in the data table.
>  - Rollback, when applied to the metadata table, has a dependency on the last
> compaction instant in the metadata table. We might even throw an exception if
> the instant being rolled back is < the latest metadata compaction instant time.
>  - Archival in the data table is fenced by the latest compaction in the metadata table.
>  
> So, just in case the data timeline has any dangling inflight operation (let's say
> someone tried clustering, killed it midway, and never attempted it again),
> metadata compaction will never kick in at all. I need to check what
> archival does for such inflight operations in the data table, though, when it
> tries to archive nearby commits.
>  
> So, with spurious deletes support which we added recently, all these can be 
> much simplified. 
> Whenever we want to apply a rollback commit, we don't need to take different
> actions based on whether the commit being rolled back was already committed to
> the metadata table or not. Just go ahead and apply the rollback. Merging of
> metadata payload records will take care of this. If the commit was already
> synced, the final merged payload may not have spurious deletes. If the commit
> being rolled back was never committed to metadata, the final merged payload may
> have some spurious deletes, which we can ignore.
> With this, compaction in metadata does not need to have any dependency on 
> inflight operations in data table. 
> And we can loosen up the dependency of archival in data table on metadata 
> table compaction as well. 
> So, in summary, all the 3 dependencies quoted above will be moot if we go 
> with this approach. Archival in data table does not have any dependency on 
> metadata table compaction. Rollback when being applied to metadata table does 
> not care about last metadata table compaction. Compaction in metadata table 
> can proceed even if there are inflight operations in data table. 
>  
> Especially our logic to apply rollback metadata to metadata table will become 
> a lot simpler and is easy to reason about. 
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2388) Add test nodes for Spark SQL in integration test suite

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-2388:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Add test nodes for Spark SQL in integration test suite
> --
>
> Key: HUDI-2388
> URL: https://issues.apache.org/jira/browse/HUDI-2388
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing, tests-ci
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1101) Decouple Hive dependencies from hudi-spark

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-1101:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Decouple Hive dependencies from hudi-spark
> --
>
> Key: HUDI-1101
> URL: https://issues.apache.org/jira/browse/HUDI-1101
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Yanjia Gary Li
>Priority: Critical
> Fix For: 0.14.0
>
>
> We have the Hive sync tool in both the hudi-spark and hudi-utilities modules. This
> might cause dependency conflicts when the user doesn't use Hive at all. We could
> move all the Hive sync related methods to the hudi-hive-sync module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6127) Flink Hudi Support Commit on empty batch

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-6127:

Issue Type: Improvement  (was: New Feature)

> Flink Hudi Support Commit on empty batch 
> -
>
> Key: HUDI-6127
> URL: https://issues.apache.org/jira/browse/HUDI-6127
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Qijun Fu
>Assignee: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> In the Flink multi-writer scenario, if one of the writers has no data, it will
> keep its inflight instant in the timeline. But incremental clean and archive
> will be blocked by the oldest inflight commit in the timeline.
> So, to let clean and archive move forward, we should support starting a
> new instant even when there is no data in the last batch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5517) HoodieTimeline support filter instants by state transition time

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5517:

Fix Version/s: (was: 0.13.1)

> HoodieTimeline support filter instants by state transition time
> ---
>
> Key: HUDI-5517
> URL: https://issues.apache.org/jira/browse/HUDI-5517
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: core, incremental-query
>Reporter: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The Hudi timeline can actually miss some instants when incrementally pulling from
> an upstream Hudi table that is written by several writers.
> For example, say we have 2 writers writing data to the Hudi table, and the
> last successful incremental pull ended at timestamp 001.
> w1 is writing 002 and w2 is writing 003; if w2 finishes earlier than w1,
> then the incremental pull's end timestamp will be updated to 003, and
> w1's commit 002 will be skipped since its instant time is earlier than w2's.
> We actually need to use the commit end time (state transition time) to filter the
> commits when using incremental pulling. Since w2's state transition time is
> earlier than w1's, w1's data won't be filtered out.
> This relates to HUDI-1623, but instead of adding the end time to the end of each
> commit, it uses `FileStatus.getModificationTime` to represent the end
> time.
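A small sketch of where the state transition time would come from under this proposal, assuming the Hadoop FileSystem API (the commit file name below is just an example):
{code:java}
// The completed-instant file's modification time approximates when the commit
// actually finished (its state transition time), independent of the instant time.
FileSystem fs = FileSystem.get(hadoopConf);
FileStatus status = fs.getFileStatus(new Path(basePath + "/.hoodie/002.commit"));
long stateTransitionTime = status.getModificationTime();
{code}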



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6091) Add Java 11 and 17 to bundle validation image

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-6091:

Issue Type: Improvement  (was: New Feature)

> Add Java 11 and 17 to bundle validation image
> -
>
> Key: HUDI-6091
> URL: https://issues.apache.org/jira/browse/HUDI-6091
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5941) Support savepoint CALL procedure with table base path

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5941:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support savepoint CALL procedure with table base path
> -
>
> Key: HUDI-5941
> URL: https://issues.apache.org/jira/browse/HUDI-5941
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Related GH issue: https://github.com/apache/hudi/issues/7589



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6176) Fix flaky test testArchivalWithMultiWriters

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-6176:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix flaky test testArchivalWithMultiWriters
> ---
>
> Key: HUDI-6176
> URL: https://issues.apache.org/jira/browse/HUDI-6176
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> {code:java}
> [ERROR] testArchivalWithMultiWriters{boolean}[1]  Time elapsed: 68.25 s  <<< 
> ERROR!
> 2023-04-30T03:49:36.0590893Z java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieCommitException: Failed to archive commits
> 2023-04-30T03:49:36.0591382Z     at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2023-04-30T03:49:36.0591833Z     at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2023-04-30T03:49:36.0592347Z     at 
> org.apache.hudi.io.TestHoodieTimelineArchiver.testArchivalWithMultiWriters(TestHoodieTimelineArchiver.java:683)
> 2023-04-30T03:49:36.0592661Z     at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2023-04-30T03:49:36.0592916Z     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2023-04-30T03:49:36.0593587Z     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2023-04-30T03:49:36.0593883Z     at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2023-04-30T03:49:36.0594151Z     at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2023-04-30T03:49:36.0594478Z     at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2023-04-30T03:49:36.0594853Z     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2023-04-30T03:49:36.0595215Z     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2023-04-30T03:49:36.0595564Z     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
> 2023-04-30T03:49:36.0595934Z     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
> 2023-04-30T03:49:36.0596316Z     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2023-04-30T03:49:36.0596990Z     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2023-04-30T03:49:36.0597381Z     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2023-04-30T03:49:36.0597760Z     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2023-04-30T03:49:36.0598140Z     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2023-04-30T03:49:36.0598520Z     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2023-04-30T03:49:36.0598866Z     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 2023-04-30T03:49:36.0599178Z     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
> 2023-04-30T03:49:36.0599556Z     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
> 2023-04-30T03:49:36.0599941Z     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> 2023-04-30T03:49:36.0600300Z     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
> 2023-04-30T03:49:36.0600678Z     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
> 2023-04-30T03:49:36.0601041Z     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
> 2023-04-30T03:49:36.0601398Z     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
> 2023-04-30T03:49:36.0601770Z     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> 2023-04-30T03:49:36.0602145Z     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
> 2023-04-30T03:49:36.0602470Z 

[jira] [Updated] (HUDI-6138) HoodieAvroRecord - Fix Option get for empty values

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-6138:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> HoodieAvroRecord - Fix Option get for empty values  
> 
>
> Key: HUDI-6138
> URL: https://issues.apache.org/jira/browse/HUDI-6138
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Details at - [https://github.com/apache/hudi/issues/8278]
> Check whether the option is empty before calling get.
>  
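A minimal sketch of the guard being asked for, shown with java.util.Optional for illustration (Hudi's own org.apache.hudi.common.util.Option exposes the same isPresent()/get() pair):
{code:java}
import java.util.Optional;

class OptionGuardExample {
  // Check emptiness before dereferencing instead of calling get() unconditionally,
  // which throws when the value is empty (e.g. a payload that resolves to nothing).
  static String describe(Optional<String> value) {
    if (!value.isPresent()) {
      return "<empty>";
    }
    return value.get();
  }
}
{code}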



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6061) NPE with nullable MapType and new hudi merger

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-6061:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> NPE with nullable MapType and new hudi merger
> -
>
> Key: HUDI-6061
> URL: https://issues.apache.org/jira/browse/HUDI-6061
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: nicolas paris
>Priority: Major
> Fix For: 0.14.0
>
>
> In 0.13.0, when dealing with null map values during an upsert with the new
> Hudi merger API, a NullPointerException is raised. AFAIK, it happens when the two
> MapTypes contain nulls in different manners.
>  
> See the [issue|https://github.com/apache/hudi/issues/8431] for details.
> See the [PR|https://github.com/apache/hudi/pull/8432] for details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5904) support more than one update actions in merge into table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5904:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> support more than one update actions in merge into table
> 
>
> Key: HUDI-5904
> URL: https://issues.apache.org/jira/browse/HUDI-5904
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.12.1, 0.12.2, 0.13.0
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> spark.sql(
> s"""
> |merge into hudi_cow_pt_tbl as target
> |using (
> | select id, name, data, country, ts from inc_table
> |) source
> |on source.id = target.id
> |when matched and source.data > target.data then
> |update set target.data = source.data, target.ts = source.ts
> |when matched and source.data = 6 then
> |update set target.data = source.data, target.ts = source.ts
> |when not matched then
> |insert *
> |""".stripMargin)
>  
> When we execute the SQL above, it is forbidden. But most business scenarios need
> more than one update action in practice.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5890) Fix build failure of asf-site branch

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5890:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix build failure of asf-site branch
> 
>
> Key: HUDI-5890
> URL: https://issues.apache.org/jira/browse/HUDI-5890
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> [https://github.com/apache/hudi/pull/8116/checks]
>  
> {code:java}
> Please check the pages of your site in the list below, and make sure you 
> don't reference any path that does not exist.
> Note: it's possible to ignore broken links with the 'onBrokenLinks' 
> Docusaurus configuration, and let the build pass.Exhaustive list of all 
> broken links found:- On source page path = /docs/querying_data:
>    -> linking to /docs/next/query_engine_setup#redshift-spectrum (resolved 
> as: /docs/next/query_engine_setup)
>    -> linking to /docs/next/query_engine_setup#doris (resolved as: 
> /docs/next/query_engine_setup)
>    -> linking to /docs/next/query_engine_setup#starrocks (resolved as: 
> /docs/next/query_engine_setup)    at reportMessage 
> (/home/runner/work/hudi/hudi/website/node_modules/@docusaurus/utils/lib/index.js:306:19)
>     at handleBrokenLinks 
> (/home/runner/work/hudi/hudi/website/node_modules/@docusaurus/core/lib/server/brokenLinks.js:138:35)
>     at async buildLocale 
> (/home/runner/work/hudi/hudi/website/node_modules/@docusaurus/core/lib/commands/build.js:155:5)
>     at async tryToBuildLocale 
> (/home/runner/work/hudi/hudi/website/node_modules/@docusaurus/core/lib/commands/build.js:33:20)
>     at async mapAsyncSequencial 
> (/home/runner/work/hudi/hudi/website/node_modules/@docusaurus/utils/lib/index.js:262:24)
>     at async build 
> (/home/runner/work/hudi/hudi/website/node_modules/@docusaurus/core/lib/commands/build.js:68:25)
> Error: Process completed with exit code 1.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6011) Hudi CLI show archived commits is broken for replace commit

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-6011:

Fix Version/s: (was: 0.13.1)

> Hudi CLI show archived commits is broken for replace commit
> ---
>
> Key: HUDI-6011
> URL: https://issues.apache.org/jira/browse/HUDI-6011
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> `show archived commits` does not handle replacecommit and results in parsing an
> empty row for display, which leads to an unsuccessful run and an exception thrown
> to users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5914) Fix for RowData class cast exception

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5914:

Fix Version/s: (was: 0.13.1)

> Fix for RowData class cast exception
> 
>
> Key: HUDI-5914
> URL: https://issues.apache.org/jira/browse/HUDI-5914
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5968) Global index update partition for MOR creating duplicates

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5968:

Fix Version/s: (was: 0.13.1)

> Global index update partition for MOR creating duplicates
> -
>
> Key: HUDI-5968
> URL: https://issues.apache.org/jira/browse/HUDI-5968
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6025) Incremental read with MOR doesn't give correct results

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-6025:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Incremental read with MOR doesn't give correct results
> --
>
> Key: HUDI-6025
> URL: https://issues.apache.org/jira/browse/HUDI-6025
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Aditya Goenka
>Assignee: Lokesh Jain
>Priority: Critical
>  Labels: data-consistency, user-support-issues
> Fix For: 0.14.0
>
>
> See - [https://github.com/apache/hudi/issues/8222]
>  
> MOR tables give incorrect results for incremental queries, while for COW
> tables we get correct results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5824) COMBINE_BEFORE_UPSERT=false option does not work for upsert

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5824:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> COMBINE_BEFORE_UPSERT=false option does not work for upsert
> ---
>
> Key: HUDI-5824
> URL: https://issues.apache.org/jira/browse/HUDI-5824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.1, 0.12.2, 0.13.0
>Reporter: kazdy
>Assignee: kazdy
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
> shouldCombine does not take into account the situation where the write 
> operation is UPSERT but COMBINE_BEFORE_UPSERT is false.
> Currently, Hudi always combines records on UPSERT, and option 
> COMBINE_BEFORE_UPSERT is not honored.
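A sketch of the intended condition (the real shouldCombine logic lives in the Scala HoodieSparkSqlWriter; the booleans below stand in for the resolved config values and are not the actual field names):
{code:java}
// Combine before write only when it is actually requested: either INSERT_DROP_DUPS
// is enabled, or the operation is UPSERT and COMBINE_BEFORE_UPSERT is true.
boolean shouldCombine = insertDropDups
    || (operation == WriteOperationType.UPSERT && combineBeforeUpsert);
{code}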



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5867) Use commons.io v2.7+ for hbase-server

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5867:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Use commons.io v2.7+ for hbase-server
> -
>
> Key: HUDI-5867
> URL: https://issues.apache.org/jira/browse/HUDI-5867
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 0.12.2, 0.13.0
>Reporter: Xingcan Cui
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.12.3, 0.14.0
>
>
> The {{hbase-server:2.4.9}} lib we use relies on {{commons.io}} v2.7+, but the 
> actual resolved version is 2.4. It causes the following exception.
> {code:java}
> java.lang.NoSuchMethodError: 'void 
> org.apache.hudi.org.apache.commons.io.IOUtils.closeQuietly(java.io.Closeable, 
> java.util.function.Consumer)'
> at 
> org.apache.hudi.org.apache.hadoop.hbase.io.hfile.HFileInfo.initTrailerAndContext(HFileInfo.java:352)
> at 
> org.apache.hudi.org.apache.hadoop.hbase.io.hfile.HFileInfo.(HFileInfo.java:124)
> at 
> org.apache.hudi.io.storage.HoodieHFileUtils.createHFileReader(HoodieHFileUtils.java:84)
> at 
> org.apache.hudi.io.storage.HoodieHFileReader.(HoodieHFileReader.java:104)
> at 
> org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.deserializeRecords(HoodieHFileDataBlock.java:168)
> at 
> org.apache.hudi.common.table.log.block.HoodieDataBlock.readRecordsFromBlockPayload(HoodieDataBlock.java:189)
> at 
> org.apache.hudi.common.table.log.block.HoodieDataBlock.getRecordIterator(HoodieDataBlock.java:147)
> at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.getRecordsIterator(AbstractHoodieLogRecordReader.java:492)
> at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:379)
> at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:467)
> at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:343)
> at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:192)
> at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:109)
> at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:102)
> at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.(HoodieMetadataMergedLogRecordReader.java:63)
> at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.(HoodieMetadataMergedLogRecordReader.java:51)
> at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader$Builder.build(HoodieMetadataMergedLogRecordReader.java:230)
> at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:515)
> at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.openReaders(HoodieBackedTableMetadata.java:428)
> at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getOrCreateReaders(HoodieBackedTableMetadata.java:413)
> at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:219)
> at java.base/java.util.HashMap.forEach(Unknown Source)
> at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:217)
> at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:148)
> at 
> org.apache.hudi.metadata.BaseTableMetadata.fetchAllFilesInPartition(BaseTableMetadata.java:323)
> at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:141)
> at 
> org.apache.hudi.metadata.HoodieMetadataFileSystemView.listPartition(HoodieMetadataFileSystemView.java:65)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$9(AbstractTableFileSystemView.java:307)
> at 
> java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(Unknown 
> Source)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:298)
> at 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView.getLatestMergedFileSlicesBeforeOrOn(AbstractTableFileSystemView.java:704)
> at 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:103)
> at 
> org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestMergedFileSlicesBeforeOrOn(PriorityBasedFileSystemView.java:188)
> at 
> 

[jira] [Updated] (HUDI-5864) Update release notes regarding the HoodieMetadataFileSystemView regression

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5864:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Update release notes regarding the HoodieMetadataFileSystemView regression
> --
>
> Key: HUDI-5864
> URL: https://issues.apache.org/jira/browse/HUDI-5864
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Relevant bug and fix: HUDI-5863



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5866) Fix unnecessary log messages during bulk insert in Spark

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5866:

Fix Version/s: (was: 0.13.1)

> Fix unnecessary log messages during bulk insert in Spark
> 
>
> Key: HUDI-5866
> URL: https://issues.apache.org/jira/browse/HUDI-5866
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Akira Ajisaka
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> HUDI-5544 fixed the excessive log message issue in Flink, but it's not fixed in
> Spark. We need to make a similar fix in hudi-spark-client:
> https://github.com/apache/hudi/blob/47356a57930687c1bdfa66d1a62421d8a5fc0b29/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BulkInsertDataInternalWriterHelper.java#L147



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] codope commented on a diff in pull request #8775: [HUDI-5584] Metasync update props when changed

2023-05-22 Thread via GitHub


codope commented on code in PR #8775:
URL: https://github.com/apache/hudi/pull/8775#discussion_r1201549502


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieMetaSyncOperations.java:
##
@@ -186,16 +188,20 @@ default void updateLastCommitTimeSynced(String tableName) 
{
 
   /**
* Update the table properties in metastore.
+   *
+   * @return true if properties updated.
*/
-  default void updateTableProperties(String tableName, Map<String, String> tableProperties) {
-
+  default boolean updateTableProperties(String tableName, Map<String, String> tableProperties) {
+    return false;
   }
 
   /**
* Update the SerDe properties in metastore.
+   *
+   * @return true if properties updated.
*/
-  default void updateSerdeProperties(String tableName, Map<String, String> serdeProperties) {
-
+  default boolean updateSerdeProperties(String tableName, Map<String, String> serdeProperties, boolean useRealtimeFormat) {

Review Comment:
   Got it. Can you please create a JIRA to track? We can work on it sometime 
later after 0.14 critical items.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5760) Make sure DeleteBlock doesn't use Kryo for serialization to disk

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5760:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure DeleteBlock doesn't use Kryo for serialization to disk
> 
>
> Key: HUDI-5760
> URL: https://issues.apache.org/jira/browse/HUDI-5760
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 1.0.0
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> The problem is that the serialization of `HoodieDeleteBlock` is generated
> dynamically by Kryo, so it could change whenever any class comprising it
> changes.
> We've been bitten by this already twice:
> HUDI-5758
> HUDI-4959
>  
> Instead, anything that is persisted on disk has to be serialized using
> hard-coded methods (the same way HoodieDataBlocks are serialized).
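A minimal sketch of what "hard-coded" serialization means here, using only the JDK (the field layout is purely illustrative and is not the actual HoodieDeleteBlock format):
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

class DeleteBlockSerDeSketch {
  static final int FORMAT_VERSION = 1;

  // An explicit, versioned layout: the on-disk bytes never depend on a
  // serializer's dynamically generated class schema.
  static byte[] serialize(List<String> keysToDelete) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(baos)) {
      out.writeInt(FORMAT_VERSION);
      out.writeInt(keysToDelete.size());
      for (String key : keysToDelete) {
        out.writeUTF(key);
      }
    }
    return baos.toByteArray();
  }
}
{code}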



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5807) HoodieSparkParquetReader is not appending partition-path values

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5807:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> HoodieSparkParquetReader is not appending partition-path values
> ---
>
> Key: HUDI-5807
> URL: https://issues.apache.org/jira/browse/HUDI-5807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> The current implementation of HoodieSparkParquetReader doesn't support the case
> when "hoodie.datasource.write.drop.partition.columns" is set to true.
> In that case, partition-path values are expected to be parsed from the
> partition path and injected within the file reader (this is the behavior of
> Spark's own readers).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5759) Hudi do not support add column on mor table with log

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5759:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Hudi do not support add column on mor table with log
> 
>
> Key: HUDI-5759
> URL: https://issues.apache.org/jira/browse/HUDI-5759
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Qijun Fu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> We tested the following SQL statements on the latest master branch:
> ```sql
> create table h0 (
>   id int,
>   name string,
>   price double,
>   ts long
> ) using hudi
>  options (
>   primaryKey ='id',
>   type = 'mor',
>   preCombineField = 'ts'
>  )
>  partitioned by(ts)
>  location '/tmp/h0';
> insert into h0 select 1, 'a1', 10, 1000;
> update h0 set price = 20 where id = 1;
> alter table h0 add column new_col1 int;
> update h0 set price = 22 where id = 1;
> select * from h0;
> ```
> We found that the table can't be read after the add column and the subsequent update.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5769) Partitions created by Async indexer could be deleted by regular writers

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5769:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Partitions created by Async indexer could be deleted by regular writers
> ---
>
> Key: HUDI-5769
> URL: https://issues.apache.org/jira/browse/HUDI-5769
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> In the regular writer we have a flow where, if we detect that some MDT partition is
> not enabled in the configs but the partition is found in storage and listed among the
> table config's fully built-out partitions, Hudi deletes the metadata partition on
> the assumption that the user wishes to disable it.
> But this does not sit well with the async indexer.
>  
> Process 1 -> Deltastreamer runs continuously.
> No metadata configs are set,
> which means the default value for metadata enable = true, and hence the "files"
> partition will be instantiated inline on the first commit.
> No value is set for col stats enable, so no action will be taken.
>  
> Process 2: the user starts HoodieIndexer for the col stats partition.
> Once the indexer completes, tableConfig will add "col stats" to the fully
> built-out metadata partitions.
>  
> Meanwhile in process 1, when the Deltastreamer goes to the next write, it will detect
> that col stats isn't enabled (default value as per the code), but tableConfig shows
> that col stats is fully built out, and hence it decides to delete the col stats
> partition and update the tableConfig.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5737) Fix Deletes issued without any prior commits

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5737:

Fix Version/s: (was: 0.13.1)

> Fix Deletes issued without any prior commits
> 
>
> Key: HUDI-5737
> URL: https://issues.apache.org/jira/browse/HUDI-5737
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5733) TestHoodieDeltaStreamer.testHoodieIndexer failure

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5733:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> TestHoodieDeltaStreamer.testHoodieIndexer failure
> -
>
> Key: HUDI-5733
> URL: https://issues.apache.org/jira/browse/HUDI-5733
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer, index, tests-ci
>Reporter: Jonathan Vexler
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.14.0
>
>
> Sometimes it fails because a rollback occurs in the metadata table and rolls
> back a commit while the Deltastreamer tries to transition the instant from
> requested to inflight. This fails because the requested file has been removed
> from the timeline.
>  
> Here is an example of a failing [test stack 
> trace|https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15021=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0=746585d8-b50a-55c3-26c5-517d93af9934=30526]
> {code:java}
> Caused by: java.lang.IllegalArgumentException
>   at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
>   at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:633)
>   at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionRequestedToInflight(HoodieActiveTimeline.java:698)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.saveWorkloadProfileMetadataToInflight(BaseCommitActionExecutor.java:147)
>   at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:172)
>   at 
> org.apache.hudi.table.action.deltacommit.SparkUpsertPreppedDeltaCommitActionExecutor.execute(SparkUpsertPreppedDeltaCommitActionExecutor.java:44)
>   at 
> org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsertPrepped(HoodieSparkMergeOnReadTable.java:111)
>   at 
> org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsertPrepped(HoodieSparkMergeOnReadTable.java:80)
>   at 
> org.apache.hudi.client.SparkRDDWriteClient.upsertPreppedRecords(SparkRDDWriteClient.java:154)
>   at 
> org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:172)
>   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:823)
>   at 
> org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:890)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.lambda$writeTableMetadata$1(BaseHoodieWriteClient.java:355)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:355)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:282)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:233)
>   at 
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:102)
>   at 
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:61)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:199)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:713)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:395)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$1(HoodieDeltaStreamer.java:716)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5731) Fix com.google.common classes still being relocated in Hudi Spark bundle

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5731:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix com.google.common classes still being relocated in Hudi Spark bundle
> 
>
> Key: HUDI-5731
> URL: https://issues.apache.org/jira/browse/HUDI-5731
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: dzcxzl
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> As originally reported in:
> [https://github.com/apache/hudi/pull/6240#issuecomment-1420149952]
>  
> The issue has been that after the removal of Guava we still kept the 
> following relocation config in the MR/Spark bundles:
> {code:java}
> <relocation>
>   <pattern>com.google.common.</pattern>
>   <shadedPattern>org.apache.hudi.com.google.common.</shadedPattern>
> </relocation>
> {code}
> Which in turn meant that references to Guava from any class would still be 
> rewritten to the shaded package, even though Hudi isn't packaging Guava 
> anymore. This might result in the following exception:
> {code:java}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/google/common/base/Preconditions
>   at 
> org.apache.curator.ensemble.fixed.FixedEnsembleProvider.(FixedEnsembleProvider.java:39)
>   at 
> org.apache.curator.framework.CuratorFrameworkFactory$Builder.connectString(CuratorFrameworkFactory.java:193)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperClientProvider$.buildZookeeperClient(ZookeeperClientProvider.scala:62)
>   at 
> org.apache.kyuubi.ha.client.zookeeper.ZookeeperDiscoveryClient.(ZookeeperDiscoveryClient.scala:65)
>   ... 45 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan commented on a diff in pull request #8775: [HUDI-5584] Metasync update props when changed

2023-05-22 Thread via GitHub


xushiyan commented on code in PR #8775:
URL: https://github.com/apache/hudi/pull/8775#discussion_r1201547833


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieMetaSyncOperations.java:
##
@@ -186,16 +188,20 @@ default void updateLastCommitTimeSynced(String tableName) {
 
   /**
    * Update the table properties in metastore.
+   *
+   * @return true if properties updated.
    */
-  default void updateTableProperties(String tableName, Map<String, String> tableProperties) {
-
+  default boolean updateTableProperties(String tableName, Map<String, String> tableProperties) {
+    return false;
   }
 
   /**
    * Update the SerDe properties in metastore.
+   *
+   * @return true if properties updated.
    */
-  default void updateSerdeProperties(String tableName, Map<String, String> serdeProperties) {
-
+  default boolean updateSerdeProperties(String tableName, Map<String, String> serdeProperties, boolean useRealtimeFormat) {

Review Comment:
   yea all metasync api should be standardized with a return flag or a pojo 
containing sync result. it's just not that critical to change right now.
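
For illustration, a minimal sketch of the kind of sync-result POJO being suggested
(all names here are hypothetical and not part of this PR):

  // Hypothetical result object a standardized metasync API could return.
  public class MetaSyncResult {
    private final boolean propertiesUpdated;
    private final boolean serdePropertiesUpdated;

    public MetaSyncResult(boolean propertiesUpdated, boolean serdePropertiesUpdated) {
      this.propertiesUpdated = propertiesUpdated;
      this.serdePropertiesUpdated = serdePropertiesUpdated;
    }

    // True if any metastore property changed during the sync.
    public boolean anyChange() {
      return propertiesUpdated || serdePropertiesUpdated;
    }
  }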



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5670) Server-based markers creation times out

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5670:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Server-based markers creation times out
> ---
>
> Key: HUDI-5670
> URL: https://issues.apache.org/jira/browse/HUDI-5670
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Running write benchmarks w/ SparkRecordMerger enabled, we hit this 
> SocketTimeoutException trying to create markers:
> {code:java}
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
>         at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
>         at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1535)
>         at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1445)
>         at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1509)
>         at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1332)
>         at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 22 more
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F10%2F29/69adadb4-d7ae-4b30-8af1-92ffa38be7df-0_1362-352-97811_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:84)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 22 more
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> partition=2020%2F11%2F29/8e3045e1-6de0-492e-bc34-85e2b8502767-0_1207-352-97656_20230201020238055.parquet.marker.CREATE
> Read timed out
>         at 
> org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:73)
>         at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
>         ... 24 more
> Caused by: 

[GitHub] [hudi] bvaradar commented on a diff in pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark

2023-05-22 Thread via GitHub


bvaradar commented on code in PR #8303:
URL: https://github.com/apache/hudi/pull/8303#discussion_r1201546844


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBootstrapRelation.scala:
##
@@ -188,11 +188,23 @@ case class HoodieBootstrapRelation(override val 
sqlContext: SQLContext,
 
   override def updatePrunedDataSchema(prunedSchema: StructType): 
HoodieBootstrapRelation =
 this.copy(prunedDataSchema = Some(prunedSchema))
+
+  def toHadoopFsRelation: HadoopFsRelation = {
+    HadoopFsRelation(
+      location = fileIndex,
+      partitionSchema = fileIndex.partitionSchema,
+      dataSchema = fileIndex.dataSchema,
+      bucketSpec = None,
+      fileFormat = fileFormat,
+      optParams)(sparkSession)
+  }
 }
 
 
 object HoodieBootstrapRelation {
 
+  val USE_FAST_BOOTSTRAP_READ = 
"hoodie.bootstrap.relation.use.fast.bootstrap.read"

Review Comment:
   @jonvex : Can we just use one config hoodie.bootstrap.data.queries.only and 
get away with hoodie.bootstrap.relation.use.fast.bootstrap.read ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5711) NPE occurs when enabling metadata on table which doesn't have metadata previously

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5711:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> NPE occurs when enabling metadata on table which doesn't have metadata 
> previously
> 
>
> Key: HUDI-5711
> URL: https://issues.apache.org/jira/browse/HUDI-5711
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink, metadata
>Reporter: Danny Chen
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.14.0
>
>
> https://github.com/apache/hudi/issues/7824#issuecomment-1420170722



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan commented on a diff in pull request #8775: [HUDI-5584] Metasync update props when changed

2023-05-22 Thread via GitHub


xushiyan commented on code in PR #8775:
URL: https://github.com/apache/hudi/pull/8775#discussion_r1201544493


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java:
##
@@ -280,83 +282,87 @@ protected void syncHoodieTable(String tableName, boolean 
useRealtimeInputFormat,
   partitionsChanged = syncPartitions(tableName, writtenPartitionsSince, 
droppedPartitions);
 }
 
-boolean meetSyncConditions = schemaChanged || partitionsChanged;
+boolean meetSyncConditions = schemaChanged || propertiesChanged || 
partitionsChanged;
 if (!config.getBoolean(META_SYNC_CONDITIONAL_SYNC) || meetSyncConditions) {
   syncClient.updateLastCommitTimeSynced(tableName);
 }
 LOG.info("Sync complete for " + tableName);
   }
 
-  /**
-   * Get the latest schema from the last commit and check if its in sync with 
the hive table schema. If not, evolves the
-   * table schema.
-   *
-   * @param tableExists does table exist
-   * @param schema  extracted schema
-   */
-  private boolean syncSchema(String tableName, boolean tableExists, boolean 
useRealTimeInputFormat,
-  boolean readAsOptimized, MessageType schema) {
-// Append spark table properties & serde properties
+  private Map<String, String> getTableProperties(MessageType schema) {
     Map<String, String> tableProperties = ConfigUtils.toMap(config.getString(HIVE_TABLE_PROPERTIES));
-    Map<String, String> serdeProperties = ConfigUtils.toMap(config.getString(HIVE_TABLE_SERDE_PROPERTIES));
     if (config.getBoolean(HIVE_SYNC_AS_DATA_SOURCE_TABLE)) {
       Map<String, String> sparkTableProperties = SparkDataSourceTableUtils.getSparkTableProperties(config.getSplitStrings(META_SYNC_PARTITION_FIELDS),
           config.getStringOrDefault(META_SYNC_SPARK_VERSION), config.getIntOrDefault(HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD), schema);
-      Map<String, String> sparkSerdeProperties = SparkDataSourceTableUtils.getSparkSerdeProperties(readAsOptimized, config.getString(META_SYNC_BASE_PATH));
       tableProperties.putAll(sparkTableProperties);
+    }
+    return tableProperties;
+  }
+
+  private Map<String, String> getSerdeProperties(boolean readAsOptimized) {
+    Map<String, String> serdeProperties = ConfigUtils.toMap(config.getString(HIVE_TABLE_SERDE_PROPERTIES));
+    if (config.getBoolean(HIVE_SYNC_AS_DATA_SOURCE_TABLE)) {
+      Map<String, String> sparkSerdeProperties = SparkDataSourceTableUtils.getSparkSerdeProperties(readAsOptimized, config.getString(META_SYNC_BASE_PATH));
       serdeProperties.putAll(sparkSerdeProperties);
     }
-boolean schemaChanged = false;
-// Check and sync schema
-if (!tableExists) {
-  LOG.info("Hive table {} is not found. Creating it with schema {}.", 
tableName, schema);
-  HoodieFileFormat baseFileFormat = 
HoodieFileFormat.valueOf(config.getStringOrDefault(META_SYNC_BASE_FILE_FORMAT).toUpperCase());
-  String inputFormatClassName = 
HoodieInputFormatUtils.getInputFormatClassName(baseFileFormat, 
useRealTimeInputFormat);
-
-  if (baseFileFormat.equals(HoodieFileFormat.PARQUET) && 
config.getBooleanOrDefault(HIVE_USE_PRE_APACHE_INPUT_FORMAT)) {
-// Parquet input format had an InputFormat class visible under the old 
naming scheme.
-inputFormatClassName = useRealTimeInputFormat
-? 
com.uber.hoodie.hadoop.realtime.HoodieRealtimeInputFormat.class.getName()
-: com.uber.hoodie.hadoop.HoodieInputFormat.class.getName();
-  }
+return serdeProperties;
+  }
 
-  String outputFormatClassName = 
HoodieInputFormatUtils.getOutputFormatClassName(baseFileFormat);
-  String serDeFormatClassName = 
HoodieInputFormatUtils.getSerDeClassName(baseFileFormat);
+  private void syncFirstTime(String tableName, boolean useRealTimeInputFormat, 
boolean readAsOptimized, MessageType schema) {
+LOG.info("Sync table {} for the first time.", tableName);
+HoodieFileFormat baseFileFormat = 
HoodieFileFormat.valueOf(config.getStringOrDefault(META_SYNC_BASE_FILE_FORMAT).toUpperCase());
+String inputFormatClassName = getInputFormatClassName(baseFileFormat, 
useRealTimeInputFormat, 
config.getBooleanOrDefault(HIVE_USE_PRE_APACHE_INPUT_FORMAT));
+String outputFormatClassName = getOutputFormatClassName(baseFileFormat);
+String serDeFormatClassName = getSerDeClassName(baseFileFormat);
+    Map<String, String> serdeProperties = getSerdeProperties(readAsOptimized);
+    Map<String, String> tableProperties = getTableProperties(schema);
+
+// Custom serde will not work with ALTER TABLE REPLACE COLUMNS
+// 
https://github.com/apache/hive/blob/release-1.1.0/ql/src/java/org/apache/hadoop/hive
+// /ql/exec/DDLTask.java#L3488
+syncClient.createTable(tableName, schema, inputFormatClassName,
+outputFormatClassName, serDeFormatClassName, serdeProperties, 
tableProperties);
+  }
 
-  // Custom serde will not work with ALTER TABLE REPLACE COLUMNS
-  // 
https://github.com/apache/hive/blob/release-1.1.0/ql/src/java/org/apache/hadoop/hive
-  // /ql/exec/DDLTask.java#L3488
-  syncClient.createTable(tableName, schema, 

[jira] [Updated] (HUDI-5697) Spark SQL re-lists Hudi table after every SQL operations

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5697:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Spark SQL re-lists Hudi table after every SQL operations
> 
>
> Key: HUDI-5697
> URL: https://issues.apache.org/jira/browse/HUDI-5697
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, after most DML operations in Spark SQL, Hudi invokes 
> `Catalog.refreshTable`.
> Prior to Spark 3.2, this was essentially doing the following:
>  # Invalidating the relation cache (forcing the relation to be re-resolved 
> next time, creating a new FileIndex, listing files, etc.)
>  # Triggering cascading invalidation (re-caching) of the cached data (in 
> CacheManager)
> As of Spark 3.2 it now additionally does `LogicalRelation.refresh` for ALL 
> tables (previously this was only done for Temporary Views), therefore 
> entailing the whole table being re-listed again by triggering 
> `FileIndex.refresh`, which might be a costly operation.
>  
> We should revert to the preceding behavior from Spark 3.1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5688) schema field of EmptyRelation subtype of BaseRelation should not be null

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5688:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> schema field of EmptyRelation subtype of BaseRelation should not be null
> 
>
> Key: HUDI-5688
> URL: https://issues.apache.org/jira/browse/HUDI-5688
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Pramod Biligiri
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 
> 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, 
> Main.java, pom.xml
>
>
> If there are no completed instants in the table, and there is no user defined 
> schema for it as well (as represented by the userSpecifiedSchema field in 
> DataSource.scala), then the EmptyRelation returned by 
> DefaultSource.createRelation sets schema of the EmptyRelation to null. This 
> breaks the contract of Spark's BaseRelation, where the schema is a StructType 
> but is not expected to be null.
> Module versions: current apache-hudi master (commit hash 
> abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
> The following Hudi session reproduces the above issue:
> spark.read.format("hudi")
>     .option("hoodie.datasource.query.type", "incremental")
>     .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
>   at 
> org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
>   ... 50 elided  
> Find attached a few screenshots which show the code flow and the buggy state 
> of the variables. Also find attached a Java file and pom.xml that can be used 
> to reproduce the same (sorry, don't have a deanonymized table to share yet).
> The bug seems to have been introduced in this particular PR change: 
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file has happened in this particular Jira 
> (https://issues.apache.org/jira/browse/HUDI-4363) and PR 
> (https://github.com/apache/hudi/pull/6046) respectively.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5716) Fix Partitioners to avoid assuming that parallelism is always present

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5716:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix Partitioners to avoid assuming that parallelism is always present
> -
>
> Key: HUDI-5716
> URL: https://issues.apache.org/jira/browse/HUDI-5716
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, `Partitioner` impls assume that there's always going to be some 
> parallelism level.
> This has not been an issue previously for the following reasons:
>  * RDDs always have an inherent "parallelism" level defined as the # of 
> partitions they operate upon. However, for Dataset (SparkPlan) that's not 
> necessarily the case (some SparkPlans might not report their output 
> partitioning).
>  * Additionally, we previously had a default parallelism level set in our 
> configs, which meant that we'd prefer it over the actual incoming dataset.
> However, since we've recently removed the default parallelism value from our 
> configs, we now need to fix the Partitioners to make sure they are not 
> assuming that parallelism is always going to be present. A defensive fallback 
> is sketched below.
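> For illustration, a minimal sketch of the kind of defensive fallback the 
> Partitioners need (names here are hypothetical, not the actual Hudi API): 
> {code:java}
> // Resolve the target parallelism: prefer an explicitly configured value,
> // otherwise fall back to the partitioning of the incoming data.
> static int resolveParallelism(Integer configuredParallelism, int inputPartitions) {
>   if (configuredParallelism != null && configuredParallelism > 0) {
>     return configuredParallelism;
>   }
>   // inputPartitions may be 0 for SparkPlans that do not report their output
>   // partitioning; guard against that case as well.
>   return Math.max(inputPartitions, 1);
> }
> {code}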



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5609) Hudi table not queryable by SQL on Databricks Spark

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5609:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Hudi table not queryable by SQL on Databricks Spark
> ---
>
> Key: HUDI-5609
> URL: https://issues.apache.org/jira/browse/HUDI-5609
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Customer: I’ve tried this with 0.12.2 and still receive the same error. Does 
> the table format version also need to be updated? i.e. we’re writing with 
> Hudi 0.11.1 using EMR but reading from Databricks using Hudi 0.12.2 and Spark 
> 3.3.
>  
> What has been tried so far on 0.12.2:
> 1. Spark SQL: just tried Spark SQL and it doesn't work (different issue)
> SET hoodie.file.index.enable=false
> select count(*) from validated_sales;
> returns 0 count but no errors
> 2. When running via pyspark
> %python
> df = spark.read.format('hudi')\
> .load('s3:///validated_sales/*/*/*')
> df.count()
> all is good with 0.12.2 Hudi and Databricks 11.3 (spark 3.3).
> 3. Without the wildcard in pyspark
> %python
> df = spark.read.format('hudi')\
> .load('s3:///validated_sales')
> df.count()
> count = 0
> 4. Without wildcard but with recursive option set in pyspark
> %python
> df = spark.read.format('hudi')\
> .option("recursiveFileLookup","true")\
> .load('s3:///validated_sales')
> df.count()
> count = 250k 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5619) Fix HoodieTableFileSystemView inefficient latest base-file lookups

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5619:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix HoodieTableFileSystemView inefficient latest base-file lookups
> --
>
> Key: HUDI-5619
> URL: https://issues.apache.org/jira/browse/HUDI-5619
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Currently, HoodieTableFileSystemView, when looking up the latest base-file in 
> a single file-group, [has to process the whole 
> partition|https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java#L584],
>  which is obviously not very efficient.
> Instead, we should be able to look up and process just the file-group in 
> question.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5597) Deltastreamer ingestion fails when consistent hashing index is used

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5597:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Deltastreamer ingestion fails when consistent hashing index is used
> ---
>
> Key: HUDI-5597
> URL: https://issues.apache.org/jira/browse/HUDI-5597
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.13.0
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> I tested the consistent hashing index w/ a deltastreamer pipeline, but it 
> failed w/ the below exception. The same pipeline works w/o any issues for the 
> default index. 
>  
> Additional configs I used: 
> hoodie.index.type=BUCKET
> hoodie.index.bucket.engine=CONSISTENT_HASHING
> hoodie.bucket.index.num.buckets=4
> hoodie.compact.inline.max.delta.commits=2
>  
> I have some parquet data in a dir. I am starting a deltastreamer w/ a 
> ParquetDFS source for a MOR table, setting the additional configs as shown 
> above.
> I did make some minor fixes to my branch (compared to master), but that's 
> only to enable inline compaction w/ deltastreamer continuous mode. In 
> general, only async compaction is allowed w/ deltastreamer continuous. I just 
> wanted to test inline for now; apart from that, I am using latest master to 
> test. 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 100.0 failed 1 times, most recent failure: Lost task 1.0 in 
> stage 100.0 (TID 176, localhost, executor driver): 
> org.apache.hudi.exception.HoodieException: Unsupported Operation Exception
>   at 
> org.apache.hudi.common.util.collection.BitCaskDiskMap.values(BitCaskDiskMap.java:303)
>   at 
> org.apache.hudi.common.util.collection.ExternalSpillableMap.values(ExternalSpillableMap.java:275)
>   at java.util.Collections$UnmodifiableMap.values(Collections.java:1487)
>   at 
> org.apache.hudi.io.HoodieMergeHandle.writeIncomingRecords(HoodieMergeHandle.java:397)
>   at 
> org.apache.hudi.io.HoodieMergeHandle.close(HoodieMergeHandle.java:409)
>   at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:168)
>   at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdateInternal(HoodieSparkCopyOnWriteTable.java:224)
>   at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.handleUpdate(HoodieSparkCopyOnWriteTable.java:215)
>   at 
> org.apache.hudi.table.action.compact.CompactionExecutionHelper.writeFileAndGetWriteStats(CompactionExecutionHelper.java:64)
>   at 
> org.apache.hudi.table.action.compact.HoodieCompactor.compact(HoodieCompactor.java:231)
>   at 
> org.apache.hudi.table.action.compact.HoodieCompactor.lambda$compact$9cd4b1be$1(HoodieCompactor.java:129)
>   at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at 

[jira] [Updated] (HUDI-5641) Streamline Advanced Schema Evolution flow

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5641:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Streamline Advanced Schema Evolution flow
> -
>
> Key: HUDI-5641
> URL: https://issues.apache.org/jira/browse/HUDI-5641
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Currently, Schema Evolution is not always applied consistently and is 
> sometimes re-applied multiple times, causing issues for HoodieSparkRecord 
> implementations (which are optimized to reuse the underlying buffer):
>  # HoodieMergeHelper would apply the SE transformer, then
>  # HoodieMergeHandle would run rewriteRecordWithNewSchema again



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5602) Troubleshoot METADATA_ONLY bootstrapped table not being able to read back partition path

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5602:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Troubleshoot METADATA_ONLY bootstrapped table not being able to read back 
> partition path
> 
>
> Key: HUDI-5602
> URL: https://issues.apache.org/jira/browse/HUDI-5602
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.2
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> In [https://github.com/apache/hudi/pull/7461] after enabling matching of the 
> whole payload rather than just record counts, it's been discovered that Hudi 
> isn't able to read back partition-path after running METADATA_ONLY bootstrap, 
> leading to a test failure (it's annotated w/ the TODO and this Jira in the 
> test suite)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5608) Support decimals w/ precision > 30 in Column Stats

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5608:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support decimals w/ precision > 30 in Column Stats
> --
>
> Key: HUDI-5608
> URL: https://issues.apache.org/jira/browse/HUDI-5608
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.2
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> As reported in: [https://github.com/apache/hudi/issues/7732]
>  
> Currently we've limited the precision of supported decimals at 30, assuming 
> that this number is reasonably high to cover 99% of use-cases, but it seems 
> like there's still a demand for even larger Decimals.
> The challenge, however, is to balance the need to support longer Decimals 
> against the storage space we have to provision for each one of them.
>  
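> For a rough sense of the storage cost involved, here is a back-of-the-envelope 
> sketch (not Hudi code; it assumes the usual fixed-length byte encoding of the 
> unscaled decimal value): 
> {code:java}
> // Minimum bytes needed to hold an unscaled decimal of a given precision:
> // ceil((precision * log2(10) + 1 sign bit) / 8)
> static int bytesForPrecision(int precision) {
>   return (int) Math.ceil((precision * Math.log(10) / Math.log(2) + 1) / 8);
> }
> // bytesForPrecision(30) == 13, bytesForPrecision(38) == 16, bytesForPrecision(76) == 32
> {code}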



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5575) Support any record key generation along w/ any partition path generation for row writer

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5575:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support any record key generation along w/ any partition path generation for 
> row writer
> ---
>
> Key: HUDI-5575
> URL: https://issues.apache.org/jira/browse/HUDI-5575
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Lokesh Jain
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> HUDI-5535 adds support for any record key generation along w/ any partition 
> path generation. It also separates record key generation and partition path 
> generation into separate interfaces.
> This jira aims to add similar support for the row writer path in Spark.
> cc [~shivnarayan] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5574) Support auto record key generation with Spark SQL

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5574:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Support auto record key generation with Spark SQL
> -
>
> Key: HUDI-5574
> URL: https://issues.apache.org/jira/browse/HUDI-5574
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Lokesh Jain
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: release-0.14.0-blocker
> Fix For: 0.14.0
>
>
> HUDI-2681 adds support for auto record key generation with Spark dataframes. 
> This Jira aims to add support for the same with Spark SQL.
> One of the changes required here, as pointed out by [~kazdy], is that 
> SQL_INSERT_MODE would need to be handled: if SQL_INSERT_MODE is set to 
> strict, the insert should fail.
> cc [~shivnarayan] 
> Essentially, based on this patch 
> ([https://github.com/apache/hudi/pull/7681]),
> we want to ensure spark-sql writes also support auto generation of record 
> keys. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5588) Fix Metadata table validator to deduce valid partitions when first commit where partition was added is failed

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5588:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix Metadata table validator to deduce valid partitions when first commit 
> where partition was added is failed
> -
>
> Key: HUDI-5588
> URL: https://issues.apache.org/jira/browse/HUDI-5588
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> Metadata validation sometimes fails due to a test code issue. 
> FS-based listing shows 0 partitions, while MDT listing shows all 100 
> partitions. It's an issue w/ the validator code.
>  
> actual timeline:
> ls -ltr tbl1/hoodie_table/.hoodie/
> total 720
> drwxr-xr-x  2 nsb staff     64 Jan 17 18:45 archived
> drwxr-xr-x  4 nsb staff    128 Jan 17 18:45 metadata
> -rw-r--r--  1 nsb staff    808 Jan 17 18:45 hoodie.properties
> -rw-r--r--  1 nsb staff   1230 Jan 17 18:45 20230117214546000.rollback.requested
> -rw-r--r--  1 nsb staff      0 Jan 17 18:45 20230117214546000.rollback.inflight
> -rw-r--r--  1 nsb staff   1414 Jan 17 18:46 20230117214546000.rollback
> -rw-r--r--  1 nsb staff   1230 Jan 17 18:47 20230117214701512.rollback.requested
> -rw-r--r--  1 nsb staff      0 Jan 17 18:47 20230117214701512.rollback.inflight
> -rw-r--r--  1 nsb staff   1414 Jan 17 18:47 20230117214701512.rollback
> -rw-r--r--  1 nsb staff  15492 Jan 17 18:48 20230117214831503.rollback.requested
> -rw-r--r--  1 nsb staff      0 Jan 17 18:48 20230117214831503.rollback.inflight
> -rw-r--r--  1 nsb staff      0 Jan 17 18:48 20230117214848714.deltacommit.requested
> -rw-r--r--  1 nsb staff  16359 Jan 17 18:48 20230117214831503.rollback
> -rw-r--r--  1 nsb staff  69698 Jan 17 18:49 20230117214848714.deltacommit.inflight
> -rw-r--r--  1 nsb staff      0 Jan 17 18:50 20230117215006714.deltacommit.requested
> -rw-r--r--  1 nsb staff  94423 Jan 17 18:50 20230117214848714.deltacommit
> -rw-r--r--  1 nsb staff 142198 Jan 17 18:50 20230117215006714.deltacommit.inflight
>  
>  
> At least there is one successful commit, 20230117214848714.deltacommit.
>  
> But our validator code checks the creation time of the partition and 
> considers it a valid partition only if that particular commit succeeded.
> {code:java}
> List<String> allPartitionPathsFromFS = FSUtils.getAllPartitionPaths(engineContext, basePath, false, cfg.assumeDatePartitioning);
> HoodieTimeline completedTimeline = metaClient.getActiveTimeline().filterCompletedInstants();
> // ignore partitions created by uncommitted ingestion.
> allPartitionPathsFromFS = allPartitionPathsFromFS.stream().parallel().filter(part -> {
>   HoodiePartitionMetadata hoodiePartitionMetadata =
>       new HoodiePartitionMetadata(metaClient.getFs(), FSUtils.getPartitionPath(basePath, part));
>   Option<String> instantOption = hoodiePartitionMetadata.readPartitionCreatedCommitTime();
>   if (instantOption.isPresent()) {
>     String instantTime = instantOption.get();
>     return completedTimeline.containsOrBeforeTimelineStarts(instantTime);
>   } else {
>     return false;
>   }
> }).collect(Collectors.toList()); {code}
>  
> we need to fix this
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5444) FileNotFound issue w/ metadata enabled

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5444:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> FileNotFound issue w/ metadata enabled
> --
>
> Key: HUDI-5444
> URL: https://issues.apache.org/jira/browse/HUDI-5444
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.14.0
>
>
> stacktrace
> {code:java}
> Caused by: java.io.FileNotFoundException: File not found: 
> gs://TBL_PATH/op_cmpny_cd=WMT.COM/order_placed_dt=2022-12-08/441e7909-6a62-45ac-b9df-dd0386574f52-0_607-17-2433_20221208132316380.parquet
>         at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1082)
>  {code}
>  
> 20221208133227028 (RB_C10)
> 20221208133227028001 MDT compaction
> 20221208132316380 (C10)
> 20221208133647230
> DT
> ║ 8   │ 20221202234515099 │ rollback │ COMPLETED │ Rolls back 2022120413756     │ 12-02 15:45:18 │ 12-02 15:45:18 │ 12-02 15:45:33 ║
> ║ 9   │ 20221208133227028 │ rollback │ COMPLETED │ Rolls back 20221208132316380 │ 12-08 05:32:33 │ 12-08 05:32:33 │ 12-08 05:32:44 ║
> ║ 10  │ 20221208133647230 │ rollback │ COMPLETED │ Rolls back 20221208133222583 │ 12-08 05:36:47 │ 12-08 05:36:48 │ 12-08 05:36:57 ║
> MDT timeline: 
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:32 20221208133227028.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff   548 Dec  8 05:32 20221208133227028.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:32 20221208133227028.deltacommit
> -rw-r--r--@ 1 nsb  staff  1938 Dec  8 05:34 20221208133227028001.compaction.requested
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 20221208133227028001.compaction.inflight
> -rw-r--r--@ 1 nsb  staff  7556 Dec  8 05:34 20221208133227028001.commit
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:34 20221208132316380.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff  3049 Dec  8 05:34 20221208132316380.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  8207 Dec  8 05:35 20221208132316380.deltacommit
> -rw-r--r--@ 1 nsb  staff     0 Dec  8 05:36 20221208133647230.deltacommit.requested
> -rw-r--r--@ 1 nsb  staff   548 Dec  8 05:36 20221208133647230.deltacommit.inflight
> -rw-r--r--@ 1 nsb  staff  6042 Dec  8 05:36 20221208133647230.deltacommit
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5507) SparkSQL can not read the latest change data without execute "refresh table xxx"

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5507:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> SparkSQL can not read the latest change data without execute "refresh table 
> xxx"
> 
>
> Key: HUDI-5507
> URL: https://issues.apache.org/jira/browse/HUDI-5507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.12.2
>Reporter: Danny Chen
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5557) Wrong candidate files found in metadata table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5557:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Wrong candidate files found in metadata table 
> --
>
> Key: HUDI-5557
> URL: https://issues.apache.org/jira/browse/HUDI-5557
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, spark-sql
>Affects Versions: 0.12.2
>Reporter: ruofan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.3, 0.14.0
>
>
> Suppose the hudi table has five fields, but only two fields are indexed. When 
> part of the filter condition in SQL comes from indexed fields and the other 
> part comes from non-indexed fields, the candidate files queried from the 
> metadata table are wrong.
> For example, consider the following hudi table schema:
> {code:java}
> name: varchar(128)
> age: int
> addr: varchar(128)
> city: varchar(32)
> job: varchar(32) {code}
> table properties
> {code:java}
> hoodie.table.type=MERGE_ON_READ
> hoodie.metadata.enable=true
> hoodie.metadata.index.column.stats.enable=true
> hoodie.metadata.index.column.stats.column.list='name,city'
> hoodie.enable.data.skipping=true {code}
> sql
> {code:java}
> select * from hudi_table where name='tom' and age=18;  {code}
> If we set hoodie.enable.data.skipping=false, the data can be found. But if we 
> set hoodie.enable.data.skipping=true, we can't find the expected data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5463:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta 
> commit
> --
>
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> As of now, any rollback in DT is another DC in MDT. This may not scale for 
> the record level index in MDT since we have to add 1000s of delete records 
> and finally have to resolve all valid and invalid records. So, it's better to 
> roll back the commit in MDT as well instead of doing a DC. 
>  
> Impact: 
> The record level index is unusable w/o this change. While fixing other 
> rollback related tickets, do consider this as a possible option if it 
> simplifies other fixes. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5442) Fix HiveHoodieTableFileIndex to use lazy listing

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5442:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix HiveHoodieTableFileIndex to use lazy listing
> 
>
> Key: HUDI-5442
> URL: https://issues.apache.org/jira/browse/HUDI-5442
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, trino-presto
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, HiveHoodieTableFileIndex hard-codes the shouldListLazily to false, 
> using eager listing only.  This leads to scanning all table partitions in the 
> file index, regardless of the queryPaths provided (for Trino Hive connector, 
> only one partition is passed in).
> {code:java}
> public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
>                                 HoodieTableMetaClient metaClient,
>                                 TypedProperties configProperties,
>                                 HoodieTableQueryType queryType,
>                                 List<Path> queryPaths,
>                                 Option<String> specifiedQueryInstant,
>                                 boolean shouldIncludePendingCommits
> ) {
>   super(engineContext,
>       metaClient,
>       configProperties,
>       queryType,
>       queryPaths,
>       specifiedQueryInstant,
>       shouldIncludePendingCommits,
>       true,
>       new NoopCache(),
>       false);
> } {code}
> After flipping it to true for testing, the following exception is thrown.
> {code:java}
> io.trino.spi.TrinoException: Failed to parse partition column values from the 
> partition-path: likely non-encoded slashes being used in partition column's 
> values. You can try to work this around by switching listing mode to eager
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
>     at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
>     at io.trino.$gen.Trino_39220221217_092723_2.run(Unknown Source)
>     at 
> io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to parse 
> partition column values from the partition-path: likely non-encoded slashes 
> being used in partition column's values. You can try to work this around by 
> switching listing mode to eager
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
>     at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
>     at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
>     at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>     at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>     at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
>     at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
>     at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
>     at 
> org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
>     at 
> org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
>     at 
> io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
>     at 
> io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
>     at 

[jira] [Updated] (HUDI-5520) Fail MDT when list of log files grows unboundedly

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5520:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fail MDT when list of log files grows unboundedly
> -
>
> Key: HUDI-5520
> URL: https://issues.apache.org/jira/browse/HUDI-5520
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5436) Auto repair tool for MDT out of sync

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5436:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Auto repair tool for MDT out of sync
> 
>
> Key: HUDI-5436
> URL: https://issues.apache.org/jira/browse/HUDI-5436
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> Can we write a spark-submit job to repair any out-of-sync issues w/ MDT? For 
> example, if MDT validation failed for a given table, we don't have a good way 
> to fix the MDT.
> So, we should develop a spark-submit job which will try to deduce from which 
> commit the out-of-sync started and try to fix just the delta.
>  
> The idea here is:
> Try running the validation job for the latest files at every commit, starting 
> from the latest in reverse chronological order. At some point validation will 
> succeed; let's call that commit N.
> We can add a savepoint to MDT at commit N and restore the table to that 
> commit N.
> Then we can take any new commits after commit N from the data table and apply 
> them one by one to MDT.
>  
> Once complete, we can run the validation tool again to ensure it's in good 
> shape. A rough sketch of this procedure follows.
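> A rough sketch of the proposed repair procedure (purely illustrative; every 
> helper below is a hypothetical placeholder, not an existing Hudi API): 
> {code:java}
> static void repairMetadataTable(List<String> commitsLatestFirst) {
>   // 1. Walk commits from latest to earliest until validation passes; call it commit N.
>   String commitN = null;
>   for (String instant : commitsLatestFirst) {
>     if (validateMdtAt(instant)) {          // hypothetical: run MDT validation as of this commit
>       commitN = instant;
>       break;
>     }
>   }
>   if (commitN == null) {
>     return;                                // nothing validates; a full rebuild is needed instead
>   }
>   // 2. Savepoint the MDT at commit N and restore it to that state.
>   savepointAndRestoreMdt(commitN);         // hypothetical helper
>   // 3. Re-apply every data-table commit after N to the MDT, one by one.
>   for (String instant : commitsAfter(commitN, commitsLatestFirst)) {   // hypothetical helper
>     applyCommitToMdt(instant);             // hypothetical helper
>   }
>   // 4. Validate once more to confirm the MDT is back in sync.
>   validateMdtAt(latestOf(commitsLatestFirst));   // hypothetical helper
> }
> {code}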



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5374) Use KeyGeneratorFactory class for instantiating a KeyGenerator

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5374:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Use KeyGeneratorFactory class for instantiating a KeyGenerator
> --
>
> Key: HUDI-5374
> URL: https://issues.apache.org/jira/browse/HUDI-5374
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently the configs hoodie.datasource.write.keygenerator.class and 
> hoodie.datasource.write.keygenerator.type are used in multiple areas to 
> create a key generator. The idea is to reuse the *KeyGeneratorFactory classes 
> for instantiating KeyGenerators.
> The Jira adds a KeyGeneratorFactory base class; HoodieSparkKeyGeneratorFactory 
> and HoodieAvroKeyGeneratorFactory extend this base class. These classes are 
> then used throughout the code for creating KeyGenerators.
> Based on Github issue: [https://github.com/apache/hudi/issues/7291]
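> For illustration, a minimal sketch of the intended usage (hedged: the exact 
> factory method name is an assumption and may differ from the actual 
> implementation): 
> {code:java}
> TypedProperties props = new TypedProperties();
> props.put("hoodie.datasource.write.keygenerator.type", "SIMPLE");
> // Instead of resolving keygenerator.class / keygenerator.type at every call
> // site, all call sites go through a single factory:
> KeyGenerator keyGenerator = HoodieSparkKeyGeneratorFactory.createKeyGenerator(props);
> {code}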



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5271) Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5271:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception
> 
>
> Key: HUDI-5271
> URL: https://issues.apache.org/jira/browse/HUDI-5271
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Exception detail in https://github.com/apache/hudi/issues/7284



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5364) Make sure Hudi's Column Stats are wired into Spark's relation stats

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5364:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure Hudi's Column Stats are wired into Spark's relation stats
> ---
>
> Key: HUDI-5364
> URL: https://issues.apache.org/jira/browse/HUDI-5364
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, we're leveraging CSI exclusively to better prune the target files.
> Additionally, we should wire in stats from CSI into Spark's 
> `CatalogStatistics` which in turn will be leveraged by Spark's Optimization 
> rules for better planning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5385) Make behavior of keeping File Writers open configurable

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5385:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make behavior of keeping File Writers open configurable
> ---
>
> Key: HUDI-5385
> URL: https://issues.apache.org/jira/browse/HUDI-5385
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, when writing in Spark we keep the File Writers for 
> individual partitions open for as long as we're processing the batch, which 
> entails that all of the data written out is kept in memory (at least the 
> last row-group in the case of Parquet writers) until the batch is fully processed and 
> all of the writers are closed.
> While this allows us to better control how many files are created in every 
> partition (we keep the writer open and hence don't need to create a new 
> file when a new record comes in), it brings the penalty of keeping all of the 
> data in memory, potentially leading to OOMs, longer GC cycles, etc.
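> As a rough illustration of the knob being proposed (the config name and classes below are hypothetical, not existing Hudi code):
> {code:java}
> import java.io.Closeable;
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.function.Function;
> 
> // Hypothetical sketch: when keepWritersOpen is false, the previous partition's
> // writer is closed as soon as we move on to another partition, bounding the
> // memory pinned by open writers at the cost of potentially more (smaller) files.
> class PartitionWriterPool implements Closeable {
> 
>   interface RecordWriter extends Closeable {
>     void write(Object record) throws IOException;
>   }
> 
>   private final boolean keepWritersOpen;   // e.g. a config like "hoodie.write.keep.writers.open" (hypothetical)
>   private final Function<String, RecordWriter> writerFactory;
>   private final Map<String, RecordWriter> openWriters = new HashMap<>();
>   private String currentPartition;
> 
>   PartitionWriterPool(boolean keepWritersOpen, Function<String, RecordWriter> writerFactory) {
>     this.keepWritersOpen = keepWritersOpen;
>     this.writerFactory = writerFactory;
>   }
> 
>   void write(String partition, Object record) throws IOException {
>     if (!keepWritersOpen && currentPartition != null && !currentPartition.equals(partition)) {
>       RecordWriter previous = openWriters.remove(currentPartition);
>       if (previous != null) {
>         previous.close();  // release the memory held by the previous partition's writer
>       }
>     }
>     currentPartition = partition;
>     openWriters.computeIfAbsent(partition, writerFactory).write(record);
>   }
> 
>   @Override
>   public void close() throws IOException {
>     for (RecordWriter writer : openWriters.values()) {
>       writer.close();
>     }
>     openWriters.clear();
>   }
> }
> {code}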



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5322) Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's schema

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5322:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Bulk-insert (row-writing) is not rewriting incoming dataset into Writer's 
> schema
> 
>
> Key: HUDI-5322
> URL: https://issues.apache.org/jira/browse/HUDI-5322
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.14.0
>
>
> Row-writing Bulk-insert has to rewrite the incoming dataset into the finalized 
> Writer's schema; instead, it currently uses the incoming dataset as is, 
> deviating in semantics from the non-Row-writing flow (and other operations).
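> A minimal sketch (not the actual bulk-insert code path) of what rewriting the incoming dataset into the writer's schema could look like with the Spark Java API, assuming every writer-schema column is present in the incoming dataset:
> {code:java}
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import static org.apache.spark.sql.functions.col;
> 
> final class SchemaRewrite {
>   // Project the incoming dataset onto the finalized writer schema, casting every
>   // column to the writer's type, so row-writing matches the non-row-writing flow.
>   static Dataset<Row> rewriteToWriterSchema(Dataset<Row> incoming, StructType writerSchema) {
>     StructField[] fields = writerSchema.fields();
>     Column[] projection = new Column[fields.length];
>     for (int i = 0; i < fields.length; i++) {
>       projection[i] = col(fields[i].name()).cast(fields[i].dataType()).as(fields[i].name());
>     }
>     return incoming.select(projection);
>   }
> }
> {code}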



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5405) Avoid using Projections in generic Merge Into DMLs

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5405:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Avoid using Projections in generic Merge Into DMLs
> --
>
> Key: HUDI-5405
> URL: https://issues.apache.org/jira/browse/HUDI-5405
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, `MergeIntoHoodieTableCommand` squarely relies on the semantics 
> implemented by `ExpressionPayload` to be able to insert/update records.
> While this is necessary since MERGE INTO semantics enable users to do sophisticated 
> and fine-grained updates (for example, partial updates), it is not necessary in 
> the most generic case:
>  
> {code:java}
> MERGE INTO target
> USING ... source
> ON target.id = source.id
> WHEN MATCHED THEN UPDATE *
> WHEN NOT MATCHED THEN INSERT *{code}
> This is essentially just a SQL way of implementing an upsert: if there are 
> matching records in the table we update them; otherwise we insert.
> In this case there's actually no need to use ExpressionPayload at all, and we 
> can simply use the normal Hudi upserting flow to handle it (avoiding all of 
> the ExpressionPayload overhead).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5361) Propagate Hudi properties set in Spark's SQLConf to Hudi

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5361:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Propagate Hudi properties set in Spark's SQLConf to Hudi
> 
>
> Key: HUDI-5361
> URL: https://issues.apache.org/jira/browse/HUDI-5361
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Alexey Kudinkin
>Assignee: Jonathan Vexler
>Priority: Critical
> Fix For: 0.14.0
>
>
> Currently, the only property we propagate from Spark's SQLConf is 
> hoodie.metadata.enable.
> Instead, we should pull all of the Hudi-related configs from SQLConf 
> and pass them to Hudi.
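> A rough sketch of the intended behavior (method names here are illustrative; the session configuration map could come from, e.g., spark.conf().getAll()):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> final class SqlConfPropagation {
>   // Forward every "hoodie."-prefixed entry from the Spark session configuration
>   // into the Hudi write options, instead of only hoodie.metadata.enable.
>   static Map<String, String> hudiOptions(Map<String, String> sessionConf,
>                                          Map<String, String> explicitOptions) {
>     Map<String, String> merged = new HashMap<>();
>     for (Map.Entry<String, String> entry : sessionConf.entrySet()) {
>       if (entry.getKey().startsWith("hoodie.")) {
>         merged.put(entry.getKey(), entry.getValue());
>       }
>     }
>     // Options passed explicitly on the write still win over session-level defaults.
>     merged.putAll(explicitOptions);
>     return merged;
>   }
> }
> {code}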



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5438) Benchmark calls w/ metadata enabled and ensure no calls to direct FS

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5438:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Benchmark calls w/ metadata enabled and ensure no calls to direct FS
> 
>
> Key: HUDI-5438
> URL: https://issues.apache.org/jira/browse/HUDI-5438
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> We need to benchmark calls to S3 (via S3 access logs) and ensure that, when the metadata 
> table is enabled, we don't make any direct calls to the file system. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5352) Jackson fails to serialize LocalDate when updating Delta Commit metadata

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5352:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Jackson fails to serialize LocalDate when updating Delta Commit metadata
> 
>
> Key: HUDI-5352
> URL: https://issues.apache.org/jira/browse/HUDI-5352
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, running TestColumnStatsIndex on Spark 3.3 fails the MOR tests due 
> to Jackson not being able to serialize LocalDate as is, requiring an 
> additional JSR310 dependency.
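> For reference, the usual fix for this class of failure is registering Jackson's JSR-310 module (a generic snippet, not Hudi's actual serialization code; it assumes the jackson-datatype-jsr310 artifact is on the classpath):
> {code:java}
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
> import java.time.LocalDate;
> 
> public class LocalDateJsonDemo {
>   public static void main(String[] args) throws Exception {
>     // Without the JavaTimeModule, serializing java.time.LocalDate fails.
>     ObjectMapper mapper = new ObjectMapper().registerModule(new JavaTimeModule());
>     System.out.println(mapper.writeValueAsString(LocalDate.of(2022, 12, 1)));
>     // prints [2022,12,1] by default, or "2022-12-01" if WRITE_DATES_AS_TIMESTAMPS is disabled
>   }
> }
> {code}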
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5319) NPE in Bloom Filter Index

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5319:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> NPE in Bloom Filter Index
> -
>
> Key: HUDI-5319
> URL: https://issues.apache.org/jira/browse/HUDI-5319
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
>
> {code:java}
> /12/02 11:05:49 WARN TaskSetManager: Lost task 3.0 in stage 1098.0 (TID 
> 1300185) (ip-172-31-23-246.us-east-2.compute.internal executor 10): 
> java.lang.RuntimeException: org.apache.hudi.exception.HoodieIndexException: 
> Error checking bloom filter index.
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>         at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
>         at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:183)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>         at org.apache.spark.scheduler.Task.run(Task.scala:138)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieIndexException: Error checking 
> bloom filter index.
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>         at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>         ... 16 more
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:87)
>         at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:92)
>         ... 18 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8445: [HUDI-3088] Use Spark 3.2 as default Spark version

2023-05-22 Thread via GitHub


hudi-bot commented on PR #8445:
URL: https://github.com/apache/hudi/pull/8445#issuecomment-1558546855

   
   ## CI report:
   
   * fe494c5e09f8c3a57446834c86ad82904bcda585 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #8775: [HUDI-5584] Metasync update props when changed

2023-05-22 Thread via GitHub


xushiyan commented on code in PR #8775:
URL: https://github.com/apache/hudi/pull/8775#discussion_r1201537869


##
hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java:
##
@@ -477,13 +472,19 @@ private static Table getTable(AWSGlue awsGlue, String databaseName, String table
 }
   }
 
-  private static void updateTableParameters(AWSGlue awsGlue, String databaseName, String tableName, Map updatingParams, boolean shouldReplace) {
-final Map newParams = new HashMap<>();
+  private static boolean updateTableParameters(AWSGlue awsGlue, String databaseName, String tableName, Map updatingParams) {
+if (isNullOrEmpty(updatingParams)) {

Review Comment:
   i don't think we support the delete behavior. even today, if you check the 
caller `updateTableProperties()` it'll also return early if input params is 
empty



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4944) The encoded slash (%2F) in partition path is not properly decoded during Spark read

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4944:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> The encoded slash (%2F) in partition path is not properly decoded during 
> Spark read
> ---
>
> Key: HUDI-4944
> URL: https://issues.apache.org/jira/browse/HUDI-4944
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: Untitled
>
>
> When the source partitioned parquet table of the bootstrap operation has an 
> encoded slash (%2F) in the partition path, e.g., 
> "partition_path=2015%2F03%2F17", the target bootstrapped Hudi table cannot be 
> read after the metadata-only bootstrap (whose bootstrap index stores the data 
> file path containing the partition path with the encoded slash), failing with a 
> FileNotFound exception.  The root cause is that the encoding of 
> the slash is lost when creating the new Path instance from the URI (see 
> below: "partition_path=2015/03/17" appears instead of 
> "partition_path=2015%2F03%2F17").
> {code:java}
> Caused by: java.io.FileNotFoundException: File does not exist: 
> hdfs://localhost:62738/user/ethan/test_dataset_bootstrapped/partition_path=2015/03/17/e0fa3466-d3bc-43f7-b586-2f95d8745095_3-161-675_01.parquet
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1528)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
>     at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1521)
>     at 
> org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
>     at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(Spark24HoodieParquetFileFormat.scala:131)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(Spark24HoodieParquetFileFormat.scala:130)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(Spark24HoodieParquetFileFormat.scala:134)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.Spark24HoodieParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(Spark24HoodieParquetFileFormat.scala:111)
>     at 
> org.apache.hudi.HoodieDataSourceHelper$$anonfun$buildHoodieParquetReader$1.apply(HoodieDataSourceHelper.scala:71)
>     at 
> org.apache.hudi.HoodieDataSourceHelper$$anonfun$buildHoodieParquetReader$1.apply(HoodieDataSourceHelper.scala:70)
>     at org.apache.hudi.HoodieBootstrapRDD.compute(HoodieBootstrapRDD.scala:60)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) {code}
> The path conversion that causes the problem is in the code below.  "new 
> URI(file.filePath)" decodes the "%2F" and converts the slash.
> Spark24HoodieParquetFileFormat (same for Spark32PlusHoodieParquetFileFormat)
> {code:java}
> val fileSplit =
>   new FileSplit(new Path(new URI(file.filePath)), file.start, file.length, 
> Array.empty) {code}
> This fails the tests below and we need to use a partition path without 
> slashes in the value for now: 
> TestHoodieDeltaStreamer#testBulkInsertsAndUpsertsWithBootstrap
> ITTestHoodieDemo#testParquetDemo
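> A small standalone demo of the decoding behavior described above (the S3 path is made up):
> {code:java}
> import java.net.URI;
> 
> public class EncodedSlashDemo {
>   public static void main(String[] args) throws Exception {
>     String filePath = "s3://bucket/table/partition_path=2015%2F03%2F17/data.parquet";
>     // URI#getPath() decodes percent-escapes, so the encoded slash becomes a real one,
>     // turning one partition directory into three levels and breaking the file lookup.
>     System.out.println(new URI(filePath).getPath());
>     // prints: /table/partition_path=2015/03/17/data.parquet
>   }
> }
> {code}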



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4937:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, `HoodieTable` is holding a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This has proven to be wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4738) [MOR] Bloom Index missing new records inserted into Log files

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4738:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> [MOR] Bloom Index missing new records inserted into Log files
> -
>
> Key: HUDI-4738
> URL: https://issues.apache.org/jira/browse/HUDI-4738
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.14.0
>
>
> Currently, the Bloom Index is implemented under the following assumption: a 
> _file-group (once written) has a fixed set of records that cannot be 
> expanded_ (this is encoded through the assumption that at least one version of every 
> record within the file group is stored within its base file).
> This is relied upon when we tag incoming records with the locations of the 
> file-groups they could potentially belong to (in case such records are 
> updates), by fetching the Bloom Index info from either a) the base file or b) the 
> record in the MT Bloom Index associated with the particular file-group id.
>  
> However, this assumption is not always true, since it's possible for _new_ 
> records to be inserted into the log-files, which means that the record 
> key-set of a single file-group could expand. This could potentially lead to 
> some records that were previously written to log-files being duplicated.
>  
> We need to reconcile these 2 aspects and do either of:
>  # Disallow expansion of the file-group record set (by not allowing inserts 
> into log-files)
>  # Fix the Bloom Index implementation to also check log-files during tagging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5080) UnpersistRdds unpersist all rdds in the spark context

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5080:

Fix Version/s: (was: 0.13.1)

> UnpersistRdds unpersist all rdds in the spark context
> -
>
> Key: HUDI-5080
> URL: https://issues.apache.org/jira/browse/HUDI-5080
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> In SparkRDDWriteClient, we have a method to clean up persisted Rdds to free 
> up the space occupied. 
> [https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java#L584]
> But the issue is, it cleans up all persisted RDDs in the given Spark context. 
> This will impact async compaction or any other async table services that are 
> running, and if there are multiple streams writing to different tables, this 
> will cause a huge impact. 
>  
> This also needs to be fixed with DeltaSync. 
> [https://github.com/apache/hudi/blob/b78c3441c4e28200abec340eaff852375764cbdb/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L345]
>  
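> A minimal sketch of the kind of fix implied above (illustrative only, not the actual SparkRDDWriteClient code): each client tracks the RDDs it persisted itself and unpersists only those.
> {code:java}
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.storage.StorageLevel;
> 
> class TrackedPersistence {
>   private final Set<JavaRDD<?>> persistedByThisClient = ConcurrentHashMap.newKeySet();
> 
>   <T> JavaRDD<T> persistTracked(JavaRDD<T> rdd) {
>     rdd.persist(StorageLevel.MEMORY_AND_DISK());
>     persistedByThisClient.add(rdd);
>     return rdd;
>   }
> 
>   // Only unpersist what this client persisted, leaving RDDs cached by async
>   // table services or other writers in the same SparkContext untouched.
>   void unpersistTracked() {
>     for (JavaRDD<?> rdd : persistedByThisClient) {
>       rdd.unpersist();
>     }
>     persistedByThisClient.clear();
>   }
> }
> {code}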



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4947) Missing .hoodie/hoodie.properties in Hudi table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4947:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Missing .hoodie/hoodie.properties in Hudi table
> ---
>
> Key: HUDI-4947
> URL: https://issues.apache.org/jira/browse/HUDI-4947
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> At some point, the ingestion job reports that hoodie.properties is missing, 
> and neither hoodie.properties nor hoodie.properties.backup is present.  
> Sample stacktrace:
>  
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieIOException: Could not load Hoodie 
> properties from s3://.../.hoodie/hoodie.properties
> at 
> org.apache.hudi.common.table.HoodieTableConfig.(HoodieTableConfig.java:254)
> at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:125)
> at 
> org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:78)
> at 
> org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:668)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$getHoodieTableConfig$1(HoodieSparkSqlWriter.scala:756)
> at scala.Option.getOrElse(Option.scala:189)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.getHoodieTableConfig(HoodieSparkSqlWriter.scala:757)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:85)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:165) 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4922) Presto query of bootstrapped data returns null

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4922:

Fix Version/s: 0.14.0
   (was: 0.13.1)

>  Presto query of bootstrapped data returns null
> ---
>
> Key: HUDI-4922
> URL: https://issues.apache.org/jira/browse/HUDI-4922
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
>
> https://github.com/apache/hudi/issues/6532



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5092) Querying Hudi table throws NoSuchMethodError in Databricks runtime

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5092:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Querying Hudi table throws NoSuchMethodError in Databricks runtime 
> ---
>
> Key: HUDI-5092
> URL: https://issues.apache.org/jira/browse/HUDI-5092
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
> Attachments: image (1).png, image.png
>
>
> Originally reported by the user: 
> [https://github.com/apache/hudi/issues/6137]
>  
> The crux of the issue is that Databricks's DBR runtime diverges from OSS Spark, 
> and in this case the `FileStatusCache` API is very clearly divergent between the two. 
> There are a few approaches we can take: 
>  # Avoid reliance on Spark's FileStatusCache implementation altogether and 
> rely on our own one
>  # Apply more staggered approach where we first try to use Spark's 
> FileStatusCache and if it doesn't match expected API, we fallback to our own 
> impl
>  
> Approach #1 would actually mean that we're not sharing the cache implementation 
> with Spark, which in turn would entail that in some cases we might be keeping 2 
> instances of the same cache. Approach #2 remediates that and allows us to only 
> fall back in case the API is not compatible. 
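> A rough sketch of approach #2 (names below are illustrative, not the actual Spark/Hudi classes): attempt to adapt the runtime-provided cache, and fall back to a Hudi-owned cache when the API diverges.
> {code:java}
> import java.util.List;
> import java.util.Map;
> import java.util.Optional;
> import java.util.concurrent.ConcurrentHashMap;
> 
> interface ListingCache {
>   Optional<List<String>> getLeafFiles(String path);
>   void putLeafFiles(String path, List<String> files);
> }
> 
> final class ListingCaches {
>   // Trivial Hudi-owned fallback cache.
>   static final class HudiOwnedCache implements ListingCache {
>     private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
>     public Optional<List<String>> getLeafFiles(String path) {
>       return Optional.ofNullable(cache.get(path));
>     }
>     public void putLeafFiles(String path, List<String> files) {
>       cache.put(path, files);
>     }
>   }
> 
>   // sparkBackedAdapter is empty when adapting the runtime's cache failed
>   // (e.g. a divergent DBR API); in that case we fall back instead of failing the query.
>   static ListingCache resolve(Optional<ListingCache> sparkBackedAdapter) {
>     return sparkBackedAdapter.orElseGet(HudiOwnedCache::new);
>   }
> }
> {code}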



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5015) Cleaner does not work properly when metadata table is enabled

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5015:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Cleaner does not work properly when metadata table is enabled
> -
>
> Key: HUDI-5015
> URL: https://issues.apache.org/jira/browse/HUDI-5015
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.14.0
>
>
> Please see [https://github.com/apache/hudi/pull/6926] for more context.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4958) Provide accurate numDeletes in commit metadata

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4958:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Provide accurate numDeletes in commit metadata
> --
>
> Key: HUDI-4958
> URL: https://issues.apache.org/jira/browse/HUDI-4958
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> When doing a simple computation of {{numInserts - numDeletes}} for all the 
> commits, this leads to a negative total record count.  We need to check whether the 
> numbers of inserts and deletes are accurate when both inserts and deletes exist in the 
> same input batch for upsert.
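> As a worked illustration of the check being described (the classes here are hypothetical, not Hudi's commit metadata model):
> {code:java}
> import java.util.List;
> 
> final class RecordCountSanityCheck {
>   static final class CommitCounts {
>     final long numInserts;
>     final long numDeletes;
>     CommitCounts(long numInserts, long numDeletes) {
>       this.numInserts = numInserts;
>       this.numDeletes = numDeletes;
>     }
>   }
> 
>   // A negative running total means numInserts/numDeletes were not reported
>   // accurately for some commit (e.g. inserts and deletes in the same upsert batch).
>   static long totalRecords(List<CommitCounts> commits) {
>     long total = 0;
>     for (CommitCounts c : commits) {
>       total += c.numInserts - c.numDeletes;
>       if (total < 0) {
>         throw new IllegalStateException("Negative running record count: " + total);
>       }
>     }
>     return total;
>   }
> }
> {code}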



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5229) Add flink avro version entry in root pom

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-5229:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Add flink avro version entry in root pom
> 
>
> Key: HUDI-5229
> URL: https://issues.apache.org/jira/browse/HUDI-5229
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4777) Flink gen bucket index of mor table not consistent with spark lead to duplicate bucket issue

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4777:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flink gen bucket index of mor table not consistent with spark lead to 
> duplicate bucket issue
> 
>
> Key: HUDI-4777
> URL: https://issues.apache.org/jira/browse/HUDI-4777
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: JinxinTang
>Assignee: JinxinTang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4921:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix last completed commit in CleanPlanner
> -
>
> Key: HUDI-4921
> URL: https://issues.apache.org/jira/browse/HUDI-4921
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Recently we added the last completed commit as part of the clean commit metadata. 
> Ideally the value should represent the last completed commit in the timeline 
> before which there are no inflight commits, but we just take the last 
> completed commit in the active timeline and set that value. 
> This needs fixing. 
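> A small sketch of the intended selection logic (the types below are illustrative, not the actual Hudi timeline API):
> {code:java}
> import java.util.List;
> 
> final class LastCompletedBeforeInflight {
>   static final class Instant {
>     final String timestamp;
>     final boolean completed;
>     Instant(String timestamp, boolean completed) {
>       this.timestamp = timestamp;
>       this.completed = completed;
>     }
>   }
> 
>   // Assumes instants are sorted by timestamp ascending, as in the active timeline.
>   // Returns the latest completed instant with no earlier inflight instant,
>   // rather than simply the last completed instant.
>   static String lastCompletedWithNoEarlierInflight(List<Instant> instants) {
>     String earliestInflight = null;
>     for (Instant instant : instants) {
>       if (!instant.completed) {
>         earliestInflight = instant.timestamp;
>         break;
>       }
>     }
>     String candidate = null;
>     for (Instant instant : instants) {
>       boolean beforeInflight = earliestInflight == null
>           || instant.timestamp.compareTo(earliestInflight) < 0;
>       if (instant.completed && beforeInflight) {
>         candidate = instant.timestamp;
>       }
>     }
>     return candidate;
>   }
> }
> {code}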



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4854) Deltastreamer does not respect partition selector regex for metadata-only bootstrap

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4854:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Deltastreamer does not respect partition selector regex for metadata-only 
> bootstrap
> ---
>
> Key: HUDI-4854
> URL: https://issues.apache.org/jira/browse/HUDI-4854
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4852) Incremental sync not updating pending file groups under clustering

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4852:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Incremental sync not updating pending file groups under clustering
> --
>
> Key: HUDI-4852
> URL: https://issues.apache.org/jira/browse/HUDI-4852
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Surya Prasanna Yalla
>Assignee: Surya Prasanna Yalla
>Priority: Critical
> Fix For: 0.14.0
>
>
> Pending file groups under clustering are not updated through incremental sync 
> calls. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4818) Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4818:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex
> ---
>
> Key: HUDI-4818
> URL: https://issues.apache.org/jira/browse/HUDI-4818
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, using `CustomKeyGenerator` with the partition-path config 
> {hoodie.datasource.write.partitionpath.field=ts:timestamp} fails with:
> {code:java}
> Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to 
> `LongType` for partition column `ts_ms`
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
>   at 
> org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at 
> org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193)
>  {code}
>  
> This occurs because SparkHoodieTableFileIndex produces an incorrect partition schema 
> at XXX
> where it properly handles only `TimestampBasedKeyGenerator`s but not the 
> other key-generators that might be changing the data-type of the 
> partition-value as compared to the source partition-column (in this case the 
> source table schema has `ts` as a long, but the key generator produces the 
> partition-value as a string).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] bvaradar commented on a diff in pull request #8452: [HUDI-6077] Add more partition push down filters

2023-05-22 Thread via GitHub


bvaradar commented on code in PR #8452:
URL: https://github.com/apache/hudi/pull/8452#discussion_r1201504641


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -96,11 +109,32 @@ public List getPartitionPathWithPathPrefixes(List relativePathPr
   }
 
   private List getPartitionPathWithPathPrefix(String relativePathPrefix) throws IOException {
+return getPartitionPathWithPathPrefixUsingFilterExpression(relativePathPrefix, null, null);
+  }
+
+  private List getPartitionPathWithPathPrefixUsingFilterExpression(String relativePathPrefix,
+   Types.RecordType partitionFields,
+   Expression expression) throws IOException {
 List pathsToList = new CopyOnWriteArrayList<>();
 pathsToList.add(StringUtils.isNullOrEmpty(relativePathPrefix)
-? new Path(datasetBasePath) : new Path(datasetBasePath, relativePathPrefix));
+? dataBasePath.get() : new Path(dataBasePath.get(), relativePathPrefix));
 List partitionPaths = new CopyOnWriteArrayList<>();
 
+int partitionLevel = -1;
+boolean needPushDownExpressions;
+// Not like `HoodieBackedTableMetadata`, since we don't know the exact partition levels here,
+// given it's possible that partition values contains `/`, which could affect
+// the result to get right `partitionValue` when listing paths, here we have
+// to make it more strict that `urlEncodePartitioningEnabled` must be enabled.
+// TODO better enable urlEncodePartitioningEnabled if hiveStylePartitioningEnabled is enabled?

Review Comment:
   Shall we disable partition push down filters by default when 
FileSystemBackedTableMetadata is used ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4632) Remove the force active property for flink1.14 profile

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4632:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Remove the force active property for flink1.14 profile
> --
>
> Key: HUDI-4632
> URL: https://issues.apache.org/jira/browse/HUDI-4632
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.1
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4643) MergeInto syntax WHEN MATCHED is optional but must be set

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4643:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> MergeInto syntax WHEN MATCHED is optional but must be set
> -
>
> Key: HUDI-4643
> URL: https://issues.apache.org/jira/browse/HUDI-4643
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
>  
> {code:java}
> spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long,
> | dt string
> |) using hudi
> | location '${tmp.getCanonicalPath}/$tableName'
> | tblproperties (
> | primaryKey ='id',
> | preCombineField = 'ts'
> | )
> """.stripMargin)
> // Insert data
> spark.sql(s"insert into $tableName select 1, 'a1', 1, 10, '2022-08-18'")
> spark.sql(
> s"""
> | merge into $tableName as t0
> | using (
> | select 1 as id, 'a1' as name, 11 as price, 110 as ts, '2022-08-19' as dt 
> union all
> | select 2 as id, 'a2' as name, 10 as price, 100 as ts, '2022-08-18' as dt
> | ) as s0
> | on t0.id = s0.id
> | when not matched then insert *
> """.stripMargin
> )
> {code}
>  
> {code:java}
> 11493 [Executor task launch worker for task 65] ERROR 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor  - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.AssertionError: assertion 
> failed: hoodie.payload.update.condition.assignments have not set
>     at 
> org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleInsertPartition(BaseSparkCommitActionExecutor.java:335)
>     at 
> org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:246)
>  
> {code}
>  
>  
> If hoodie.merge.allow.duplicate.on.inserts is set to true, the result is one more 
> record than expected:
> [1,a1,1.0,10,2022-08-18], [1,a1,11.0,110,2022-08-19], 
> [2,a2,10.0,100,2022-08-18]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4704) bulk insert overwrite table will delete the table and then recreate a table

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4704:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> bulk insert overwrite table will delete the table and then recreate a table
> ---
>
> Key: HUDI-4704
> URL: https://issues.apache.org/jira/browse/HUDI-4704
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.12.0
>Reporter: zouxxyy
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.14.0
>
>
> When hoodie.sql.bulk.insert.enable is enabled, executing insert overwrite 
> will delete the table and then recreate it, so time travel cannot be 
> performed.
>  
> {code:java}
> create table hudi_cow_test_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   'hoodie.sql.insert.mode' = 'non-strict',
>   'hoodie.sql.bulk.insert.enable' = 'true'
> );
> insert into hudi_cow_test_tbl select 1, 'a1', 1001, '2021-12-09', '11';
> insert overwrite hudi_cow_test_tbl select 3, 'a3', 1001, '2021-12-09', '11';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4542) Flink streaming query fails with ClassNotFoundException

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4542:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flink streaming query fails with ClassNotFoundException
> ---
>
> Key: HUDI-4542
> URL: https://issues.apache.org/jira/browse/HUDI-4542
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Reporter: Ethan Guo
>Priority: Critical
> Fix For: 0.14.0
>
> Attachments: Screen Shot 2022-08-04 at 17.17.42.png
>
>
> Environment: EMR 6.7.0 Flink 1.14.2
> Reproducible steps: Build Hudi Flink bundle from master
> {code:java}
> mvn clean package -DskipTests  -pl :hudi-flink1.14-bundle -am {code}
> Copy to EMR master node /lib/flink/lib
> Launch Flink SQL client:
> {code:java}
> cd /lib/flink && ./bin/yarn-session.sh --detached
> ./bin/sql-client.sh {code}
> Write a Hudi table with a few commits with metadata table enabled (no column 
> stats).  Then, run the following for the streaming query
> {code:java}
> CREATE TABLE t2(
>    uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
>    name VARCHAR(10),
>    age INT,
>    ts TIMESTAMP(3),
>    `partition` VARCHAR(20)
>  )
>  PARTITIONED BY (`partition`)
>  WITH (
>    'connector' = 'hudi',
>    'path' = 's3a://',
>    'table.type' = 'MERGE_ON_READ',
>    'read.streaming.enabled' = 'true',  -- this option enable the streaming 
> read
>    'read.start-commit' = '20220803165232362', -- specifies the start commit 
> instant time
>    'read.streaming.check-interval' = '4' -- specifies the check interval for 
> finding new source commits, default 60s.
>  ); {code}
> {code:java}
> select * from t2; {code}
> {code:java}
> Flink SQL> select * from t2;
> 2022-08-05 00:12:43,635 INFO  org.apache.hadoop.metrics2.impl.MetricsConfig   
>              [] - Loaded properties from hadoop-metrics2.properties
> 2022-08-05 00:12:43,650 INFO  
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl            [] - Scheduled 
> Metric snapshot period at 300 second(s).
> 2022-08-05 00:12:43,650 INFO  
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl            [] - 
> s3a-file-system metrics system started
> 2022-08-05 00:12:47,722 INFO  org.apache.hadoop.fs.s3a.S3AInputStream         
>              [] - Switching to Random IO seek policy
> 2022-08-05 00:12:47,941 INFO  org.apache.hadoop.yarn.client.RMProxy           
>              [] - Connecting to ResourceManager at 
> ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:8032
> 2022-08-05 00:12:47,942 INFO  org.apache.hadoop.yarn.client.AHSProxy          
>              [] - Connecting to Application History server at 
> ip-172-31-9-157.us-east-2.compute.internal/172.31.9.157:10200
> 2022-08-05 00:12:47,942 INFO  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - No path for the flink jar passed. Using the location of 
> class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2022-08-05 00:12:47,942 WARN  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR 
> environment variable is set.The Flink YARN Client needs one of these to be 
> set to properly load the Hadoop configuration for accessing YARN.
> 2022-08-05 00:12:47,959 INFO  org.apache.flink.yarn.YarnClusterDescriptor     
>              [] - Found Web Interface 
> ip-172-31-3-92.us-east-2.compute.internal:39605 of application 
> 'application_1659656614768_0001'.
> [ERROR] Could not execute SQL statement. Reason:
> java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat{code}
> {code:java}
> 2022-08-04 17:12:59
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
>     at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
>     at 
> 

[jira] [Updated] (HUDI-4573) Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4573:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Fix HoodieMultiTableDeltaStreamer to write all tables in continuous mode
> 
>
> Key: HUDI-4573
> URL: https://issues.apache.org/jira/browse/HUDI-4573
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4541) Flink job fails with column stats enabled in metadata table due to NotSerializableException

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4541:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Flink job fails with column stats enabled in metadata table due to 
> NotSerializableException

> 
>
> Key: HUDI-4541
> URL: https://issues.apache.org/jira/browse/HUDI-4541
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.14.0
>
> Attachments: Screen Shot 2022-08-04 at 17.10.05.png
>
>
> Environment: EMR 6.7.0 Flink 1.14.2
> Reproducible steps: Build Hudi Flink bundle from master
> {code:java}
> mvn clean package -DskipTests  -pl :hudi-flink1.14-bundle -am {code}
> Copy to EMR master node /lib/flink/lib
> Launch Flink SQL client:
> {code:java}
> cd /lib/flink && ./bin/yarn-session.sh --detached
> ./bin/sql-client.sh {code}
> Run the following from the Flink quick start guide with metadata table, 
> column stats, and data skipping enabled
> {code:java}
> CREATE TABLE t1(
>   uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = 's3a://',
>   'table.type' = 'MERGE_ON_READ', -- this creates a MERGE_ON_READ table, by 
> default is COPY_ON_WRITE
>   'metadata.enabled' = 'true', -- enables multi-modal index and metadata table
>   'hoodie.metadata.index.column.stats.enable' = 'true', -- enables column 
> stats in metadata table
>   'read.data.skipping.enabled' = 'true' -- enables data skipping
> );
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4'); {code}
> !Screen Shot 2022-08-04 at 17.10.05.png|width=1130,height=463!
> Exception:
> {code:java}
> 2022-08-04 17:04:41
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
>     at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
>     at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
>     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>     at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>     at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>     at 

[jira] [Updated] (HUDI-4457) Make sure IT docker test return code non-zero when failed

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4457:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Make sure IT docker test return code non-zero when failed
> -
>
> Key: HUDI-4457
> URL: https://issues.apache.org/jira/browse/HUDI-4457
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Raymond Xu
>Priority: Major
> Fix For: 0.14.0
>
>
> There is an IT test case where the docker command runs and returns exit code 0, but the 
> test actually failed. This is misleading for troubleshooting.
> TODO
> 1. verify the behavior
> 2. fix it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4430) Incorrect type casting while reading HUDI table created with CustomKeyGenerator and unixtimestamp paritioning field

2023-05-22 Thread Yue Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Zhang updated HUDI-4430:

Fix Version/s: 0.14.0
   (was: 0.13.1)

> Incorrect type casting while reading HUDI table created with 
> CustomKeyGenerator and unixtimestamp paritioning field
> ---
>
> Key: HUDI-4430
> URL: https://issues.apache.org/jira/browse/HUDI-4430
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.12.0
>Reporter: Volodymyr Burenin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.14.0
>
>
> Hi,
> I have discovered an issue that doesn't play nicely with custom key 
> generators, basically anything that is not TimestampBasedKeyGenerator or 
> TimestampBasedAvroKeyGenerator.
> {{While trying to read a table that was created with these parameters (the 
> rest don't matter):}}
> {code:java}
> hoodie.datasource.write.recordkey.field=query_id,event_type
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.partitionpath.field=create_time_epoch_seconds:timestamp
> hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd{code}
> I get an error that looks like:
> {code:java}
> 22/07/20 20:32:48 DEBUG Spark32HoodieParquetFileFormat: Appending 
> StructType(StructField(create_time_epoch_seconds,LongType,true)) [2022/07/13]
> 22/07/20 20:32:48 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Long
>     at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
>     at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)
>     at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42)
>     at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195)
>     at 
> org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:66)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
>  {code}
> Apparently the issue is in the _partitionSchemaFromProperties function in 
> hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala, 
> which checks the key generator class type to decide when to use a StructType of String.
> Once the key generator is any known class that is not Timestamp-based, it basically uses whatever 
> type the source column has and then fails to retrieve the value.
> I have a proposal here which we probably need: give the user a way to force a 
> string type if needed, and add the ability to add a prefixed column that contains 
> the processed partition value. It could be done as two separate features.
> This problem is critical for me, so I have to change the Hoodie source code on my 
> end temporarily to make it work.
> Here is how I roughly changed the referenced function:
>  
> {code:java}
> /**
>  * Get the partition schema from the hoodie.properties.
>  */
> private lazy val _partitionSchemaFromProperties: StructType = {
>   val tableConfig = metaClient.getTableConfig
>   val partitionColumns = tableConfig.getPartitionFields
>   if (partitionColumns.isPresent) {
> val partitionFields = partitionColumns.get().map(column => 
> StructField("_hoodie_"+column, StringType))
> StructType(partitionFields)
>   } else {
> // If the partition columns have not been stored in hoodie.properties (for a 
> // table that was created earlier), we treat it as a non-partitioned table.
> logWarning("No partition columns available from hoodie.properties." +
>   " Partition pruning will not work")
> new StructType()
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

