[jira] [Updated] (HUDI-7996) Store partition type with partition fields in table configs
[ https://issues.apache.org/jira/browse/HUDI-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7996:
------------------------------
    Summary: Store partition type with partition fields in table configs  (was: Store partition field type in table configs)

> Store partition type with partition fields in table configs
> -----------------------------------------------------------
>
>          Key: HUDI-7996
>          URL: https://issues.apache.org/jira/browse/HUDI-7996
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Lokesh Jain
>     Assignee: Lokesh Jain
>     Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (HUDI-7996) Store partition field type in table configs
Lokesh Jain created HUDI-7996:
------------------------------

       Summary: Store partition field type in table configs
           Key: HUDI-7996
           URL: https://issues.apache.org/jira/browse/HUDI-7996
       Project: Apache Hudi
    Issue Type: Sub-task
      Reporter: Lokesh Jain
      Assignee: Lokesh Jain
[jira] [Created] (HUDI-7983) CDC query fails with ParanamerAnnotationIntrospector class not found
Lokesh Jain created HUDI-7983:
------------------------------

       Summary: CDC query fails with ParanamerAnnotationIntrospector class not found
           Key: HUDI-7983
           URL: https://issues.apache.org/jira/browse/HUDI-7983
       Project: Apache Hudi
    Issue Type: Bug
      Reporter: Lokesh Jain

When trying a CDC query, the following error is seen:
java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/paranamer/ParanamerAnnotationIntrospector

{code:java}
scala> spark.read.option("hoodie.datasource.read.begin.instanttime", 0).
     |   option("hoodie.datasource.query.type", "incremental").
     |   option("hoodie.datasource.query.incremental.format", "cdc").
     |   format("hudi").load(basePath).show(false)
24/07/12 16:16:49 ERROR Executor: Exception in task 0.0 in stage 127.0 (TID 227)
java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/paranamer/ParanamerAnnotationIntrospector
  at org.apache.hudi.cdc.InternalRowToJsonStringConverter.mapper$lzycompute(InternalRowToJsonStringConverter.scala:36)
  at org.apache.hudi.cdc.InternalRowToJsonStringConverter.mapper(InternalRowToJsonStringConverter.scala:32)
  at org.apache.hudi.cdc.InternalRowToJsonStringConverter.convert(InternalRowToJsonStringConverter.scala:50)
  at org.apache.hudi.cdc.CDCFileGroupIterator.convertRowToJsonString(CDCFileGroupIterator.scala:515)
  at org.apache.hudi.cdc.CDCFileGroupIterator.loadNext(CDCFileGroupIterator.scala:250)
  at org.apache.hudi.cdc.CDCFileGroupIterator.hasNextInternal(CDCFileGroupIterator.scala:218)
  at org.apache.hudi.cdc.CDCFileGroupIterator.hasNext(CDCFileGroupIterator.scala:239)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:131)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  ... 25 more
{code}
[jira] [Updated] (HUDI-7970) Add support to read partition fields when partition type is also stored in table config
[ https://issues.apache.org/jira/browse/HUDI-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7970:
------------------------------
    Summary: Add support to read partition fields when partition type is also stored in table config  (was: Add support to read partition fields when partition type is stored in table config)

> Add support to read partition fields when partition type is also stored in table config
> ---------------------------------------------------------------------------------------
>
>          Key: HUDI-7970
>          URL: https://issues.apache.org/jira/browse/HUDI-7970
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Lokesh Jain
>     Assignee: Lokesh Jain
>     Priority: Major
>
> In HUDI-7902, we will modify the config value `hoodie.table.partition.fields` to also store the partition type. This PR aims to make sure that the getter and other functions accessing this field remain consistent in behaviour with the new value type.
[jira] [Updated] (HUDI-7970) Add support to read partition fields when partition type is stored in table config
[ https://issues.apache.org/jira/browse/HUDI-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7970:
------------------------------
    Description: In HUDI-7902, we will modify the config value `hoodie.table.partition.fields` to also store the partition type. This PR aims to make sure that the getter and other functions accessing this field remain consistent in behaviour with the new value type.

> Add support to read partition fields when partition type is stored in table config
> ----------------------------------------------------------------------------------
>
>          Key: HUDI-7970
>          URL: https://issues.apache.org/jira/browse/HUDI-7970
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: Lokesh Jain
>     Assignee: Lokesh Jain
>     Priority: Major
[jira] [Created] (HUDI-7970) Add support to read partition fields when partition type is stored in table config
Lokesh Jain created HUDI-7970:
------------------------------

       Summary: Add support to read partition fields when partition type is stored in table config
           Key: HUDI-7970
           URL: https://issues.apache.org/jira/browse/HUDI-7970
       Project: Apache Hudi
    Issue Type: Sub-task
      Reporter: Lokesh Jain
      Assignee: Lokesh Jain
[jira] [Commented] (HUDI-7902) Add a new table property to store partition field types for custom key generator
[ https://issues.apache.org/jira/browse/HUDI-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863211#comment-17863211 ]

Lokesh Jain commented on HUDI-7902:
-----------------------------------

The approach is to reuse the key `hoodie.table.partition.fields` to also store the key type for the custom key generator. We will break the change into multiple parts:

1. Reader-side change - Ensure the reader can read the table in a backward-compatible manner based on the table version. If the table version is older, it should fall back to the older way of interpreting the key; otherwise it should use the new one. This also ensures that other users of the key are handled properly.
2. Writer-side change - Ensure that the new values are reflected in the table config and updated in hoodie.properties as part of upgrade handling. We will also need to handle downgrade for this change.
3. Update the file index so that _partitionSchemaFromProperties handles the custom key generator and assigns string type to timestamp-based partition types.

> Add a new table property to store partition field types for custom key generator
> --------------------------------------------------------------------------------
>
>          Key: HUDI-7902
>          URL: https://issues.apache.org/jira/browse/HUDI-7902
>      Project: Apache Hudi
>   Issue Type: Improvement
>     Reporter: Ethan Guo
>     Assignee: Lokesh Jain
>     Priority: Major
>      Fix For: 1.0.0
>
> As of today, writing a partitioned table with CustomKeyGenerator requires write config of the partition field types (they are not stored as a table property). CustomKeyGenerator requires partition keys to be in the format of {{field:type}} (e.g. {{column1:SIMPLE}}). However, only the field names are stored in the {{hoodie.properties}} file. We need to store the field types too so that, without the write config, the writer can configure the correct partition field name and type in Spark datasource and SQL DML writes.
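The {{field:type}} format described above can be illustrated with a small sketch. This is not Hudi's actual key generator code; the class and method names are hypothetical, and it only shows how partition specs like "column1:SIMPLE,ts:TIMESTAMP" split into (field, type) pairs, defaulting to SIMPLE when no type suffix is given:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: parse a CustomKeyGenerator-style partition spec of the
// form "field:type" (e.g. "column1:SIMPLE") into (field, type) pairs.
public class PartitionFieldParser {
  public static List<Map.Entry<String, String>> parse(String partitionFields) {
    List<Map.Entry<String, String>> result = new ArrayList<>();
    for (String spec : partitionFields.split(",")) {
      String[] parts = spec.trim().split(":");
      String field = parts[0];
      // Backward compatibility: a bare field name carries no type suffix.
      String type = parts.length > 1 ? parts[1] : "SIMPLE";
      result.add(new SimpleEntry<>(field, type));
    }
    return result;
  }
}
```

Storing the full "field:type" value in hoodie.properties would let such a parser recover the type without any write config, which is the point of this jira.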
[jira] [Created] (HUDI-7956) Handle upgrade downgrade with cluster action type for pending clustering instants
Lokesh Jain created HUDI-7956:
------------------------------

       Summary: Handle upgrade downgrade with cluster action type for pending clustering instants
           Key: HUDI-7956
           URL: https://issues.apache.org/jira/browse/HUDI-7956
       Project: Apache Hudi
    Issue Type: Sub-task
      Reporter: Lokesh Jain
      Assignee: Lokesh Jain

https://issues.apache.org/jira/browse/HUDI-7905 adds a new cluster action type for all clustering pending instants. The completed instant still uses the replacecommit completed action. This jira aims to handle upgrade and downgrade of existing tables with this change.
[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7905:
------------------------------
    Status: Patch Available  (was: In Progress)

> Use cluster action for clustering pending instants
> --------------------------------------------------
>
>          Key: HUDI-7905
>          URL: https://issues.apache.org/jira/browse/HUDI-7905
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 1.0.0
>
> Currently, we use replacecommit for clustering, insert overwrite and delete partition. Clustering should be a separate action for the requested and inflight instants. This simplifies a few things: for example, we no longer need to scan replacecommit.requested to determine whether we are looking at a clustering plan. This would simplify the usage of pending clustering related APIs.
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7779:
------------------------------
    Status: Patch Available  (was: In Progress)

> Guarding archival to not archive unintended commits
> ---------------------------------------------------
>
>          Key: HUDI-7779
>          URL: https://issues.apache.org/jira/browse/HUDI-7779
>      Project: Apache Hudi
>   Issue Type: Bug
>   Components: archiving
>     Reporter: sivabalan narayanan
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 0.16.0, 1.0.0
>
> Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not make such unintended archival.
>
> The major gap we want to guard against is: if someone disabled the cleaner, archival should account for data consistency issues and bail out.
>
> We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for.
>
> a. Keeping aside replace commits, let's dive into the specifics for regular commits and delta commits.
> Say the user configured clean commits to 4 and archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. Archival will certainly be guarded until the earliest commit to retain based on the latest clean commits.
> Corner case to consider: a savepoint was added at, say, t3 and later removed, and the cleaner was never re-enabled. Even though archival would have stopped at t3 (while the savepoint was present), once the savepoint is removed, if archival is executed, it could archive commit t3 - meaning the file versions tracked at t3 have still not been cleaned by the cleaner.
> Reasoning: we are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end user. But time travel queries are not intended for already cleaned up commits, so this is not an issue. None of snapshot, time travel, or incremental queries will run into issues, as they are not supposed to poll for t3. At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit. Just for the interim period, some older file versions might still be exposed to readers.
>
> b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, the cleaner is expected to clean them up fully before that metadata is archived. But are there chances this could go wrong?
> Corner case to consider: let's add onto the above scenario, where t3 has a savepoint and t4 is a replace commit which replaced file groups tracked in t3. The cleaner will skip cleaning up files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5 and t6. So the earliest commit to retain will point to t6. Now say the savepoint for t3 is removed but the cleaner stays disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.
>
> In other words, summarizing the different scenarios:
> i. Replaced file group is never cleaned up - ECTR (earliest commit to retain) is less than this.rc, and we are good.
> ii. Replaced file group is cleaned up - ECTR is > this.rc, and it is good to archive.
> iii. Tricky: ECTR moved ahead of this.rc, but due to a savepoint, full cleanup did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we don't account for as of now.
>
> We have 3 options to solve this.
> Option A: let the savepoint deletion flow take care of cleaning up the files it is tracking.
> Cons: a savepoint's responsibility is not removing data files, so from a single-responsibility standpoint this may not be right. Also, this cleanup might need to do what a clean planner would actually do, i.e. build the file system view, understand whether a file is already supposed to be cleaned up, and only then clean up the files which qualify. For example, if a file group has only one file slice, it should not be cleaned up, and scenarios like this.
>
> Option B: since archival is the one which might cause data consistency issues,
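The three scenarios summarized above (i-iii) reduce to a single guard condition. The following is an illustrative sketch only, not Hudi's actual archival code: names are hypothetical, and instant times are compared lexicographically the way Hudi timestamps sort. A replace commit is safe to archive only when the earliest commit to retain has moved past it AND no savepoint prevented its replaced file groups from being fully cleaned:

```java
// Hypothetical guard: should archival be allowed to archive a replace commit?
public class ArchivalGuard {
  public static boolean safeToArchiveReplaceCommit(String replaceCommitTs,
                                                   String earliestCommitToRetainTs,
                                                   boolean cleanupBlockedBySavepoint) {
    // Scenarios i/ii: ECTR must have moved strictly past the replace commit.
    boolean ectrPassed = earliestCommitToRetainTs.compareTo(replaceCommitTs) > 0;
    // Scenario iii: ECTR moved ahead, but a (since-removed) savepoint blocked
    // full cleanup of the replaced file groups - archiving now would expose
    // both old and new file groups as valid, duplicating data.
    return ectrPassed && !cleanupBlockedBySavepoint;
  }
}
```

The hard part the jira describes is computing the third argument reliably: once the savepoint is deleted, the timeline alone no longer records that cleanup of the replaced file groups was skipped.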
[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7905:
------------------------------
    Epic Link: HUDI-7856

> Use cluster action for clustering pending instants
> --------------------------------------------------
>
>          Key: HUDI-7905
>          URL: https://issues.apache.org/jira/browse/HUDI-7905
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 1.0.0
[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7905:
------------------------------
    Description: Currently, we use replacecommit for clustering, insert overwrite and delete partition. Clustering should be a separate action for the requested and inflight instants. This simplifies a few things such as we do not need to scan replacecommit.requested to determine whether we are looking at a clustering plan or not. This would simplify the usage of pending clustering related APIs.  (was: Currently, we use replacecommit for clustering, insert overwrite and delete partition. Clustering should be a separate action. This simplifies a few things such as we do not need to scan replacecommit.requested to determine whether we are looking at a clustering plan or not. This also standardizes the usage of replacecommit to some extent (related to HUDI-1739).)

> Use cluster action for clustering pending instants
> --------------------------------------------------
>
>          Key: HUDI-7905
>          URL: https://issues.apache.org/jira/browse/HUDI-7905
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>      Fix For: 1.0.0
[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7905:
------------------------------
    Summary: Use cluster action for clustering pending instants  (was: New Action for Clustering)

> Use cluster action for clustering pending instants
> --------------------------------------------------
>
>          Key: HUDI-7905
>          URL: https://issues.apache.org/jira/browse/HUDI-7905
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>      Fix For: 1.0.0
[jira] [Closed] (HUDI-7877) Add record position to record index metadata payload
[ https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain closed HUDI-7877.
-----------------------------
    Resolution: Fixed

> Add record position to record index metadata payload
> ----------------------------------------------------
>
>          Key: HUDI-7877
>          URL: https://issues.apache.org/jira/browse/HUDI-7877
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 1.0.0
>
> RLI should save the record position so that it can be used in the index lookup.
[jira] [Created] (HUDI-7939) Validate file slices upto a commit in HoodieTableMetadataValidator
Lokesh Jain created HUDI-7939:
------------------------------

       Summary: Validate file slices up to a commit in HoodieTableMetadataValidator
           Key: HUDI-7939
           URL: https://issues.apache.org/jira/browse/HUDI-7939
       Project: Apache Hudi
    Issue Type: Bug
    Components: metadata
      Reporter: Lokesh Jain

Currently, HoodieTableMetadataValidator validates all the available file slices and compares the metadata table fs view against the filesystem-based fs view. This jira aims to use the last completed instant and query both views as of that instant for comparison.
[jira] [Assigned] (HUDI-7939) Validate file slices upto a commit in HoodieTableMetadataValidator
[ https://issues.apache.org/jira/browse/HUDI-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain reassigned HUDI-7939:
---------------------------------
    Assignee: Lokesh Jain

> Validate file slices upto a commit in HoodieTableMetadataValidator
> ------------------------------------------------------------------
>
>          Key: HUDI-7939
>          URL: https://issues.apache.org/jira/browse/HUDI-7939
>      Project: Apache Hudi
>   Issue Type: Bug
>   Components: metadata
>     Reporter: Lokesh Jain
>     Assignee: Lokesh Jain
>     Priority: Major
[jira] [Assigned] (HUDI-7931) Initialize timeline for data table meta client while initializing HoodieBackedTableMetadata
[ https://issues.apache.org/jira/browse/HUDI-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain reassigned HUDI-7931:
---------------------------------
    Assignee: Lokesh Jain

> Initialize timeline for data table meta client while initializing HoodieBackedTableMetadata
> -------------------------------------------------------------------------------------------
>
>          Key: HUDI-7931
>          URL: https://issues.apache.org/jira/browse/HUDI-7931
>      Project: Apache Hudi
>   Issue Type: Bug
>   Components: metadata
>     Reporter: Lokesh Jain
>     Assignee: Lokesh Jain
>     Priority: Major
>
> Currently, while initializing HoodieBackedTableMetadata, the data table timeline is not initialized by default, whereas the metadataMetaClient timeline is initialized while creating the metadataFileSystemView. In this jira we aim to initialize the timeline for dataMetaClient to ensure that, while creating the log record scanner, the dataMetaClient timeline is in sync with the metadataMetaClient timeline.
[jira] [Created] (HUDI-7931) Initialize timeline for data table meta client while initializing HoodieBackedTableMetadata
Lokesh Jain created HUDI-7931:
------------------------------

       Summary: Initialize timeline for data table meta client while initializing HoodieBackedTableMetadata
           Key: HUDI-7931
           URL: https://issues.apache.org/jira/browse/HUDI-7931
       Project: Apache Hudi
    Issue Type: Bug
    Components: metadata
      Reporter: Lokesh Jain

Currently, while initializing HoodieBackedTableMetadata, the data table timeline is not initialized by default, whereas the metadataMetaClient timeline is initialized while creating the metadataFileSystemView. In this jira we aim to initialize the timeline for dataMetaClient to ensure that, while creating the log record scanner, the dataMetaClient timeline is in sync with the metadataMetaClient timeline.
[jira] [Resolved] (HUDI-7395) Fix computation for metrics in HoodieMetadataMetrics
[ https://issues.apache.org/jira/browse/HUDI-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain resolved HUDI-7395.
-------------------------------

> Fix computation for metrics in HoodieMetadataMetrics
> ----------------------------------------------------
>
>          Key: HUDI-7395
>          URL: https://issues.apache.org/jira/browse/HUDI-7395
>      Project: Apache Hudi
>   Issue Type: Bug
>   Components: metadata, metrics
>     Reporter: Lokesh Jain
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 0.16.0, 1.0.0
>
> For some of the metric types, like duration, we are using incrementMetric instead of setMetric. Also, some redundant metrics are removed; for example, a count-type metric has both count and duration metrics getting pushed even though the duration is not calculated. A file lookup count metric is added for bloom filter and column stats.
[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload
[ https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7877:
------------------------------
    Status: Patch Available  (was: In Progress)

> Add record position to record index metadata payload
> ----------------------------------------------------
>
>          Key: HUDI-7877
>          URL: https://issues.apache.org/jira/browse/HUDI-7877
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 1.0.0
[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping
[ https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7841:
------------------------------
    Status: Patch Available  (was: In Progress)

> RLI and secondary index should consider only pruned partitions for file skipping
> --------------------------------------------------------------------------------
>
>          Key: HUDI-7841
>          URL: https://issues.apache.org/jira/browse/HUDI-7841
>      Project: Apache Hudi
>   Issue Type: Improvement
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 1.0.0
>
> Even though RLI scans only matching files, it tries to get those candidate files by iterating over all files from the file index. See:
> https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47
> Instead, it can use `prunedPartitionsAndFileSlices` to only consider pruned partitions whenever there is a partition predicate.
[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload
[ https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7877:
------------------------------
    Status: In Progress  (was: Open)

> Add record position to record index metadata payload
> ----------------------------------------------------
>
>          Key: HUDI-7877
>          URL: https://issues.apache.org/jira/browse/HUDI-7877
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 1.0.0
[jira] [Updated] (HUDI-7905) New Action for Clustering
[ https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7905:
------------------------------
    Status: In Progress  (was: Open)

> New Action for Clustering
> -------------------------
>
>          Key: HUDI-7905
>          URL: https://issues.apache.org/jira/browse/HUDI-7905
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>      Fix For: 1.0.0
[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping
[ https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7841:
------------------------------
    Status: In Progress  (was: Open)

> RLI and secondary index should consider only pruned partitions for file skipping
> --------------------------------------------------------------------------------
>
>          Key: HUDI-7841
>          URL: https://issues.apache.org/jira/browse/HUDI-7841
>      Project: Apache Hudi
>   Issue Type: Improvement
>     Reporter: Sagar Sumit
>     Assignee: Lokesh Jain
>     Priority: Major
>       Labels: pull-request-available
>      Fix For: 1.0.0
[jira] [Updated] (HUDI-7719) Introduce capability to specify config value as a time duration
[ https://issues.apache.org/jira/browse/HUDI-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7719:
------------------------------
    Summary: Introduce capability to specify config value as a time duration  (was: Introduce capability to specify ConfigProperty as a time duration)

> Introduce capability to specify config value as a time duration
> ---------------------------------------------------------------
>
>          Key: HUDI-7719
>          URL: https://issues.apache.org/jira/browse/HUDI-7719
>      Project: Apache Hudi
>   Issue Type: Bug
>     Reporter: Lokesh Jain
>     Priority: Major
>
> Currently, config values are specified in seconds or a fixed chrono unit. We should also support specifying these config values as a time duration, like 5m or 60s, for ease of use.
[jira] [Created] (HUDI-7719) Introduce capability to specify ConfigProperty as a time duration
Lokesh Jain created HUDI-7719:
------------------------------

       Summary: Introduce capability to specify ConfigProperty as a time duration
           Key: HUDI-7719
           URL: https://issues.apache.org/jira/browse/HUDI-7719
       Project: Apache Hudi
    Issue Type: Bug
      Reporter: Lokesh Jain

Currently, config values are specified in seconds or a fixed chrono unit. We should also support specifying these config values as a time duration, like 5m or 60s, for ease of use.
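The capability proposed above can be sketched as a small parser. This is an illustrative sketch, not Hudi's ConfigProperty API; the class and method names are hypothetical. It accepts suffixed values like "5m" or "60s", normalizes them to seconds, and keeps plain numbers working for backward compatibility:

```java
// Hypothetical sketch: parse a duration-style config value ("5m", "60s",
// "2h", "1500ms", or a bare number meaning seconds) into seconds.
public class DurationConfigParser {
  public static long parseToSeconds(String value) {
    String v = value.trim().toLowerCase();
    if (v.endsWith("ms")) {            // check "ms" before the bare "s" suffix
      return Long.parseLong(v.substring(0, v.length() - 2)) / 1000;
    } else if (v.endsWith("s")) {
      return Long.parseLong(v.substring(0, v.length() - 1));
    } else if (v.endsWith("m")) {
      return Long.parseLong(v.substring(0, v.length() - 1)) * 60;
    } else if (v.endsWith("h")) {
      return Long.parseLong(v.substring(0, v.length() - 1)) * 3600;
    }
    // Backward compatibility: plain numbers are treated as seconds.
    return Long.parseLong(v);
  }
}
```

In a real ConfigProperty integration the target unit would come from the property's declared chrono unit rather than being hard-coded to seconds.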
[jira] [Created] (HUDI-7571) Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode
Lokesh Jain created HUDI-7571:
------------------------------

       Summary: Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode
           Key: HUDI-7571
           URL: https://issues.apache.org/jira/browse/HUDI-7571
       Project: Apache Hudi
    Issue Type: Bug
      Reporter: Lokesh Jain
      Assignee: Lokesh Jain

When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failures and continues the validation. This jira aims to add an API to get the list of exceptions and an API to check whether a validation exception was thrown.
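The API shape proposed above might look like the following. This is an illustrative sketch only, not the actual HoodieMetadataTableValidator; the class and method names are hypothetical. With ignoreFailed enabled, failures are collected instead of rethrown, and callers can inspect them after the run:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of an exception-collecting validator.
public class ValidatorSketch {
  private final boolean ignoreFailed;
  private final List<Exception> failures = new ArrayList<>();

  public ValidatorSketch(boolean ignoreFailed) {
    this.ignoreFailed = ignoreFailed;
  }

  // Run one validation check; on failure, either rethrow or record it.
  public void runValidation(Runnable check) {
    try {
      check.run();
    } catch (RuntimeException e) {
      if (!ignoreFailed) {
        throw e;
      }
      failures.add(e);  // remember the failure and keep validating
    }
  }

  // Proposed APIs: list the collected exceptions, and a quick boolean check.
  public List<Exception> getThrowables() {
    return Collections.unmodifiableList(failures);
  }

  public boolean hasValidationFailure() {
    return !failures.isEmpty();
  }
}
```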
[jira] [Updated] (HUDI-7524) Ensure existing hoodie.properties are not overwritten with HoodieTableMetaClient creation
[ https://issues.apache.org/jira/browse/HUDI-7524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-7524:
------------------------------
    Description: org.apache.hudi.common.table.HoodieTableMetaClient#initTableAndGetMetaClient can overwrite an existing `hoodie.properties` today. This jira ensures that an error is thrown if the file already exists.

> Ensure existing hoodie.properties are not overwritten with HoodieTableMetaClient creation
> -----------------------------------------------------------------------------------------
>
>          Key: HUDI-7524
>          URL: https://issues.apache.org/jira/browse/HUDI-7524
>      Project: Apache Hudi
>   Issue Type: Bug
>     Reporter: Lokesh Jain
>     Assignee: Lokesh Jain
>     Priority: Major
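The guard described above amounts to an existence check before initialization. The following is a simplified sketch, not Hudi's actual meta client code: it uses java.nio instead of Hudi's storage abstraction, and the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: refuse to re-initialize a table whose
// hoodie.properties already exists, instead of silently overwriting it.
public class TableInitGuard {
  public static void initTableConfig(Path metaDir) throws IOException {
    Path props = metaDir.resolve("hoodie.properties");
    if (Files.exists(props)) {
      // The fix proposed in this jira: fail loudly rather than overwrite.
      throw new IllegalStateException("hoodie.properties already exists at " + props);
    }
    Files.createDirectories(metaDir);
    Files.createFile(props);
  }
}
```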
[jira] [Created] (HUDI-7524) Ensure existing hoodie.properties are not overwritten with HoodieTableMetaClient creation
Lokesh Jain created HUDI-7524: - Summary: Ensure existing hoodie.properties are not overwritten with HoodieTableMetaClient creation Key: HUDI-7524 URL: https://issues.apache.org/jira/browse/HUDI-7524 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain -- This message was sent by Atlassian Jira (v8.20.10#820010)
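The guard that HUDI-7524 above calls for amounts to an exists-check before writing `hoodie.properties`. A minimal sketch using plain java.nio rather than Hudi's storage abstraction; the directory layout, property contents, and error message here are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only: refuse to re-initialize a table directory whose
// hoodie.properties already exists, instead of silently overwriting it.
public class InitGuardSketch {
    static void initTable(Path tableDir) throws IOException {
        Path props = tableDir.resolve(".hoodie").resolve("hoodie.properties");
        if (Files.exists(props)) {
            throw new IllegalStateException("Table already initialized: " + props);
        }
        Files.createDirectories(props.getParent());
        Files.write(props, "hoodie.table.name=demo\n".getBytes());
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hudi-demo");
        initTable(dir);          // first init succeeds
        try {
            initTable(dir);      // second init must fail, not overwrite
        } catch (IllegalStateException e) {
            System.out.println("refused overwrite");
        }
    }
}
```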
[jira] [Updated] (HUDI-7395) Fix computation for metrics in HoodieMetadataMetrics
[ https://issues.apache.org/jira/browse/HUDI-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7395: -- Summary: Fix computation for metrics in HoodieMetadataMetrics (was: Fix naming and computation for metrics in HoodieMetadataMetrics) > Fix computation for metrics in HoodieMetadataMetrics > > > Key: HUDI-7395 > URL: https://issues.apache.org/jira/browse/HUDI-7395 > Project: Apache Hudi > Issue Type: Bug > Components: metadata, metrics >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > > For some metric types, like duration, we are using incrementMetric > instead of setMetric. > Also, some redundant metrics are removed. For example, a count-type > metric has both count and duration metrics pushed even though the duration > is not calculated. > A file lookup count metric is added for bloom filter and column stats. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7395) Fix naming and computation for metrics in HoodieMetadataMetrics
Lokesh Jain created HUDI-7395: - Summary: Fix naming and computation for metrics in HoodieMetadataMetrics Key: HUDI-7395 URL: https://issues.apache.org/jira/browse/HUDI-7395 Project: Apache Hudi Issue Type: Bug Components: metadata, metrics Reporter: Lokesh Jain Assignee: Lokesh Jain For some metric types, like duration, we are using incrementMetric instead of setMetric. Also, some redundant metrics are removed. For example, a count-type metric has both count and duration metrics pushed even though the duration is not calculated. A file lookup count metric is added for bloom filter and column stats. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7391) HoodieMetadataMetrics should use Metrics instance for metrics registry
[ https://issues.apache.org/jira/browse/HUDI-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7391: -- Description: Currently HoodieMetadataMetrics stores metrics in memory and these metrics are not pushed by the metric reporters. The metric reporters are configured within the Metrics instance. List of changes in the PR: 1. Metrics-related classes have been moved from hudi-client-common to hudi-common. 2. HoodieMetadataMetrics now uses the Metrics class so that all the reporters can be supported with it. 3. Some gaps in configs are addressed in HoodieMetadataWriteUtils. 4. Some metrics-related APIs and functionality have been moved to HoodieMetricsConfig. The HoodieWriteConfig APIs now delegate to HoodieMetricsConfig for the functionality. was: Currently HoodieMetadataMetrics stores metrics in memory and these metrics are not pushed by the metric reporters. The metric reporters are configured within the Metrics instance. List of changes in the PR: 1. Metrics-related classes have been moved from hudi-client-common to hudi-common. 2. HoodieMetadataMetrics now uses the Metrics class so that all the reporters can be supported with it. 3. Some gaps in configs are addressed in HoodieMetadataWriteUtils. > HoodieMetadataMetrics should use Metrics instance for metrics registry > -- > > Key: HUDI-7391 > URL: https://issues.apache.org/jira/browse/HUDI-7391 > Project: Apache Hudi > Issue Type: Bug > Components: metadata, metrics >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > > Currently HoodieMetadataMetrics stores metrics in memory and these metrics > are not pushed by the metric reporters. The metric reporters are configured > within the Metrics instance. List of changes in the PR: > 1. Metrics-related classes have been moved from hudi-client-common to > hudi-common. > 2. HoodieMetadataMetrics now uses the Metrics class so that all the reporters can > be supported with it. > 3. Some gaps in configs are addressed in HoodieMetadataWriteUtils. > 4. Some metrics-related APIs and functionality have been moved to > HoodieMetricsConfig. The HoodieWriteConfig APIs now delegate to > HoodieMetricsConfig for the functionality. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7391) HoodieMetadataMetrics should use Metrics instance for metrics registry
Lokesh Jain created HUDI-7391: - Summary: HoodieMetadataMetrics should use Metrics instance for metrics registry Key: HUDI-7391 URL: https://issues.apache.org/jira/browse/HUDI-7391 Project: Apache Hudi Issue Type: Bug Components: metadata, metrics Reporter: Lokesh Jain Assignee: Lokesh Jain Currently HoodieMetadataMetrics stores metrics in memory and these metrics are not pushed by the metric reporters. The metric reporters are configured within the Metrics instance. List of changes in the PR: 1. Metrics-related classes have been moved from hudi-client-common to hudi-common. 2. HoodieMetadataMetrics now uses the Metrics class so that all the reporters can be supported with it. 3. Some gaps in configs are addressed in HoodieMetadataWriteUtils. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-7120) Performance improvements in deltastreamer executor code path
[ https://issues.apache.org/jira/browse/HUDI-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain resolved HUDI-7120. --- > Performance improvements in deltastreamer executor code path > > > Key: HUDI-7120 > URL: https://issues.apache.org/jira/browse/HUDI-7120 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Makes improvements based on findings from CPU profiling of the executor code > path. > 1. Fixes repetitive execution of a string split operation. > 2. Reduces the number of validation calls. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7120) Performance improvements in deltastreamer executor code path
[ https://issues.apache.org/jira/browse/HUDI-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-7120: -- Fix Version/s: 0.14.1 > Performance improvements in deltastreamer executor code path > > > Key: HUDI-7120 > URL: https://issues.apache.org/jira/browse/HUDI-7120 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > Makes improvements based on findings from CPU profiling of the executor code > path. > 1. Fixes repetitive execution of a string split operation. > 2. Reduces the number of validation calls. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7120) Performance improvements in deltastreamer executor code path
Lokesh Jain created HUDI-7120: - Summary: Performance improvements in deltastreamer executor code path Key: HUDI-7120 URL: https://issues.apache.org/jira/browse/HUDI-7120 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Makes improvements based on findings from CPU profiling of the executor code path. 1. Fixes repetitive execution of a string split operation. 2. Reduces the number of validation calls. -- This message was sent by Atlassian Jira (v8.20.10#820010)
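The "repetitive string split" item in HUDI-7120 above reflects a common hot-path fix: compute the split once and reuse it. A hedged illustration of the pattern — the class and method names here are invented for illustration, not the actual Hudi change:

```java
import java.util.HashMap;
import java.util.Map;

// Invented names, illustrative only: cache the result of splitting a field
// spec so per-record code does not re-split the same string for every record.
public class SplitCacheSketch {
    private static final Map<String, String[]> CACHE = new HashMap<>();

    static String[] splitFields(String spec) {
        return CACHE.computeIfAbsent(spec, s -> s.split(","));
    }

    public static void main(String[] args) {
        String[] a = splitFields("uuid,ts,partition");
        String[] b = splitFields("uuid,ts,partition"); // served from the cache
        System.out.println(a == b);   // same cached array instance
        System.out.println(a.length);
    }
}
```

In an executor loop, hoisting or caching the split like this removes a per-record allocation and regex evaluation.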
[jira] [Assigned] (HUDI-6896) HoodieAvroHFileReader.RecordIterator iteration never terminates
[ https://issues.apache.org/jira/browse/HUDI-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain reassigned HUDI-6896: - Assignee: Lokesh Jain > HoodieAvroHFileReader.RecordIterator iteration never terminates > --- > > Key: HUDI-6896 > URL: https://issues.apache.org/jira/browse/HUDI-6896 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > > org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses > org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first > line of the file. > {code:java} > if (!scanner.isSeeked()) { > hasRecords = scanner.seekTo(); > } > {code} > If isSeeked returns false, the scanner seeks to the start of the file. > After the end of the file is reached, isSeeked still returns false, and the next > time hasNext is called it seeks to the start of the file again, leading to an infinite > loop. > Documentation for HFileScanner#isSeeked: > True if the scanner has had one of the seek calls invoked; i.e. seekBefore(Cell) > or seekTo() or seekTo(Cell). Otherwise returns false. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6896) HoodieAvroHFileReader.RecordIterator iteration never terminates
[ https://issues.apache.org/jira/browse/HUDI-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6896: -- Description: org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first line of the file. {code:java} if (!scanner.isSeeked()) { hasRecords = scanner.seekTo(); } {code} If isSeeked returns false, the scanner seeks to the start of the file. After the end of the file is reached, isSeeked still returns false, and the next time hasNext is called it seeks to the start of the file again, leading to an infinite loop. Documentation for HFileScanner#isSeeked: True if the scanner has had one of the seek calls invoked; i.e. seekBefore(Cell) or seekTo() or seekTo(Cell). Otherwise returns false. was:org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first line of the file. > HoodieAvroHFileReader.RecordIterator iteration never terminates > --- > > Key: HUDI-6896 > URL: https://issues.apache.org/jira/browse/HUDI-6896 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > > org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses > org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first > line of the file. > {code:java} > if (!scanner.isSeeked()) { > hasRecords = scanner.seekTo(); > } > {code} > If isSeeked returns false, the scanner seeks to the start of the file. > After the end of the file is reached, isSeeked still returns false, and the next > time hasNext is called it seeks to the start of the file again, leading to an infinite > loop. > Documentation for HFileScanner#isSeeked: > True if the scanner has had one of the seek calls invoked; i.e. seekBefore(Cell) > or seekTo() or seekTo(Cell). Otherwise returns false. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6896) HoodieAvroHFileReader.RecordIterator iteration never terminates
Lokesh Jain created HUDI-6896: - Summary: HoodieAvroHFileReader.RecordIterator iteration never terminates Key: HUDI-6896 URL: https://issues.apache.org/jira/browse/HUDI-6896 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first line of the file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
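One way to make the iteration described in HUDI-6896 above terminate is to latch end-of-file explicitly instead of consulting isSeeked again. A self-contained sketch with a fake scanner standing in for HFileScanner — hypothetical, not the actual Hudi patch:

```java
import java.util.NoSuchElementException;

// Hypothetical sketch (not the actual Hudi fix): once seekTo()/next() report
// no more records, latch an eof flag so hasNext() never seeks back to the
// start of the file, which is what caused the infinite loop.
public class RecordIteratorSketch {
    // Minimal stand-in for HFileScanner: positioned records 0..size-1.
    static class FakeScanner {
        private final int size;
        private int pos = -1;
        FakeScanner(int size) { this.size = size; }
        boolean seekTo() { pos = 0; return size > 0; } // seek to first record
        boolean next() { pos++; return pos < size; }
    }

    private final FakeScanner scanner;
    private boolean eof = false;
    private boolean seeked = false;
    private boolean hasCurrent = false;

    RecordIteratorSketch(FakeScanner scanner) { this.scanner = scanner; }

    public boolean hasNext() {
        if (eof) return false;               // the latch prevents re-seeking
        if (!seeked) {
            hasCurrent = scanner.seekTo();   // first call: seek to the start once
            seeked = true;
        }
        if (!hasCurrent) eof = true;
        return hasCurrent;
    }

    public int next() {
        if (!hasNext()) throw new NoSuchElementException();
        int v = scanner.pos;                 // current record position
        hasCurrent = scanner.next();
        if (!hasCurrent) eof = true;
        return v;
    }

    public static void main(String[] args) {
        RecordIteratorSketch it = new RecordIteratorSketch(new FakeScanner(3));
        int count = 0;
        while (it.hasNext()) { it.next(); count++; }
        System.out.println(count); // terminates after 3 records, no infinite loop
    }
}
```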
[jira] [Created] (HUDI-6890) Fail fast or auto determine record type during read
Lokesh Jain created HUDI-6890: - Summary: Fail fast or auto determine record type during read Key: HUDI-6890 URL: https://issues.apache.org/jira/browse/HUDI-6890 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain HUDI-6753 fixes some tests where the read was failing because the record merger implementation config was not set during read (HoodieWriteConfig.RECORD_MERGER_IMPLS.key() -> HoodieSparkRecordMerger.class.getName()). The failure occurred midway through the parquet read. Please check the mentioned jira for more details on the stack trace. This jira aims to make a change where either the read fails fast or the merger implementation is auto-determined during read. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6889) Improve usability of bulk insert with insert overwrite operations in Spark Datasource
Lokesh Jain created HUDI-6889: - Summary: Improve usability of bulk insert with insert overwrite operations in Spark Datasource Key: HUDI-6889 URL: https://issues.apache.org/jira/browse/HUDI-6889 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Currently, to use bulk insert with insert overwrite operations in Spark Datasource, the user has to set `hoodie.bulkinsert.overwrite.operation.type` to insert_overwrite_table (OVERWRITE) or insert_overwrite (APPEND) and use the corresponding save modes. Since this is an internal config, it should not be exposed. This jira aims to find an easier way to support the feature for users, through a new config or a different config altogether. One idea is to deprecate hoodie.spark.sql.insert.into.operation and create a new config which can be shared by both SQL and the datasource. -- This message was sent by Atlassian Jira (v8.20.10#820010)
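The workaround described in HUDI-6889 above pairs an internal config with a Spark save mode; sketched as write options, the pairing looks like this (config names and values are quoted from the description — this illustrates the current state, not a recommended public interface):

```properties
# Bulk insert combined with insert overwrite via Spark Datasource today:
hoodie.datasource.write.operation=bulk_insert
# paired with SaveMode.Overwrite -> insert_overwrite_table (overwrite the whole table)
hoodie.bulkinsert.overwrite.operation.type=insert_overwrite_table
# paired with SaveMode.Append -> insert_overwrite (overwrite matching partitions)
# hoodie.bulkinsert.overwrite.operation.type=insert_overwrite
```

The issue proposes hiding this pairing behind a single user-facing config instead.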
[jira] [Created] (HUDI-6887) Add test for Record Index and MIT queries
Lokesh Jain created HUDI-6887: - Summary: Add test for Record Index and MIT queries Key: HUDI-6887 URL: https://issues.apache.org/jira/browse/HUDI-6887 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain The jira creates tests for RLI and MIT queries, with checks validating that data skipping is occurring. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764274#comment-17764274 ] Lokesh Jain commented on HUDI-6820: --- After enabling more debug logs, created a few PRs: https://github.com/apache/hudi/actions/runs/6158416987/job/16711145162?pr=9690 . I can see that all methods annotated with @After complete before the timeout. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764273#comment-17764273 ] Lokesh Jain commented on HUDI-6820: --- Created PRs [https://github.com/apache/hudi/pull/9676] to 9678 after disabling TestHoodieCombineHiveInputFormat. Usually I see a timeout in hudi-java-client after disabling this test; otherwise we see timeouts in both hadoop-mr and java-client. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763849#comment-17763849 ] Lokesh Jain commented on HUDI-6820: --- Created [https://github.com/apache/hudi/pull/9683] after enabling debug logs for maven. Only the newly added github check would run here. Should ideally print out more logs. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763846#comment-17763846 ] Lokesh Jain commented on HUDI-6820: --- Created [https://github.com/apache/hudi/pull/9682] to add a 40-minute timeout for the hudi-hadoop-mr check. The GitHub check currently takes 6 hours to time out. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763838#comment-17763838 ] Lokesh Jain commented on HUDI-6820: --- Created PRs https://github.com/apache/hudi/pull/9676 to 9678 after disabling TestHoodieCombineHiveInputFormat. I was seeing flakiness in the GitHub check for the separated hudi-hadoop-mr and java-client modules. Out of 3 PRs, 2 passed but 1 timed out. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6833) Add field for tracking log files from failed commit in rollback metadata
[ https://issues.apache.org/jira/browse/HUDI-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain closed HUDI-6833. - Fix Version/s: 0.14.0 Resolution: Fixed > Add field for tracking log files from failed commit in rollback metadata > > > Key: HUDI-6833 > URL: https://issues.apache.org/jira/browse/HUDI-6833 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > The Jira aims to add a field for tracking log files from a failed commit in the > rollback metadata. The fix for using this field would be tracked in HUDI-6761. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763180#comment-17763180 ] Lokesh Jain commented on HUDI-6820: --- Also created PRs [https://github.com/apache/hudi/pull/9654/files] to 9657 for moving hudi-hadoop-mr and hudi-java-client to a separate job. Timeout errors are still seen in the UT FT jobs. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763174#comment-17763174 ] Lokesh Jain commented on HUDI-6820: --- Created a PR to move hudi-hadoop-mr and hudi-java-client to github action. [https://github.com/apache/hudi/pull/9661] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763175#comment-17763175 ] Lokesh Jain commented on HUDI-6820: --- Created a PR which disables deltastreamer continuous mode tests. [https://github.com/apache/hudi/pull/9662] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763171#comment-17763171 ] Lokesh Jain commented on HUDI-6820: --- PRs [https://github.com/apache/hudi/pull/9604] to 9607 had been run on the 0.13.0 branch. Out of a total of 8 PRs, 4 failed with a timeout and the other 4 were cancelled because of other critical PRs. The issue is reproducible with 0.13.0. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6833) Add field for tracking log files from failed commit in rollback metadata
[ https://issues.apache.org/jira/browse/HUDI-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6833: -- Description: The Jira aims to add a field for tracking log files from a failed commit in the rollback metadata. The fix for using this field would be tracked in HUDI-6761. was:The Jira aims to add a field for tracking log files from a failed commit in the rollback metadata. The fix for using this field would be tracked in HUDI-6758. > Add field for tracking log files from failed commit in rollback metadata > > > Key: HUDI-6833 > URL: https://issues.apache.org/jira/browse/HUDI-6833 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > > The Jira aims to add a field for tracking log files from a failed commit in the > rollback metadata. The fix for using this field would be tracked in HUDI-6761. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6833) Add field for tracking log files from failed commit in rollback metadata
Lokesh Jain created HUDI-6833: - Summary: Add field for tracking log files from failed commit in rollback metadata Key: HUDI-6833 URL: https://issues.apache.org/jira/browse/HUDI-6833 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain The Jira aims to add a field for tracking log files from a failed commit in the rollback metadata. The fix for using this field would be tracked in HUDI-6758. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762835#comment-17762835 ] Lokesh Jain edited comment on HUDI-6820 at 9/7/23 6:31 PM: --- First known occurrence of timeout issue (12th June) - [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17760=results] was (Author: ljain): First known occurrence of timeout issue (28th June) - [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=results] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762835#comment-17762835 ] Lokesh Jain commented on HUDI-6820: --- First known occurrence of timeout issue (28th June) - [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=results] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762734#comment-17762734 ] Lokesh Jain commented on HUDI-6820: --- The maven version is 3.8.8 and has not changed since July. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6830) Fix downgrade from version six for partially failed commits
Lokesh Jain created HUDI-6830: - Summary: Fix downgrade from version six for partially failed commits Key: HUDI-6830 URL: https://issues.apache.org/jira/browse/HUDI-6830 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain With the new version six, if the table has pending commits, then these commits should be rolled back during downgrade so that files created using the new format are cleaned up properly. The Jira aims to fix the downgrade handler to support this step. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762470#comment-17762470 ] Lokesh Jain commented on HUDI-6820: --- Created [https://github.com/apache/hudi/pull/9604] , PRs 9605-9609, [https://github.com/apache/hudi/pull/9632] and [https://github.com/apache/hudi/pull/9633] for testing Azure CI on 0.13.0 hudi branch. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762467#comment-17762467 ] Lokesh Jain commented on HUDI-6820: --- Created [https://github.com/apache/hudi/pull/9627] and [https://github.com/apache/hudi/pull/9628] with {{TestHoodieRealtimeRecordReader}} disabled. UT FT timed out after 4 hours in both of the PRs above. FT client/spark-client also timed out in the CI run [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19671=logs=7601efb9-4019-552e-11ba-eb31b66593b2=9688f101-287d-53f4-2a80-87202516f5d0] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762465#comment-17762465 ] Lokesh Jain commented on HUDI-6820: --- Found a Microsoft developer ticket for a similar issue where Azure CI times out. [https://developercommunity.visualstudio.com/t/errorthe-operation-was-canceled-azure-devops-build/692048?viewtype=all] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762465#comment-17762465 ] Lokesh Jain edited comment on HUDI-6820 at 9/6/23 5:29 PM: --- Found a microsoft developer ticket for similar issue where Azure CI is getting timeout. [https://developercommunity.visualstudio.com/t/errorthe-operation-was-canceled-azure-devops-build/692048?viewtype=all] was (Author: ljain): Found a microsoft developer ticket for similar issue where Azure CI is getting timeout. [https://developercommunity.visualstudio.com/t/errorthe-operation-was-canceled-azure-devops-build/692048?viewtype=all] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762464#comment-17762464 ] Lokesh Jain commented on HUDI-6820: --- Usually the error message is of the form: {code:java} ##[error]We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610 {code} Or of the form: {code:java} The job running on agent Azure Pipelines 7 ran longer than the maximum time of 150 minutes. {code} It could be an actual timeout where tests were running for 150 minutes. But in many cases, the issue we are seeing is that the raw logs show {code:java} 2023-07-28T06:01:14.7898970Z at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_372] 2023-07-28T06:01:14.7899483Z at org.apache.hudi.metrics.Metrics.shutdown(Metrics.java:116) ~[hudi-client-common-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT] 2023-07-28T06:01:14.7899862Z at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372] 2023-07-28T07:27:41.0195722Z ##[error]The operation was canceled. 2023-07-28T07:27:41.0242402Z ##[section]Finishing: FT client/spark-client {code} There is a gap of 1 hour or more between the last test run and the operation cancellation. > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762463#comment-17762463 ] Lokesh Jain edited comment on HUDI-6820 at 9/6/23 5:24 PM: --- Some older runs where similar issue can be seen: FT client/spark-client: [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18864=logs=7601efb9-4019-552e-11ba-eb31b66593b2|https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0] UT FT other modules [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0] was (Author: ljain): Some older runs where similar issue can be seen: FT client/spark-client: [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0] UT FT other modules [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0] > Fix Azure CI timeout for UT FT other modules > > > Key: HUDI-6820 > URL: https://issues.apache.org/jira/browse/HUDI-6820 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762462#comment-17762462 ] Lokesh Jain commented on HUDI-6820: --- For some runs, e.g. [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19384=logs=ba200224-5437-5e21-2643-114ac65587f4] (attempts 2 and 3), the timeout occurs after 87 tests have run. For some of the others, e.g. [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19500/logs/7], the timeout occurs after only 15 tests have run.
[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
[ https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762461#comment-17762461 ] Lokesh Jain commented on HUDI-6820: --- Siva had tried disabling tests, via the PRs below, to see if they were causing the timeouts: [https://github.com/apache/hudi/pull/9543] [https://github.com/apache/hudi/pull/9550/files] [https://github.com/apache/hudi/pull/9542] [https://github.com/apache/hudi/pull/9551] But the timeout issue was still visible.
[jira] [Created] (HUDI-6820) Fix Azure CI timeout for UT FT other modules
Lokesh Jain created HUDI-6820: - Summary: Fix Azure CI timeout for UT FT other modules Key: HUDI-6820 URL: https://issues.apache.org/jira/browse/HUDI-6820 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6753) Fix parquet inline reading flaky test
[ https://issues.apache.org/jira/browse/HUDI-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain reassigned HUDI-6753: - Assignee: Lokesh Jain > Fix parquet inline reading flaky test > - > > Key: HUDI-6753 > URL: https://issues.apache.org/jira/browse/HUDI-6753 > Project: Apache Hudi > Issue Type: Bug > Components: reader-core >Reporter: sivabalan narayanan >Assignee: Lokesh Jain >Priority: Major > > Sometimes we see some flakiness around parquet inline reading. > > Ref: > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19457/logs/8] > > > {code:java} > 2023-08-25T05:00:14.1359469Z 1389627 [Executor task launch worker for task > 1.0 in stage 4124.0 (TID 5621)] ERROR > org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Got > exception when reading log file > 2023-08-25T05:00:14.1360427Z org.apache.hudi.exception.HoodieException: > unable to read next record from parquet file > 2023-08-25T05:00:14.1361525Z at > org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1362403Z at > org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1363340Z at > org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1364854Z at > org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:625) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1365985Z at > org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:667) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1367473Z at > 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:362) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1368371Z at > org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1369127Z at > org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:201) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1369901Z at > org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:117) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1370633Z at > org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:76) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1371380Z at > org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:466) > ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1372312Z at > org.apache.hudi.LogFileIterator$.scanLog(Iterators.scala:371) > ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1372915Z at > org.apache.hudi.LogFileIterator.(Iterators.scala:110) > ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1373549Z at > org.apache.hudi.RecordMergingFileIterator.(Iterators.scala:201) > ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1374172Z at > org.apache.hudi.RecordMergingFileIterator.(Iterators.scala:212) > ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1374809Z at > org.apache.hudi.RecordMergingFileIterator.(Iterators.scala:217) > ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1375480Z at > 
org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:109) > ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT] > 2023-08-25T05:00:14.1376156Z at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > ~[spark-core_2.12-3.2.3.jar:3.2.3] > 2023-08-25T05:00:14.1376653Z at > org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > ~[spark-core_2.12-3.2.3.jar:3.2.3] > 2023-08-25T05:00:14.1377283Z at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > ~[spark-core_2.12-3.2.3.jar:3.2.3] > 2023-08-25T05:00:14.1377837Z at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > ~[spark-core_2.12-3.2.3.jar:3.2.3] > 2023-08-25T05:00:14.1378323Z at > org.apache.spark.rdd.RDD.iterator(RDD.scala:337) >
[jira] [Commented] (HUDI-6753) Fix parquet inline reading flaky test
[ https://issues.apache.org/jira/browse/HUDI-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761435#comment-17761435 ] Lokesh Jain commented on HUDI-6753: --- `org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport#init` creates a new parquet schema after converting the struct type fields (e.g. for DECIMAL(10,6)), but `org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport#init` does not do a similar conversion while reading, leading to a read error.
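For context on how the same DECIMAL(10,6) column can end up with two different Parquet physical types: writers typically store a decimal's unscaled value either in a primitive int (when the precision fits) or in a fixed-length byte array sized from the precision. The sketch below is a simplified model of that decision, patterned after Spark's write path but not the actual Spark code; the function names are ours:

```python
def min_bytes_for_decimal(precision: int) -> int:
    # Smallest n such that a signed n-byte two's-complement integer can hold
    # any unscaled value with `precision` decimal digits (Parquet's sizing
    # rule for FIXED_LEN_BYTE_ARRAY-backed decimals).
    n = 1
    while 10 ** precision - 1 > 2 ** (8 * n - 1) - 1:
        n += 1
    return n

def decimal_physical_type(precision: int, legacy: bool) -> str:
    # Simplified model of a writer's choice of physical type for DECIMAL:
    # standard mode packs small decimals into int32/int64, legacy mode
    # always uses a fixed-length byte array.
    if not legacy:
        if precision <= 9:
            return "int32"
        if precision <= 18:
            return "int64"
    return f"fixed_len_byte_array({min_bytes_for_decimal(precision)})"

# DECIMAL(10,6), as seen in the two schemas attached to this issue:
print(decimal_physical_type(10, legacy=False))  # int64
print(decimal_physical_type(10, legacy=True))   # fixed_len_byte_array(5)
```

Under this model, the file was written with the int64 representation while the read path reconstructed the fixed-length representation, which matches the schema mismatch in the logs.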
[jira] [Commented] (HUDI-6753) Fix parquet inline reading flaky test
[ https://issues.apache.org/jira/browse/HUDI-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761294#comment-17761294 ] Lokesh Jain commented on HUDI-6753: --- The stacktrace is always seen for this test, even when the test passes. It shows that the file schema and the requested schema differ while reading parquet: the file schema is the schema stored in the parquet file, and the requested schema is the schema used during the read. Note the `height` column, which is `int64` in the file schema but `fixed_len_byte_array(5)` in the requested schema. File schema:
{code:java}
message spark_schema {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required int64 timestamp;
  required binary _row_key (STRING);
  required binary partition_path (STRING);
  required binary trip_type (STRING);
  required binary rider (STRING);
  required binary driver (STRING);
  required double begin_lat;
  required double begin_lon;
  required double end_lat;
  required double end_lon;
  required int32 distance_in_meters;
  required int64 seconds_since_epoch;
  required float weight;
  required binary nation;
  required int32 current_date (DATE);
  required int64 current_ts;
  required int64 height (DECIMAL(10,6));
  required group city_to_state (MAP) {
    repeated group key_value {
      required binary key (STRING);
      required binary value (STRING);
    }
  }
  required group fare {
    required double amount;
    required binary currency (STRING);
  }
  required group tip_history (LIST) {
    repeated group list {
      required group element {
        required double amount;
        required binary currency (STRING);
      }
    }
  }
  required boolean _hoodie_is_deleted;
  required double haversine_distance;
}
{code}
Requested schema:
{code:java}
message triprec {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required int64 timestamp;
  required binary _row_key (STRING);
  required binary partition_path (STRING);
  required binary trip_type (STRING);
  required binary rider (STRING);
  required binary driver (STRING);
  required double begin_lat;
  required double begin_lon;
  required double end_lat;
  required double end_lon;
  required int32 distance_in_meters;
  required int64 seconds_since_epoch;
  required float weight;
  required binary nation;
  required int32 current_date (DATE);
  required int64 current_ts;
  required fixed_len_byte_array(5) height (DECIMAL(10,6));
  required group city_to_state (MAP) {
    repeated group key_value {
      required binary key (STRING);
      required binary value (STRING);
    }
  }
  required group fare {
    required double amount;
    required binary currency (STRING);
  }
  required group tip_history (LIST) {
    repeated group list {
      required group element {
        required double amount;
        required binary currency (STRING);
      }
    }
  }
  required boolean _hoodie_is_deleted;
  required double haversine_distance;
}
{code}
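A reader hitting this could diff the footer schema against the requested schema before scanning and produce a clearer error; a toy sketch (only the two relevant columns from the schemas above are modeled, and the names and structure are illustrative, not an actual Hudi or Parquet API):

```python
# Physical types per column, trimmed to the relevant fields of the two schemas.
file_schema = {
    "current_ts": "int64",
    "height": "int64 (DECIMAL(10,6))",
}
requested_schema = {
    "current_ts": "int64",
    "height": "fixed_len_byte_array(5) (DECIMAL(10,6))",
}

def schema_mismatches(file_s: dict, requested_s: dict) -> dict:
    # Columns whose physical type in the file footer differs from the type
    # the reader requested; these are the columns whose reads will fail.
    return {
        col: (file_s[col], requested_s[col])
        for col in file_s.keys() & requested_s.keys()
        if file_s[col] != requested_s[col]
    }

print(schema_mismatches(file_schema, requested_schema))
# {'height': ('int64 (DECIMAL(10,6))', 'fixed_len_byte_array(5) (DECIMAL(10,6))')}
```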
[jira] [Updated] (HUDI-6621) Add a downgrade step from 6 to 5 to detect new delete blocks
[ https://issues.apache.org/jira/browse/HUDI-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6621: -- Description: In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform compaction to handle v3 delete blocks created using the new format. Also with the addition of record index field in Metadata table schema, the downgrade needs to delete the metadata table to avoid column drop errors after downgrade. was:In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform compaction to handle v3 delete blocks created using the new format. > Add a downgrade step from 6 to 5 to detect new delete blocks > > > Key: HUDI-6621 > URL: https://issues.apache.org/jira/browse/HUDI-6621 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.14.0 > > > In table version 6, we introduce a new delete block format (v3) with Avro > serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform > compaction to handle v3 delete blocks created using the new format. > Also with the addition of record index field in Metadata table schema, the > downgrade needs to delete the metadata table to avoid column drop errors > after downgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6621) Add a downgrade step from 6 to 5 to detect new delete blocks
[ https://issues.apache.org/jira/browse/HUDI-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6621: -- Description: In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform compaction to handle v3 delete blocks created using the new format. (was: In table version 6, we introduce a new delete block format (v3) with Avro serde (HUDI-5760). For downgrading a table from v6 to v5, we need to check any v3 delete blocks using the new format and ask user to manually restore to a commit before any file slice with a v3 delete block.) > Add a downgrade step from 6 to 5 to detect new delete blocks > > > Key: HUDI-6621 > URL: https://issues.apache.org/jira/browse/HUDI-6621 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.14.0 > > > In table version 6, we introduce a new delete block format (v3) with Avro > serde (HUDI-5760). For downgrading a table from v6 to v5, we need to perform > compaction to handle v3 delete blocks created using the new format. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6717) Fix downgrade handler for 0.14.0
[ https://issues.apache.org/jira/browse/HUDI-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain closed HUDI-6717. - Resolution: Duplicate > Fix downgrade handler for 0.14.0 > > > Key: HUDI-6717 > URL: https://issues.apache.org/jira/browse/HUDI-6717 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Since the log block version (due to the delete block change) has been upgraded in > 0.14.0, the delete blocks cannot be read in 0.13.0 or earlier. > Similarly, the addition of the record level index field in the metadata table leads to > a column drop error on downgrade. The Jira aims to fix the downgrade handler to > trigger compaction and delete the metadata table if the user wishes to downgrade from > version 6 (0.14.0) to version 5 (0.13.0).
[jira] [Created] (HUDI-6726) Fix connection leaks related to file reader and iterator close
Lokesh Jain created HUDI-6726: - Summary: Fix connection leaks related to file reader and iterator close Key: HUDI-6726 URL: https://issues.apache.org/jira/browse/HUDI-6726 Project: Apache Hudi Issue Type: Bug Components: metadata, reader-core Reporter: Lokesh Jain The Jira aims to fix connection leaks caused by file readers, and the iterators used to iterate over their records, not being closed.
[jira] [Created] (HUDI-6717) Fix downgrade handler for 0.14.0
Lokesh Jain created HUDI-6717: - Summary: Fix downgrade handler for 0.14.0 Key: HUDI-6717 URL: https://issues.apache.org/jira/browse/HUDI-6717 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain Since the log block version (due to the delete block change) has been upgraded in 0.14.0, the delete blocks cannot be read in 0.13.0 or earlier. Similarly, the addition of the record level index field in the metadata table leads to a column drop error on downgrade. The Jira aims to fix the downgrade handler to trigger compaction and delete the metadata table if the user wishes to downgrade from version 6 (0.14.0) to version 5 (0.13.0).
[jira] [Closed] (HUDI-6677) Make HoodieRecordIndexInfo schema compatible with older versions
[ https://issues.apache.org/jira/browse/HUDI-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain closed HUDI-6677. - Resolution: Not A Problem > Make HoodieRecordIndexInfo schema compatible with older versions > > > Key: HUDI-6677 > URL: https://issues.apache.org/jira/browse/HUDI-6677 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Currently the metadata payload schema for record index can cause schema > evolution issues for existing Hudi tables. The Jira aims to fix these issues. > There are two schema evolution issues: > 1. The field name has changed from partition to partitionName. > 2. A new field, fileId, has been added in the middle of a nested schema.
[jira] [Created] (HUDI-6677) Make HoodieRecordIndexInfo schema compatible with older versions
Lokesh Jain created HUDI-6677: - Summary: Make HoodieRecordIndexInfo schema compatible with older versions Key: HUDI-6677 URL: https://issues.apache.org/jira/browse/HUDI-6677 Project: Apache Hudi Issue Type: Bug Components: metadata Reporter: Lokesh Jain Currently the metadata payload schema for record index can cause schema evolution issues for existing Hudi tables. The Jira aims to fix these issues. There are two schema evolution issues: 1. The field name has changed from partition to partitionName. 2. A new field, fileId, has been added in the middle of a nested schema.
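Why the two changes above break older tables: Avro resolves reader and writer schemas by field name, so a renamed field looks like a brand-new field with no data in old files, and a newly added field is only safe to read if it carries a default. A toy model of that rule follows (this is not Avro itself, and the empty-string defaults are hypothetical):

```python
def unresolvable_fields(writer_fields, reader_fields):
    # Toy version of Avro's name-based schema resolution: a reader field
    # that is absent from the writer schema and has no default cannot be
    # resolved, so reading old data fails.
    return [name for name, default in reader_fields
            if name not in writer_fields and default is None]

old_writer = {"partition"}  # field present in already-written metadata records

# Rename partition -> partitionName and insert fileId, both without defaults:
new_reader = [("partitionName", None), ("fileId", None)]
print(unresolvable_fields(old_writer, new_reader))  # ['partitionName', 'fileId']

# Adding defaults lets resolution succeed, but the renamed field silently
# loses the values that were stored under the old name:
new_reader_with_defaults = [("partitionName", ""), ("fileId", "")]
print(unresolvable_fields(old_writer, new_reader_with_defaults))  # []
```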
[jira] [Resolved] (HUDI-6459) Add Rollback and other tests for Record Level Index
[ https://issues.apache.org/jira/browse/HUDI-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain resolved HUDI-6459. --- > Add Rollback and other tests for Record Level Index > --- > > Key: HUDI-6459 > URL: https://issues.apache.org/jira/browse/HUDI-6459 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > > The Jira aims to add validation for rollback with record level index. The > validation is added in TestRecordLevelIndex test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6393) Add more RLI tests and fix HoodieTestTable accordingly
[ https://issues.apache.org/jira/browse/HUDI-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6393: -- Fix Version/s: 0.14.0 > Add more RLI tests and fix HoodieTestTable accordingly > -- > > Key: HUDI-6393 > URL: https://issues.apache.org/jira/browse/HUDI-6393 > Project: Apache Hudi > Issue Type: Test >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > https://github.com/apache/hudi/pull/8758#discussion_r1213866286 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-6393) Add more RLI tests and fix HoodieTestTable accordingly
[ https://issues.apache.org/jira/browse/HUDI-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain resolved HUDI-6393. --- > Add more RLI tests and fix HoodieTestTable accordingly > -- > > Key: HUDI-6393 > URL: https://issues.apache.org/jira/browse/HUDI-6393 > Project: Apache Hudi > Issue Type: Test >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > > https://github.com/apache/hudi/pull/8758#discussion_r1213866286 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6459) Add Rollback and other tests for Record Level Index
[ https://issues.apache.org/jira/browse/HUDI-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6459: -- Fix Version/s: 0.14.0 > Add Rollback and other tests for Record Level Index > --- > > Key: HUDI-6459 > URL: https://issues.apache.org/jira/browse/HUDI-6459 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > The Jira aims to add validation for rollback with record level index. The > validation is added in TestRecordLevelIndex test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled
[ https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain resolved HUDI-6660. --- > For merge into use primary key constraint when optimized writes are enabled > --- > > Key: HUDI-6660 > URL: https://issues.apache.org/jira/browse/HUDI-6660 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Currently merge into fails when primary key is a complex key and join > conditions do not include all the primary key columns. The Jira aims to > restrict the constraint only when optimized writes are enabled. With the > optimized writes disabled, merge into doesn't update the records if the > primary key does not match completely. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled
[ https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6660: -- Fix Version/s: 0.14.0 > For merge into use primary key constraint when optimized writes are enabled > --- > > Key: HUDI-6660 > URL: https://issues.apache.org/jira/browse/HUDI-6660 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Currently merge into fails when primary key is a complex key and join > conditions do not include all the primary key columns. The Jira aims to > restrict the constraint only when optimized writes are enabled. With the > optimized writes disabled, merge into doesn't update the records if the > primary key does not match completely. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled
[ https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6660: -- Description: Currently merge into fails when primary key is a complex key and join conditions do not include all the primary key columns. The Jira aims to restrict the constraint only when optimized writes are enabled. With the optimized writes disabled, merge into doesn't update the records if the primary key does not match completely. (was: Currently merge into fails when primary key is a complex key and join conditions do not include all the primary key columns. The Jira aims to relax this requirement to allow join even on a subset of primary key columns.) > For merge into use primary key constraint when optimized writes are enabled > --- > > Key: HUDI-6660 > URL: https://issues.apache.org/jira/browse/HUDI-6660 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Currently merge into fails when primary key is a complex key and join > conditions do not include all the primary key columns. The Jira aims to > restrict the constraint only when optimized writes are enabled. With the > optimized writes disabled, merge into doesn't update the records if the > primary key does not match completely. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled
[ https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6660: -- Summary: For merge into use primary key constraint when optimized writes are enabled (was: Primary key constraint should be applicable when optimized writes are enabled) > For merge into use primary key constraint when optimized writes are enabled > --- > > Key: HUDI-6660 > URL: https://issues.apache.org/jira/browse/HUDI-6660 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Currently merge into fails when primary key is a complex key and join > conditions do not include all the primary key columns. The Jira aims to relax > this requirement to allow join even on a subset of primary key columns. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6660) Primary key constraint should be applicable when optimized writes are enabled
[ https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6660: -- Summary: Primary key constraint should be applicable when optimized writes are enabled (was: Relax primary key constraint for merge into join condition) > Primary key constraint should be applicable when optimized writes are enabled > - > > Key: HUDI-6660 > URL: https://issues.apache.org/jira/browse/HUDI-6660 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Currently merge into fails when primary key is a complex key and join > conditions do not include all the primary key columns. The Jira aims to relax > this requirement to allow join even on a subset of primary key columns. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6660) Relax primary key constraint for merge into join condition
Lokesh Jain created HUDI-6660:
------------------------------
    Summary: Relax primary key constraint for merge into join condition
    Key: HUDI-6660
    URL: https://issues.apache.org/jira/browse/HUDI-6660
    Project: Apache Hudi
    Issue Type: Bug
    Components: spark-sql
    Reporter: Lokesh Jain

Currently merge into fails when the primary key is a complex key and the join conditions do not include all the primary key columns. This Jira aims to relax that requirement and allow a join on a subset of the primary key columns.
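A minimal Spark SQL sketch of the failing shape described above, using hypothetical table and column names: a Hudi table whose record key is a complex key of two columns, merged with a join condition covering only one of them.

```sql
-- Hypothetical Hudi table with a complex record key (id, region).
CREATE TABLE hudi_target (
  id BIGINT,
  region STRING,
  amount DOUBLE,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'id,region',
  preCombineField = 'ts'
);

-- The join condition below covers only a subset of the primary key
-- columns (id, but not region); today this fails, and the Jira aims
-- to allow it.
MERGE INTO hudi_target t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```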
[jira] [Resolved] (HUDI-6606) Use record level index with SQL equality queries
[ https://issues.apache.org/jira/browse/HUDI-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain resolved HUDI-6606.
-------------------------------

> Key: HUDI-6606
> URL: https://issues.apache.org/jira/browse/HUDI-6606
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
>
> With record level index support in Hudi, SQL queries on record keys can
> leverage the record index. This Jira aims to add support for equality
> matches on record keys.
[jira] [Updated] (HUDI-6606) Use record level index with SQL equality queries
[ https://issues.apache.org/jira/browse/HUDI-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-6606:
------------------------------
    Fix Version/s: 0.14.0

> Key: HUDI-6606
> URL: https://issues.apache.org/jira/browse/HUDI-6606
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.0
>
> With record level index support in Hudi, SQL queries on record keys can
> leverage the record index. This Jira aims to add support for equality
> matches on record keys.
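A sketch of the kind of query the record level index can serve, with hypothetical table and column names; the config key shown is an assumption about how the record index is enabled in the metadata table, so verify it against your Hudi version.

```sql
-- Assumed config for building the record index in the metadata table
-- (hypothetical; check your Hudi release docs for the exact key).
SET hoodie.metadata.record.index.enable = true;

-- An equality predicate on the record key column can be answered by a
-- point lookup in the record level index, pruning file groups instead
-- of scanning every file.
SELECT * FROM hudi_table WHERE record_key_col = 'key-123';
```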
[jira] [Resolved] (HUDI-6651) Support IN SQL query with Record Index
[ https://issues.apache.org/jira/browse/HUDI-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain resolved HUDI-6651.
-------------------------------

> Key: HUDI-6651
> URL: https://issues.apache.org/jira/browse/HUDI-6651
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
>
> Currently record index based pruning is only supported for EqualTo queries
> on a record key. This Jira aims to add support for IN queries as well.
[jira] [Updated] (HUDI-6651) Support IN SQL query with Record Index
[ https://issues.apache.org/jira/browse/HUDI-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-6651:
------------------------------
    Fix Version/s: 0.14.0

> Key: HUDI-6651
> URL: https://issues.apache.org/jira/browse/HUDI-6651
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.0
>
> Currently record index based pruning is only supported for EqualTo queries
> on a record key. This Jira aims to add support for IN queries as well.
[jira] [Created] (HUDI-6657) Investigate records returned when full table scan is enabled for incremental query
Lokesh Jain created HUDI-6657:
------------------------------
    Summary: Investigate records returned when full table scan is enabled for incremental query
    Key: HUDI-6657
    URL: https://issues.apache.org/jira/browse/HUDI-6657
    Project: Apache Hudi
    Issue Type: Bug
    Components: incremental-query
    Reporter: Lokesh Jain

HUDI-6649 adds assertions for SQL queries with the column stats index enabled. One of the assertions in the test `org.apache.hudi.functional.TestColumnStatsIndexWithSQL#verifyFileIndexAndSQLQueries` fails when full table scan is enabled. With full table scan disabled, the incremental query returns the correct results. This Jira aims to investigate and fix the query results.
[jira] [Created] (HUDI-6656) SQL query should also consider partition filters while querying column stats
Lokesh Jain created HUDI-6656:
------------------------------
    Summary: SQL query should also consider partition filters while querying column stats
    Key: HUDI-6656
    URL: https://issues.apache.org/jira/browse/HUDI-6656
    Project: Apache Hudi
    Issue Type: Bug
    Reporter: Lokesh Jain

Currently only data filters are used while querying the required file slices from column stats. This Jira aims to also include partition filters while querying column stats. Today, files from pruned partitions are still included after the column stats lookup, although those files are filtered out in subsequent steps.
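A sketch of a query mixing the two filter kinds, with hypothetical table and column names: per the description above, only the data filter is pushed into the column stats lookup, and the partition filter is applied afterwards.

```sql
SELECT * FROM hudi_table
WHERE price > 100         -- data filter: pushed into the column stats lookup
  AND dt = '2023-08-01';  -- partition filter: currently applied only after
                          -- the stats lookup, so files from pruned
                          -- partitions are still looked up first
```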
[jira] [Created] (HUDI-6651) Support IN SQL query with Record Index
Lokesh Jain created HUDI-6651:
------------------------------
    Summary: Support IN SQL query with Record Index
    Key: HUDI-6651
    URL: https://issues.apache.org/jira/browse/HUDI-6651
    Project: Apache Hudi
    Issue Type: Bug
    Components: index
    Reporter: Lokesh Jain
    Assignee: Lokesh Jain

Currently record index based pruning is only supported for EqualTo queries on a record key. This Jira aims to add support for IN queries as well.
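A sketch contrasting the already supported EqualTo form with the IN form this Jira adds, using hypothetical table and column names:

```sql
-- Supported today: EqualTo on the record key prunes via the record index.
SELECT * FROM hudi_table WHERE record_key_col = 'key-1';

-- Added by this Jira: IN on the record key should prune the same way,
-- by looking up each key in the record index.
SELECT * FROM hudi_table WHERE record_key_col IN ('key-1', 'key-2', 'key-3');
```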
[jira] [Updated] (HUDI-6649) Fix column stat based data filtering for MOR
[ https://issues.apache.org/jira/browse/HUDI-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-6649:
------------------------------
    Description: Currently the MOR snapshot relation does not use the column stats index for pruning the files in its queries. This Jira aims to add support for pruning the file slices based on column stats in case of MOR.  (was: Currently the MOR snapshot and incremental relations do not use the column stats index for pruning the files in their queries. This Jira aims to add support for pruning the file slices based on column stats in case of MOR.)

> Key: HUDI-6649
> URL: https://issues.apache.org/jira/browse/HUDI-6649
> Project: Apache Hudi
> Issue Type: Bug
> Components: index, writer-core
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
>
> Currently the MOR snapshot relation does not use the column stats index for
> pruning the files in its queries. This Jira aims to add support for pruning
> the file slices based on column stats in case of MOR.
[jira] [Created] (HUDI-6649) Fix column stat based data filtering for MOR
Lokesh Jain created HUDI-6649:
------------------------------
    Summary: Fix column stat based data filtering for MOR
    Key: HUDI-6649
    URL: https://issues.apache.org/jira/browse/HUDI-6649
    Project: Apache Hudi
    Issue Type: Bug
    Components: index, writer-core
    Reporter: Lokesh Jain
    Assignee: Lokesh Jain

Currently the MOR snapshot and incremental relations do not use the column stats index for pruning the files in their queries. This Jira aims to add support for pruning the file slices based on column stats.
[jira] [Updated] (HUDI-6649) Fix column stat based data filtering for MOR
[ https://issues.apache.org/jira/browse/HUDI-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-6649:
------------------------------
    Description: Currently the MOR snapshot and incremental relations do not use the column stats index for pruning the files in their queries. This Jira aims to add support for pruning the file slices based on column stats in case of MOR.  (was: Currently the MOR snapshot and incremental relations do not use the column stats index for pruning the files in their queries. This Jira aims to add support for pruning the file slices based on column stats.)

> Key: HUDI-6649
> URL: https://issues.apache.org/jira/browse/HUDI-6649
> Project: Apache Hudi
> Issue Type: Bug
> Components: index, writer-core
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
>
> Currently the MOR snapshot and incremental relations do not use the column
> stats index for pruning the files in their queries. This Jira aims to add
> support for pruning the file slices based on column stats in case of MOR.
[jira] [Created] (HUDI-6631) RLI integration with SQL queries followup
Lokesh Jain created HUDI-6631:
------------------------------
    Summary: RLI integration with SQL queries followup
    Key: HUDI-6631
    URL: https://issues.apache.org/jira/browse/HUDI-6631
    Project: Apache Hudi
    Issue Type: Bug
    Affects Versions: 1.0.0
    Reporter: Lokesh Jain
    Assignee: Lokesh Jain

HUDI-6606 adds support for EqualTo queries with a simple record index. This Jira aims to add further support for the following use cases:
1. Queries on multiple columns, i.e. integrating queries across different indices such as column stats and RLI.
2. Other key generator types for RLI; HUDI-6606 is limited to a simple record key.
3. Other query types such as IN, >, <, etc.
[jira] [Updated] (HUDI-6606) Use record level index with SQL equality queries
[ https://issues.apache.org/jira/browse/HUDI-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain updated HUDI-6606:
------------------------------
    Issue Type: Improvement  (was: Bug)

> Key: HUDI-6606
> URL: https://issues.apache.org/jira/browse/HUDI-6606
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Lokesh Jain
> Assignee: Lokesh Jain
> Priority: Major
> Labels: pull-request-available
>
> With record level index support in Hudi, SQL queries on record keys can
> leverage the record index. This Jira aims to add support for equality
> matches on record keys.