[jira] [Updated] (HUDI-7996) Store partition type with partition fields in table configs

2024-07-16 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7996:
--
Summary: Store partition type with partition fields in table configs  (was: 
Store partition field type in table configs)

> Store partition type with partition fields in table configs
> ---
>
> Key: HUDI-7996
> URL: https://issues.apache.org/jira/browse/HUDI-7996
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7996) Store partition field type in table configs

2024-07-16 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7996:
-

 Summary: Store partition field type in table configs
 Key: HUDI-7996
 URL: https://issues.apache.org/jira/browse/HUDI-7996
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Lokesh Jain
Assignee: Lokesh Jain








[jira] [Created] (HUDI-7983) CDC query fails with ParanamerAnnotationIntrospector class not found

2024-07-12 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7983:
-

 Summary: CDC query fails with ParanamerAnnotationIntrospector 
class not found
 Key: HUDI-7983
 URL: https://issues.apache.org/jira/browse/HUDI-7983
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain


Running a CDC query fails with the following error:
java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/paranamer/ParanamerAnnotationIntrospector
{code:java}
scala> spark.read.option("hoodie.datasource.read.begin.instanttime", 0).
 |   option("hoodie.datasource.query.type", "incremental").
 |   option("hoodie.datasource.query.incremental.format", "cdc").
 |   format("hudi").load(basePath).show(false)
24/07/12 16:16:49 ERROR Executor: Exception in task 0.0 in stage 127.0 (TID 227)
java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/paranamer/ParanamerAnnotationIntrospector
    at org.apache.hudi.cdc.InternalRowToJsonStringConverter.mapper$lzycompute(InternalRowToJsonStringConverter.scala:36)
    at org.apache.hudi.cdc.InternalRowToJsonStringConverter.mapper(InternalRowToJsonStringConverter.scala:32)
    at org.apache.hudi.cdc.InternalRowToJsonStringConverter.convert(InternalRowToJsonStringConverter.scala:50)
    at org.apache.hudi.cdc.CDCFileGroupIterator.convertRowToJsonString(CDCFileGroupIterator.scala:515)
    at org.apache.hudi.cdc.CDCFileGroupIterator.loadNext(CDCFileGroupIterator.scala:250)
    at org.apache.hudi.cdc.CDCFileGroupIterator.hasNextInternal(CDCFileGroupIterator.scala:218)
    at org.apache.hudi.cdc.CDCFileGroupIterator.hasNext(CDCFileGroupIterator.scala:239)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 25 more
{code}





[jira] [Updated] (HUDI-7970) Add support to read partition fields when partition type is also stored in table config

2024-07-09 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7970:
--
Summary: Add support to read partition fields when partition type is also 
stored in table config  (was: Add support to read partition fields when 
partition type is stored in table config)

> Add support to read partition fields when partition type is also stored in 
> table config
> ---
>
> Key: HUDI-7970
> URL: https://issues.apache.org/jira/browse/HUDI-7970
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> In HUDI-7902, we will modify the config value `hoodie.table.partition.fields` 
> to also store partition type. This PR aims to make sure that the getter and 
> other functions accessing this field remain consistent in behaviour with the 
> new value type.





[jira] [Updated] (HUDI-7970) Add support to read partition fields when partition type is stored in table config

2024-07-09 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7970:
--
Description: In HUDI-7902, we will modify the config value 
`hoodie.table.partition.fields` to also store partition type. This PR aims to 
make sure that the getter and other functions accessing this field remain 
consistent in behaviour with the new value type.

> Add support to read partition fields when partition type is stored in table 
> config
> --
>
> Key: HUDI-7970
> URL: https://issues.apache.org/jira/browse/HUDI-7970
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> In HUDI-7902, we will modify the config value `hoodie.table.partition.fields` 
> to also store partition type. This PR aims to make sure that the getter and 
> other functions accessing this field remain consistent in behaviour with the 
> new value type.





[jira] [Created] (HUDI-7970) Add support to read partition fields when partition type is stored in table config

2024-07-09 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7970:
-

 Summary: Add support to read partition fields when partition type 
is stored in table config
 Key: HUDI-7970
 URL: https://issues.apache.org/jira/browse/HUDI-7970
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Lokesh Jain
Assignee: Lokesh Jain








[jira] [Commented] (HUDI-7902) Add a new table property to store partition field types for custom key generator

2024-07-05 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863211#comment-17863211
 ] 

Lokesh Jain commented on HUDI-7902:
---

The approach would be to reuse the key `hoodie.table.partition.fields` to store 
the partition key type for the custom key generator. We will break the change 
into multiple parts:
1. Reader-side change - We will ensure the reader can read the table in a 
backward-compatible manner based on the table version. If the table version is 
older, it should fall back to the older way of interpreting the key; otherwise 
it should use the new one. This would also ensure that other users of the key 
are handled properly.
2. Writer-side change - Here we will ensure that the new values are reflected in 
the table config and updated in hoodie.properties as part of upgrade 
handling. We will also need to handle downgrade for this change.
3. File index change - We need to update the file index so that 
_partitionSchemaFromProperties handles the custom key generator and assigns 
the string type to timestamp-based partition types.
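
As a rough illustration of the reader-side fallback in point 1, the sketch 
below parses a `hoodie.table.partition.fields` value that may or may not carry 
a `:type` suffix on each entry. The class and method names are hypothetical and 
chosen only for this example; this is not Hudi's actual implementation, and it 
ignores the table-version check for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Hudi: parses "f1,f2" (older format)
// as well as "f1:TYPE1,f2:TYPE2" (format that also stores the type).
public class PartitionFieldSketch {

    /** One comma-separated entry; type is empty for the older name-only format. */
    public static final class PartitionField {
        public final String name;
        public final Optional<String> type;

        PartitionField(String name, Optional<String> type) {
            this.name = name;
            this.type = type;
        }
    }

    public static List<PartitionField> parse(String configValue) {
        List<PartitionField> fields = new ArrayList<>();
        if (configValue == null || configValue.trim().isEmpty()) {
            return fields;
        }
        for (String entry : configValue.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int sep = trimmed.lastIndexOf(':');
            if (sep < 0) {
                // Older format: only the field name is stored.
                fields.add(new PartitionField(trimmed, Optional.empty()));
            } else {
                // Newer format: "field:type".
                fields.add(new PartitionField(
                    trimmed.substring(0, sep),
                    Optional.of(trimmed.substring(sep + 1))));
            }
        }
        return fields;
    }

    /** Field names only, so callers that expect the old value keep working. */
    public static List<String> fieldNames(String configValue) {
        List<String> names = new ArrayList<>();
        for (PartitionField field : parse(configValue)) {
            names.add(field.name);
        }
        return names;
    }
}
```

With a getter like `fieldNames`, existing callers that only need the field 
names would see the same result for both formats, which is the consistency the 
reader-side change is after.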

> Add a new table property to store partition field types for custom key 
> generator
> 
>
> Key: HUDI-7902
> URL: https://issues.apache.org/jira/browse/HUDI-7902
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> As of today, writing a partitioned table with CustomKeyGenerator requires 
> the partition field types to be supplied through a write config (they are 
> not stored as a table property). CustomKeyGenerator requires partition keys 
> to be in the format {{field:type}} (e.g. {{column1:SIMPLE}}). However, only 
> the field names are stored in the {{hoodie.properties}} file. We need to 
> store the field types too so that, without the write config, the writer can 
> configure the correct partition field name and type in Spark datasource and 
> SQL DML writes.





[jira] [Created] (HUDI-7956) Handle upgrade downgrade with cluster action type for pending clustering instants

2024-07-05 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7956:
-

 Summary: Handle upgrade downgrade with cluster action type for 
pending clustering instants
 Key: HUDI-7956
 URL: https://issues.apache.org/jira/browse/HUDI-7956
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Lokesh Jain
Assignee: Lokesh Jain


https://issues.apache.org/jira/browse/HUDI-7905 adds a new cluster action type 
for all pending clustering instants. The completed instant still uses the 
completed replacecommit action.
This Jira aims to handle upgrade and downgrade of existing tables with this 
change.





[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants

2024-07-05 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7905:
--
Status: Patch Available  (was: In Progress)

> Use cluster action for clustering pending instants
> --
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action for requested and inflight 
> instants. This simplifies a few things: for example, we no longer need to 
> scan replacecommit.requested to determine whether we are looking at a 
> clustering plan. This would also simplify the usage of the APIs related to 
> pending clustering. 





[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-07-05 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7779:
--
Status: Patch Available  (was: In Progress)

> Guarding archival to not archive unintended commits
> ---
>
> Key: HUDI-7779
> URL: https://issues.apache.org/jira/browse/HUDI-7779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from the active timeline could, on rare occasions, lead to 
> data consistency issues. We should add proper guards to ensure that we do 
> not archive commits unintentionally. 
>  
> The major gap we want to guard against is this: if someone has disabled the 
> cleaner, archival should account for data consistency issues and bail out.
> We have a base guarding condition, where archival stops at the earliest 
> commit to retain based on the latest clean commit metadata. But there are a 
> few other scenarios that need to be accounted for. 
>  
> a. Setting replace commits aside, let's look at the specifics for regular 
> commits and delta commits.
> Say the user configured the cleaner to retain 4 commits and set the archival 
> configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
> versions created at or before t6. Say the cleaner did not run, for whatever 
> reason, for the next 5 commits. 
> Archival will certainly be guarded until the earliest commit to retain based 
> on the latest clean commit. 
> Corner case to consider: 
> A savepoint was added at, say, t3 and later removed, and the cleaner was 
> still never re-enabled. Archival would have stopped at t3 while the savepoint 
> was present, but once the savepoint is removed, archival could archive 
> commit t3 if it is executed. This means the file versions tracked at t3 have 
> not yet been cleaned by the cleaner. 
> Reasoning: 
> We are good here with respect to data consistency. Until the cleaner runs 
> next, these older file versions might be exposed to the end user. But time 
> travel queries are not intended for already cleaned-up commits, so this is 
> not an issue. None of the snapshot, time travel, or incremental queries will 
> run into issues, as they are not supposed to poll for t3. 
> At any later point, if the cleaner is re-enabled, it will take care of 
> cleaning up the file versions tracked at commit t3. It is just that, for the 
> interim period, some older file versions might still be exposed to readers. 
>  
> b. The trickier part is when replace commits are involved. Since the replace 
> commit metadata in the active timeline is what ensures that the replaced file 
> groups are ignored for reads, the cleaner is expected to clean them up fully 
> before that metadata is archived. But could this go wrong? 
> Corner case to consider: let's extend the above scenario so that t3 has a 
> savepoint, and t4 is a replace commit that replaced file groups tracked in 
> t3. 
> The cleaner will skip cleaning up files tracked by t3 (due to the presence of 
> the savepoint), but will clean up t4, t5 and t6. So the earliest commit to 
> retain will point to t6. Now say the savepoint for t3 is removed, but the 
> cleaner was disabled. In this state of the timeline, if archival is executed, 
> it might archive t3 and t4.rc (since t3.savepoint is removed). This could 
> lead to duplicate data, as both the replaced file groups and the new file 
> groups from t4.rc would be exposed as valid file groups. 
>  
> In other words, to summarize the different scenarios: 
> i. The replaced file group is never cleaned up. 
>     - ECTR (earliest commit to retain) is less than this.rc, and we are good. 
> ii. The replaced file group is cleaned up. 
>     - ECTR is greater than this.rc, and it is good to archive.
> iii. Tricky: ECTR moved ahead of this.rc, but due to the savepoint, the full 
> cleanup did not happen. After the savepoint is removed and archival is 
> executed, we should avoid archiving the rc of interest. This is the gap we 
> do not account for as of now.
>  
> We have 3 options to solve this.
> Option A: 
> Let the savepoint deletion flow take care of cleaning up the files it is 
> tracking. 
> Cons:
> Removing data files is not the savepoint's responsibility. So, from a 
> single-responsibility standpoint, this may not be right. Also, this cleanup 
> might need to do what a clean planner already does, i.e. build the file 
> system view, determine whether a file is supposed to have been cleaned up 
> already, and then clean up only the files that are supposed to be cleaned 
> up. For example, a file group that has only one file slice should not be 
> cleaned up, along with similar scenarios. 
>  
> Option B:
> Since archival is the one which might cause data consistency issues, 

[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants

2024-07-05 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7905:
--
Epic Link: HUDI-7856

> Use cluster action for clustering pending instants
> --
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action for requested and inflight 
> instants. This simplifies a few things: for example, we no longer need to 
> scan replacecommit.requested to determine whether we are looking at a 
> clustering plan. This would also simplify the usage of the APIs related to 
> pending clustering. 





[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants

2024-07-02 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7905:
--
Description: Currently, we use replacecommit for clustering, insert 
overwrite and delete partition. Clustering should be a separate action for 
requested and inflight instant. This simplifies a few things such as we do not 
need to scan the replacecommit.requested to determine whether we are looking at 
clustering plan or not. This would simplify the usage of pending clustering 
related APIs.   (was: Currently, we use replacecommit for clustering, insert 
overwrite and delete partition. Clustering should be a separate action. This 
simplifies a few things such as we do not need to scan the 
replacecommit.requested to determine whether we are looking at clustering plan 
or not. This also standardizes the usage of replacecommit to some extent 
(related to HUDI-1739).)

> Use cluster action for clustering pending instants
> --
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action for requested and inflight 
> instants. This simplifies a few things: for example, we no longer need to 
> scan replacecommit.requested to determine whether we are looking at a 
> clustering plan. This would also simplify the usage of the APIs related to 
> pending clustering. 





[jira] [Updated] (HUDI-7905) Use cluster action for clustering pending instants

2024-07-02 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7905:
--
Summary: Use cluster action for clustering pending instants  (was: New 
Action for Clustering)

> Use cluster action for clustering pending instants
> --
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action. This simplifies a few 
> things such as we do not need to scan the replacecommit.requested to 
> determine whether we are looking at clustering plan or not. This also 
> standardizes the usage of replacecommit to some extent (related to HUDI-1739).





[jira] [Closed] (HUDI-7877) Add record position to record index metadata payload

2024-07-02 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain closed HUDI-7877.
-
Resolution: Fixed

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that it can be used in the index lookup.





[jira] [Created] (HUDI-7939) Validate file slices upto a commit in HoodieTableMetadataValidator

2024-06-28 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7939:
-

 Summary: Validate file slices upto a commit in 
HoodieTableMetadataValidator
 Key: HUDI-7939
 URL: https://issues.apache.org/jira/browse/HUDI-7939
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: Lokesh Jain


Currently, HoodieTableMetadataValidator validates all the available file slices 
and compares the metadata table fs view against the filesystem-based fs view. 
This Jira aims to use the last completed instant and query the views as of 
this instant for comparison.





[jira] [Assigned] (HUDI-7939) Validate file slices upto a commit in HoodieTableMetadataValidator

2024-06-28 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain reassigned HUDI-7939:
-

Assignee: Lokesh Jain

> Validate file slices upto a commit in HoodieTableMetadataValidator
> --
>
> Key: HUDI-7939
> URL: https://issues.apache.org/jira/browse/HUDI-7939
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> Currently, HoodieTableMetadataValidator validates all the available file 
> slices and compares the metadata table fs view against the filesystem-based 
> fs view. This Jira aims to use the last completed instant and query the 
> views as of this instant for comparison.





[jira] [Assigned] (HUDI-7931) Initialize timeline for data table meta client while initializing HoodieBackedTableMetadata

2024-06-26 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain reassigned HUDI-7931:
-

Assignee: Lokesh Jain

> Initialize timeline for data table meta client while initializing 
> HoodieBackedTableMetadata
> ---
>
> Key: HUDI-7931
> URL: https://issues.apache.org/jira/browse/HUDI-7931
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> Currently, while initializing HoodieBackedTableMetadata, the data table 
> timeline is not initialized by default, whereas the metadataMetaClient 
> timeline is initialized while creating metadataFileSystemView. In this Jira 
> we aim to initialize the timeline for dataMetaClient so that, when the log 
> record scanner is created, the dataMetaClient timeline is in sync with the 
> metadataMetaClient timeline.





[jira] [Created] (HUDI-7931) Initialize timeline for data table meta client while initializing HoodieBackedTableMetadata

2024-06-26 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7931:
-

 Summary: Initialize timeline for data table meta client while 
initializing HoodieBackedTableMetadata
 Key: HUDI-7931
 URL: https://issues.apache.org/jira/browse/HUDI-7931
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: Lokesh Jain


Currently, while initializing HoodieBackedTableMetadata, the data table 
timeline is not initialized by default, whereas the metadataMetaClient timeline 
is initialized while creating metadataFileSystemView. In this Jira we aim to 
initialize the timeline for dataMetaClient so that, when the log record scanner 
is created, the dataMetaClient timeline is in sync with the metadataMetaClient 
timeline.






[jira] [Resolved] (HUDI-7395) Fix computation for metrics in HoodieMetadataMetrics

2024-06-24 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HUDI-7395.
---

> Fix computation for metrics in HoodieMetadataMetrics
> 
>
> Key: HUDI-7395
> URL: https://issues.apache.org/jira/browse/HUDI-7395
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, metrics
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> For some metric types, such as duration, we are using incrementMetric 
> instead of setMetric.
> Also, some redundant metrics are removed. For example, a count-type metric 
> has both count and duration metrics pushed even though the duration is not 
> calculated.
> A file lookup count metric is added for bloom filter and column stats.





[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload

2024-06-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7877:
--
Status: Patch Available  (was: In Progress)

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that it can be used in the index lookup.





[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping

2024-06-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7841:
--
Status: Patch Available  (was: In Progress)

> RLI and secondary index should consider only pruned partitions for file 
> skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it tries to get those candidate 
> files by iterating over all files from file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use the `prunedPartitionsAndFileSlices` to only consider 
> pruned partitions whenever there is a partition predicate.





[jira] [Updated] (HUDI-7877) Add record position to record index metadata payload

2024-06-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7877:
--
Status: In Progress  (was: Open)

> Add record position to record index metadata payload
> 
>
> Key: HUDI-7877
> URL: https://issues.apache.org/jira/browse/HUDI-7877
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> RLI should save the record position so that it can be used in the index lookup.





[jira] [Updated] (HUDI-7905) New Action for Clustering

2024-06-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7905:
--
Status: In Progress  (was: Open)

> New Action for Clustering
> -
>
> Key: HUDI-7905
> URL: https://issues.apache.org/jira/browse/HUDI-7905
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, we use replacecommit for clustering, insert overwrite and delete 
> partition. Clustering should be a separate action. This simplifies a few 
> things such as we do not need to scan the replacecommit.requested to 
> determine whether we are looking at clustering plan or not. This also 
> standardizes the usage of replacecommit to some extent (related to HUDI-1739).





[jira] [Updated] (HUDI-7841) RLI and secondary index should consider only pruned partitions for file skipping

2024-06-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7841:
--
Status: In Progress  (was: Open)

> RLI and secondary index should consider only pruned partitions for file 
> skipping
> 
>
> Key: HUDI-7841
> URL: https://issues.apache.org/jira/browse/HUDI-7841
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Even though RLI scans only matching files, it tries to get those candidate 
> files by iterating over all files from file index. See - 
> [https://github.com/apache/hudi/blob/f4be74c29471fbd6afff472f8db292e6b1f16f05/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala#L47]
> Instead, it can use the `prunedPartitionsAndFileSlices` to only consider 
> pruned partitions whenever there is a partition predicate.





[jira] [Updated] (HUDI-7719) Introduce capability to specify config value as a time duration

2024-05-06 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7719:
--
Summary: Introduce capability to specify config value as a time duration  
(was: Introduce capability to specify ConfigProperty as a time duration)

> Introduce capability to specify config value as a time duration
> ---
>
> Key: HUDI-7719
> URL: https://issues.apache.org/jira/browse/HUDI-7719
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>
> Currently, config values are specified in seconds or another fixed chrono 
> unit. We should also support specifying these config values as a time 
> duration, such as 5m or 60s, for ease of use.





[jira] [Created] (HUDI-7719) Introduce capability to specify ConfigProperty as a time duration

2024-05-06 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7719:
-

 Summary: Introduce capability to specify ConfigProperty as a time 
duration
 Key: HUDI-7719
 URL: https://issues.apache.org/jira/browse/HUDI-7719
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain


Currently, config values are specified in seconds or another fixed time unit. 
We should also support specifying these values as a time duration, such as 
5m or 60s, for ease of use.
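
One way the proposal could look is sketched below. This is only an 
illustration, not the Hudi implementation: the class DurationConfigParser, 
the accepted suffixes (ms, s, m, h, d) and the defaultUnit parameter are all 
assumptions. A bare number falls back to the unit the config already uses, 
so existing values stay valid.

```java
import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Hudi class): parses "5m", "60s" or a bare number.
final class DurationConfigParser {
  // Digits followed by an optional unit suffix; the suffix set is an assumption.
  private static final Pattern FORMAT = Pattern.compile("(\\d+)\\s*(ms|s|m|h|d)?");

  private DurationConfigParser() {}

  static Duration parse(String raw, ChronoUnit defaultUnit) {
    Matcher m = FORMAT.matcher(raw.trim().toLowerCase(Locale.ROOT));
    if (!m.matches()) {
      throw new IllegalArgumentException("Invalid duration value: " + raw);
    }
    long amount = Long.parseLong(m.group(1));
    String suffix = m.group(2);
    if (suffix == null) {
      // No suffix: keep the old meaning of the config (its fixed unit).
      return Duration.of(amount, defaultUnit);
    }
    switch (suffix) {
      case "ms": return Duration.ofMillis(amount);
      case "s":  return Duration.ofSeconds(amount);
      case "m":  return Duration.ofMinutes(amount);
      case "h":  return Duration.ofHours(amount);
      default:   return Duration.ofDays(amount); // "d"
    }
  }
}
```

With this sketch, parse("5m", ChronoUnit.SECONDS) and 
parse("300", ChronoUnit.SECONDS) yield the same Duration.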





[jira] [Created] (HUDI-7571) Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode

2024-04-03 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7571:
-

 Summary: Add api to get exception details in 
HoodieMetadataTableValidator with ignoreFailed mode
 Key: HUDI-7571
 URL: https://issues.apache.org/jira/browse/HUDI-7571
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain


When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failures and 
continues the validation. This Jira adds an API to get the list of exceptions 
and an API to check whether a validation exception was thrown.
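
The shape of such an API can be sketched as follows. The class and method 
names (IgnoreFailedValidator, runCheck, hasValidationFailure, getThrowables) 
are hypothetical and are not taken from HoodieMetadataTableValidator; the 
sketch only shows the idea of recording failures while continuing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a validator that supports an ignoreFailed mode.
final class IgnoreFailedValidator {
  private final boolean ignoreFailed;
  private final List<Throwable> throwables = new ArrayList<>();

  IgnoreFailedValidator(boolean ignoreFailed) {
    this.ignoreFailed = ignoreFailed;
  }

  // Runs one validation step; records the failure and rethrows it only
  // when ignoreFailed is disabled.
  void runCheck(Runnable check) {
    try {
      check.run();
    } catch (RuntimeException e) {
      throwables.add(e);
      if (!ignoreFailed) {
        throw e;
      }
    }
  }

  // True if any validation step threw an exception.
  boolean hasValidationFailure() {
    return !throwables.isEmpty();
  }

  // Read-only view of every exception seen so far.
  List<Throwable> getThrowables() {
    return Collections.unmodifiableList(throwables);
  }
}
```

A caller running in ignoreFailed mode can then inspect the collected 
exceptions after the run instead of relying on log output alone.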





[jira] [Updated] (HUDI-7524) Ensure existing hoodie.properties are not overwritten with HoodieTableMetaClient creation

2024-03-21 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7524:
--
Description: 
org.apache.hudi.common.table.HoodieTableMetaClient#initTableAndGetMetaClient 
can overwrite an existing `hoodie.properties` today. The Jira ensures that an 
error is thrown if the file already exists.

> Ensure existing hoodie.properties are not overwritten with 
> HoodieTableMetaClient creation
> -
>
> Key: HUDI-7524
> URL: https://issues.apache.org/jira/browse/HUDI-7524
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> org.apache.hudi.common.table.HoodieTableMetaClient#initTableAndGetMetaClient 
> can overwrite an existing `hoodie.properties` today. The Jira ensures that an 
> error is thrown if the file already exists.
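
The intended behavior can be illustrated on a local filesystem. This sketch 
uses java.nio rather than Hudi's storage layer, and TablePropertiesWriter is 
a hypothetical name; it only demonstrates the "create, but never overwrite" 
semantics described above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Properties;

// Hypothetical helper: writes a properties file only if it does not exist yet.
final class TablePropertiesWriter {
  private TablePropertiesWriter() {}

  static void createNew(Path propertiesFile, Properties props) {
    // CREATE_NEW makes the open fail if the file is already present,
    // so an existing file is never truncated or replaced.
    try (OutputStream out = Files.newOutputStream(
        propertiesFile, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
      props.store(out, "table properties");
    } catch (FileAlreadyExistsException e) {
      throw new IllegalStateException(
          "Refusing to overwrite existing file: " + propertiesFile, e);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The existence check and the file creation happen in a single call here, 
which avoids a separate "exists, then write" sequence that could race with 
another writer.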





[jira] [Created] (HUDI-7524) Ensure existing hoodie.properties are not overwritten with HoodieTableMetaClient creation

2024-03-21 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7524:
-

 Summary: Ensure existing hoodie.properties are not overwritten 
with HoodieTableMetaClient creation
 Key: HUDI-7524
 URL: https://issues.apache.org/jira/browse/HUDI-7524
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain








[jira] [Updated] (HUDI-7395) Fix computation for metrics in HoodieMetadataMetrics

2024-02-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7395:
--
Summary: Fix computation for metrics in HoodieMetadataMetrics  (was: Fix 
naming and computation for metrics in HoodieMetadataMetrics)

> Fix computation for metrics in HoodieMetadataMetrics
> 
>
> Key: HUDI-7395
> URL: https://issues.apache.org/jira/browse/HUDI-7395
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, metrics
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> For some metric types, such as duration, we are using incrementMetric 
> instead of setMetric.
> Also, some redundant metrics are removed. For example, a count-type metric 
> has both a count and a duration metric pushed even though the duration is 
> not calculated.
> A file lookup count metric is added for bloom filter and column stats.





[jira] [Created] (HUDI-7395) Fix naming and computation for metrics in HoodieMetadataMetrics

2024-02-08 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7395:
-

 Summary: Fix naming and computation for metrics in 
HoodieMetadataMetrics
 Key: HUDI-7395
 URL: https://issues.apache.org/jira/browse/HUDI-7395
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata, metrics
Reporter: Lokesh Jain
Assignee: Lokesh Jain


For some metric types, such as duration, we are using incrementMetric instead 
of setMetric.
Also, some redundant metrics are removed. For example, a count-type metric 
has both a count and a duration metric pushed even though the duration is not 
calculated.
A file lookup count metric is added for bloom filter and column stats.
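
The difference between the two calls can be shown with a toy registry. 
MetricRegistrySketch and the metric names below are hypothetical; only the 
incrementMetric and setMetric method names come from the description, and 
their semantics here (accumulate versus overwrite) are an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Toy registry, not HoodieMetadataMetrics: contrasts accumulate vs overwrite.
final class MetricRegistrySketch {
  private final Map<String, Long> values = new HashMap<>();

  // Adds delta to the current value; suited to counters.
  void incrementMetric(String name, long delta) {
    values.merge(name, delta, Long::sum);
  }

  // Replaces the current value; suited to gauges such as a duration.
  void setMetric(String name, long value) {
    values.put(name, value);
  }

  long get(String name) {
    return values.getOrDefault(name, 0L);
  }
}
```

If two lookups take 120 ms and 80 ms, incrementing a duration metric reports 
200 while setting it reports 80, the duration of the latest lookup.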





[jira] [Updated] (HUDI-7391) HoodieMetadataMetrics should use Metrics instance for metrics registry

2024-02-06 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7391:
--
Description: 
Currently HoodieMetadataMetrics stores metrics in memory and these metrics are 
not pushed by the metric reporters. The metric reporters are configured within 
Metrics instance. List of changes in the PR:

1. Metrics related classes have been moved from hudi-client-common to 
hudi-common.
2. HoodieMetadataMetrics now uses Metrics class so that all the reporters can 
be supported with it.
3. Some gaps in configs which are added in HoodieMetadataWriteUtils
4. Some metrics-related APIs and functionality have been moved to 
HoodieMetricsConfig. The HoodieWriteConfig APIs now delegate to 
HoodieMetricsConfig for the functionality.


  was:
Currently HoodieMetadataMetrics stores metrics in memory and these metrics are 
not pushed by the metric reporters. The metric reporters are configured within 
Metrics instance. List of changes in the PR:

1. Metrics related classes have been moved from hudi-client-common to 
hudi-common.
2. HoodieMetadataMetrics now uses Metrics class so that all the reporters can 
be supported with it.
3. Some gaps in configs which are added in HoodieMetadataWriteUtils



> HoodieMetadataMetrics should use Metrics instance for metrics registry
> --
>
> Key: HUDI-7391
> URL: https://issues.apache.org/jira/browse/HUDI-7391
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata, metrics
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> Currently HoodieMetadataMetrics stores metrics in memory and these metrics 
> are not pushed by the metric reporters. The metric reporters are configured 
> within Metrics instance. List of changes in the PR:
> 1. Metrics related classes have been moved from hudi-client-common to 
> hudi-common.
> 2. HoodieMetadataMetrics now uses Metrics class so that all the reporters can 
> be supported with it.
> 3. Some gaps in configs which are added in HoodieMetadataWriteUtils
> 4. Some metrics-related APIs and functionality have been moved to 
> HoodieMetricsConfig. The HoodieWriteConfig APIs now delegate to 
> HoodieMetricsConfig for the functionality.





[jira] [Created] (HUDI-7391) HoodieMetadataMetrics should use Metrics instance for metrics registry

2024-02-06 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7391:
-

 Summary: HoodieMetadataMetrics should use Metrics instance for 
metrics registry
 Key: HUDI-7391
 URL: https://issues.apache.org/jira/browse/HUDI-7391
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata, metrics
Reporter: Lokesh Jain
Assignee: Lokesh Jain


Currently HoodieMetadataMetrics stores metrics in memory and these metrics are 
not pushed by the metric reporters. The metric reporters are configured within 
Metrics instance. List of changes in the PR:

1. Metrics related classes have been moved from hudi-client-common to 
hudi-common.
2. HoodieMetadataMetrics now uses Metrics class so that all the reporters can 
be supported with it.
3. Some gaps in configs which are added in HoodieMetadataWriteUtils






[jira] [Resolved] (HUDI-7120) Performance improvements in deltastreamer executor code path

2023-11-23 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HUDI-7120.
---

> Performance improvements in deltastreamer executor code path
> 
>
> Key: HUDI-7120
> URL: https://issues.apache.org/jira/browse/HUDI-7120
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Makes improvements based on findings from CPU profiling for the executor code 
> path.
> 1. Fixes repeated execution of a string split operation
> 2. Reduces the number of validation calls





[jira] [Updated] (HUDI-7120) Performance improvements in deltastreamer executor code path

2023-11-23 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-7120:
--
Fix Version/s: 0.14.1

> Performance improvements in deltastreamer executor code path
> 
>
> Key: HUDI-7120
> URL: https://issues.apache.org/jira/browse/HUDI-7120
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Makes improvements based on findings from CPU profiling for the executor code 
> path.
> 1. Fixes repeated execution of a string split operation
> 2. Reduces the number of validation calls





[jira] [Created] (HUDI-7120) Performance improvements in deltastreamer executor code path

2023-11-17 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7120:
-

 Summary: Performance improvements in deltastreamer executor code 
path
 Key: HUDI-7120
 URL: https://issues.apache.org/jira/browse/HUDI-7120
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain


Makes improvements based on findings from CPU profiling for the executor code 
path.

1. Fixes repeated execution of a string split operation
2. Reduces the number of validation calls
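
The first item can be illustrated generically. The Jira does not say which 
split was repeated, so NestedFieldReader and the dotted field path below are 
assumptions; the sketch only shows the general pattern of splitting once at 
construction time instead of once per record.

```java
import java.util.Map;

// Hypothetical reader: resolves a dotted field path against nested maps.
final class NestedFieldReader {
  // Split once here, rather than calling split() for every record read.
  private final String[] pathParts;

  NestedFieldReader(String dottedPath) {
    this.pathParts = dottedPath.split("\\.");
  }

  // Walks the pre-split path; returns null if any level is missing.
  Object read(Map<String, Object> record) {
    Object current = record;
    for (String part : pathParts) {
      if (!(current instanceof Map)) {
        return null;
      }
      current = ((Map<?, ?>) current).get(part);
    }
    return current;
  }
}
```

On a per-record code path, hoisting the split out of the loop removes a 
regex evaluation and an array allocation for each record.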





[jira] [Assigned] (HUDI-6896) HoodieAvroHFileReader.RecordIterator iteration never terminates

2023-09-26 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain reassigned HUDI-6896:
-

Assignee: Lokesh Jain

> HoodieAvroHFileReader.RecordIterator iteration never terminates
> ---
>
> Key: HUDI-6896
> URL: https://issues.apache.org/jira/browse/HUDI-6896
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses 
> org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first 
> line of the file.
> {code:java}
> if (!scanner.isSeeked()) {
>   hasRecords = scanner.seekTo();
> }
> {code}
> If isSeeked returns false, the scanner seeks to the start of the file.
> After the end of the file is reached, isSeeked still returns false, so the 
> next time hasNext is called it seeks to the start of the file again, leading 
> to an infinite loop.
> Documentation for HFileScanner#isSeeked: 
> True if scanner has had one of the seek calls invoked; i.e. seekBefore(Cell) 
> or seekTo() or seekTo(Cell). Otherwise returns false.





[jira] [Updated] (HUDI-6896) HoodieAvroHFileReader.RecordIterator iteration never terminates

2023-09-26 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6896:
--
Description: 
org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses 
org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first 
line of the file.
{code:java}
if (!scanner.isSeeked()) {
  hasRecords = scanner.seekTo();
}
{code}
If isSeeked returns false, the scanner seeks to the start of the file.

After the end of the file is reached, isSeeked still returns false, so the 
next time hasNext is called it seeks to the start of the file again, leading 
to an infinite loop.

Documentation for HFileScanner#isSeeked: 
True if scanner has had one of the seek calls invoked; i.e. seekBefore(Cell) or 
seekTo() or seekTo(Cell). Otherwise returns false.

  was:org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext 
uses org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the 
first line of the file.


> HoodieAvroHFileReader.RecordIterator iteration never terminates
> ---
>
> Key: HUDI-6896
> URL: https://issues.apache.org/jira/browse/HUDI-6896
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>
> org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses 
> org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first 
> line of the file.
> {code:java}
> if (!scanner.isSeeked()) {
>   hasRecords = scanner.seekTo();
> }
> {code}
> If isSeeked returns false, the scanner seeks to the start of the file.
> After the end of the file is reached, isSeeked still returns false, so the 
> next time hasNext is called it seeks to the start of the file again, leading 
> to an infinite loop.
> Documentation for HFileScanner#isSeeked: 
> True if scanner has had one of the seek calls invoked; i.e. seekBefore(Cell) 
> or seekTo() or seekTo(Cell). Otherwise returns false.
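
One possible way to avoid the loop is for the iterator to track its own 
state instead of asking the scanner. The sketch below is not the Hudi fix: 
FakeScanner is a stand-in that models the behavior described above (isSeeked 
is false again once the end is passed), and RecordIterator with its 
started/exhausted flags is an assumption about how a fix could look.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class SeekOnceIteratorSketch {
  // Stand-in scanner: isSeeked() is false before seeking and after the end.
  static final class FakeScanner {
    private final List<String> rows;
    private int pos = -1;

    FakeScanner(List<String> rows) { this.rows = rows; }

    boolean isSeeked() { return pos >= 0 && pos < rows.size(); }
    boolean seekTo() { pos = 0; return !rows.isEmpty(); }
    boolean next() { pos++; return pos < rows.size(); }
    String current() { return rows.get(pos); }
  }

  // Iterator that seeks exactly once and remembers when it is exhausted.
  static final class RecordIterator implements Iterator<String> {
    private final FakeScanner scanner;
    private boolean started = false;
    private boolean exhausted = false;
    private String nextRow = null;

    RecordIterator(FakeScanner scanner) { this.scanner = scanner; }

    @Override
    public boolean hasNext() {
      if (nextRow != null) {
        return true;
      }
      if (exhausted) {
        return false; // never seek back to the start after the end
      }
      boolean positioned;
      if (!started) {
        started = true;
        positioned = scanner.seekTo();
      } else {
        positioned = scanner.next();
      }
      if (!positioned) {
        exhausted = true;
        return false;
      }
      nextRow = scanner.current();
      return true;
    }

    @Override
    public String next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      String row = nextRow;
      nextRow = null;
      return row;
    }
  }
}
```

Because the decision to seek depends on the iterator's own started flag, a 
scanner that reports isSeeked as false after the last row can no longer 
restart the iteration.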





[jira] [Created] (HUDI-6896) HoodieAvroHFileReader.RecordIterator iteration never terminates

2023-09-26 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6896:
-

 Summary: HoodieAvroHFileReader.RecordIterator iteration never 
terminates
 Key: HUDI-6896
 URL: https://issues.apache.org/jira/browse/HUDI-6896
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain


org.apache.hudi.io.storage.HoodieAvroHFileReader.RecordIterator#hasNext uses 
org.apache.hadoop.hbase.io.hfile.HFileScanner#isSeeked to seek to the first 
line of the file.





[jira] [Created] (HUDI-6890) Fail fast or auto determine record type during read

2023-09-22 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6890:
-

 Summary: Fail fast or auto determine record type during read
 Key: HUDI-6890
 URL: https://issues.apache.org/jira/browse/HUDI-6890
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain


HUDI-6753 fixes some tests where the read was failing because the record merger 
implementation config was not set during the read 
(HoodieWriteConfig.RECORD_MERGER_IMPLS.key() -> 
HoodieSparkRecordMerger.class.getName()). The failure occurred midway through 
the parquet read. Please check the mentioned Jira for more details on the 
stack trace.

This jira aims to make a change where either the read fails fast or the merger 
implementation is auto determined during read.





[jira] [Created] (HUDI-6889) Improve usability of bulk insert with insert overwrite operations in Spark Datasource

2023-09-22 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6889:
-

 Summary: Improve usability of bulk insert with insert overwrite 
operations in Spark Datasource
 Key: HUDI-6889
 URL: https://issues.apache.org/jira/browse/HUDI-6889
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain


Currently, to use bulk insert with insert overwrite operations in Spark 
Datasource, the user has to set 
`hoodie.bulkinsert.overwrite.operation.type` to 
insert_overwrite_table(OVERWRITE) or insert_overwrite(APPEND) and use the 
corresponding save modes. Since this is an internal config, it should not be 
exposed. The Jira aims to find an easier way to support the feature for 
users, through a new config or a different config altogether.
One idea is to deprecate hoodie.spark.sql.insert.into.operation and create a 
new config which can be shared by both sql and datasource.





[jira] [Created] (HUDI-6887) Add test for Record Index and MIT queries

2023-09-21 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6887:
-

 Summary: Add test for Record Index and MIT queries
 Key: HUDI-6887
 URL: https://issues.apache.org/jira/browse/HUDI-6887
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain


The Jira adds tests for RLI and MIT queries, with checks validating that 
data skipping occurs.





[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-12 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764274#comment-17764274
 ] 

Lokesh Jain commented on HUDI-6820:
---

After enabling more debug logs, created a few PRs.
https://github.com/apache/hudi/actions/runs/6158416987/job/16711145162?pr=9690
I can see that all methods annotated with @After complete before the timeout.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-12 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764273#comment-17764273
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created PRs [https://github.com/apache/hudi/pull/9676] to 9678 after 
disabling TestHoodieCombineHiveInputFormat.

Usually I see a timeout in hudi-java-client after disabling this test. 
Otherwise we see timeouts in both hadoop-mr and java-client.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-11 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763849#comment-17763849
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created [https://github.com/apache/hudi/pull/9683] after enabling debug logs 
for Maven. Only the newly added GitHub check would run here. It should 
ideally print out more logs.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-11 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763846#comment-17763846
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created [https://github.com/apache/hudi/pull/9682] to add a 40-minute timeout 
for the hudi-hadoop-mr check. The GitHub check currently takes 6 hours to time 
out.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-11 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763838#comment-17763838
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created PRs https://github.com/apache/hudi/pull/9676 to 9678 after disabling 
TestHoodieCombineHiveInputFormat. Was seeing flakiness in the GitHub check for 
the separated hudi-mr and java-client modules. Out of 3 PRs, 2 passed but 1 
timed out.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Closed] (HUDI-6833) Add field for tracking log files from failed commit in rollback metadata

2023-09-10 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain closed HUDI-6833.
-
Fix Version/s: 0.14.0
   Resolution: Fixed

> Add field for tracking log files from failed commit in rollback metadata
> 
>
> Key: HUDI-6833
> URL: https://issues.apache.org/jira/browse/HUDI-6833
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The Jira aims to add a field for tracking log files from a failed commit in 
> the rollback metadata. The fix for using this field would be tracked in 
> HUDI-6761.
>  





[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-08 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763180#comment-17763180
 ] 

Lokesh Jain commented on HUDI-6820:
---

Also created PRs [https://github.com/apache/hudi/pull/9654/files] to 9657 for 
moving hudi-hadoop-mr and hudi-java-client to a separate job. Timeout errors 
are still seen in the UT FT jobs.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-08 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763174#comment-17763174
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created a PR to move hudi-hadoop-mr and hudi-java-client to github action. 
[https://github.com/apache/hudi/pull/9661]

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-08 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763175#comment-17763175
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created a PR which disables deltastreamer continuous mode tests. 
[https://github.com/apache/hudi/pull/9662]

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-08 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763171#comment-17763171
 ] 

Lokesh Jain commented on HUDI-6820:
---

PRs [https://github.com/apache/hudi/pull/9604] to 9607 were run on the 0.13.0 
branch. Out of a total of 8 PRs, 4 failed with a timeout and the other 4 were 
cancelled because of other critical PRs. The issue is reproducible with 0.13.0.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-6833) Add field for tracking log files from failed commit in rollback metadata

2023-09-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6833:
--
Description: 
The Jira aims to add a field for tracking log files from a failed commit in 
the rollback metadata. The fix for using this field would be tracked in 
HUDI-6761.

 

  was:The Jira aims to add a field for tracking log files from a failed commit 
in the rollback metadata. The fix for using this field would be tracked in 
HUDI-6758.


> Add field for tracking log files from failed commit in rollback metadata
> 
>
> Key: HUDI-6833
> URL: https://issues.apache.org/jira/browse/HUDI-6833
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> The Jira aims to add a field for tracking log files from a failed commit in 
> the rollback metadata. The fix for using this field would be tracked in 
> HUDI-6761.
>  





[jira] [Created] (HUDI-6833) Add field for tracking log files from failed commit in rollback metadata

2023-09-08 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6833:
-

 Summary: Add field for tracking log files from failed commit in 
rollback metadata
 Key: HUDI-6833
 URL: https://issues.apache.org/jira/browse/HUDI-6833
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain


The Jira aims to add a field for tracking log files from a failed commit in 
the rollback metadata. The fix for using this field would be tracked in 
HUDI-6758.





[jira] [Comment Edited] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-07 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762835#comment-17762835
 ] 

Lokesh Jain edited comment on HUDI-6820 at 9/7/23 6:31 PM:
---

First known occurrence of timeout issue (12th June) - 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=17760=results]


was (Author: ljain):
First known occurrence of timeout issue (28th June) - 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=results]

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-07 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762835#comment-17762835
 ] 

Lokesh Jain commented on HUDI-6820:
---

First known occurrence of timeout issue (28th June) - 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=results]

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-07 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762734#comment-17762734
 ] 

Lokesh Jain commented on HUDI-6820:
---

The maven version is 3.8.8 and has not changed since July.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Created] (HUDI-6830) Fix downgrade from version six for partially failed commits

2023-09-07 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6830:
-

 Summary: Fix downgrade from version six for partially failed 
commits
 Key: HUDI-6830
 URL: https://issues.apache.org/jira/browse/HUDI-6830
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain


With the new version six, if a table has pending commits, these commits 
should be rolled back during downgrade so that files created using the new 
format are cleaned up properly. The Jira aims to fix the downgrade handler to 
support this step.





[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762470#comment-17762470
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created [https://github.com/apache/hudi/pull/9604] , PRs 9605-9609, 
[https://github.com/apache/hudi/pull/9632] and 
[https://github.com/apache/hudi/pull/9633]  for testing Azure CI on 0.13.0 hudi 
branch.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762467#comment-17762467
 ] 

Lokesh Jain commented on HUDI-6820:
---

Created [https://github.com/apache/hudi/pull/9627] 
and [https://github.com/apache/hudi/pull/9628] with 
{{TestHoodieRealtimeRecordReader}} disabled.

UT FT timed out after 4 hours in both the PRs above.

FT client/spark-client also timed out in the CI run

[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19671=logs=7601efb9-4019-552e-11ba-eb31b66593b2=9688f101-287d-53f4-2a80-87202516f5d0]

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762465#comment-17762465
 ] 

Lokesh Jain commented on HUDI-6820:
---

Found a Microsoft developer community ticket for a similar issue where Azure 
CI times out. 
[https://developercommunity.visualstudio.com/t/errorthe-operation-was-canceled-azure-devops-build/692048?viewtype=all]
 
 
 

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Comment Edited] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762465#comment-17762465
 ] 

Lokesh Jain edited comment on HUDI-6820 at 9/6/23 5:29 PM:
---

Found a Microsoft Developer Community ticket for a similar issue, where an 
Azure CI build times out: 
[https://developercommunity.visualstudio.com/t/errorthe-operation-was-canceled-azure-devops-build/692048?viewtype=all]


was (Author: ljain):
Found a Microsoft Developer Community ticket for a similar issue, where an 
Azure CI build times out: 
[https://developercommunity.visualstudio.com/t/errorthe-operation-was-canceled-azure-devops-build/692048?viewtype=all]
 
 
 

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762464#comment-17762464
 ] 

Lokesh Jain commented on HUDI-6820:
---

Usually the error message is of the form:
{code:java}
##[error]We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
{code}
Or of the form:
{code:java}
The job running on agent Azure Pipelines 7 ran longer than the maximum time of 150 minutes.
{code}
This could be an actual timeout, where the tests really ran for 150 minutes. 
But in many cases the raw logs show:
{code:java}
2023-07-28T06:01:14.7898970Z     at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_372]
2023-07-28T06:01:14.7899483Z     at org.apache.hudi.metrics.Metrics.shutdown(Metrics.java:116) ~[hudi-client-common-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
2023-07-28T06:01:14.7899862Z     at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
2023-07-28T07:27:41.0195722Z ##[error]The operation was canceled.
2023-07-28T07:27:41.0242402Z ##[section]Finishing: FT client/spark-client
{code}
There is a gap of more than an hour between the last test output and the 
cancellation of the operation.
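
The size of that gap can be checked directly from the raw log timestamps. The snippet below is only a sketch: the helper names {{parse_azure_timestamp}} and {{idle_gap}} are made up for illustration, and it assumes the timestamps are UTC with up to seven fractional digits, as in the log excerpt above.

```python
from datetime import datetime, timedelta, timezone


def parse_azure_timestamp(stamp: str) -> datetime:
    """Parse a raw-log timestamp such as 2023-07-28T06:01:14.7899862Z.

    The log carries 7 fractional digits, but datetime only keeps
    microseconds, so the fraction is truncated to 6 digits.
    """
    body = stamp.rstrip("Z")
    seconds, _, fraction = body.partition(".")
    micros = (fraction + "000000")[:6]
    parsed = datetime.strptime(f"{seconds}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def idle_gap(last_output: str, cancelled: str) -> timedelta:
    """Time between the last log line of the tests and the cancellation."""
    return parse_azure_timestamp(cancelled) - parse_azure_timestamp(last_output)


# Timestamps taken from the raw log excerpt above.
gap = idle_gap("2023-07-28T06:01:14.7899862Z", "2023-07-28T07:27:41.0195722Z")
print(gap)
```

For this excerpt the job produced no output for roughly 1 hour 26 minutes before it was cancelled.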

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Comment Edited] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762463#comment-17762463
 ] 

Lokesh Jain edited comment on HUDI-6820 at 9/6/23 5:24 PM:
---

Some older runs where a similar issue can be seen:
FT client/spark-client:
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18864=logs=7601efb9-4019-552e-11ba-eb31b66593b2]
 
UT FT other modules:
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0]


was (Author: ljain):
Some older runs where similar issue can be seen:
FT client/spark-client:
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0]
 
UT FT other modules
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0]

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762463#comment-17762463
 ] 

Lokesh Jain commented on HUDI-6820:
---

Some older runs where similar issue can be seen:
FT client/spark-client:
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0]
 
UT FT other modules
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18152=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0]

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762462#comment-17762462
 ] 

Lokesh Jain commented on HUDI-6820:
---

For some runs, such as attempts 2 and 3 of 
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19384=logs=ba200224-5437-5e21-2643-114ac65587f4],
 the timeout occurs after 87 tests have run.

For others, such as 
[https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19500/logs/7],
 the timeout occurs after 15 tests have run.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Commented] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762461#comment-17762461
 ] 

Lokesh Jain commented on HUDI-6820:
---

Siva tried disabling tests in the following PRs to see whether they were causing the timeouts:

[https://github.com/apache/hudi/pull/9543]

[https://github.com/apache/hudi/pull/9550/files]

[https://github.com/apache/hudi/pull/9542]

[https://github.com/apache/hudi/pull/9551]

But the timeouts still occurred with those tests disabled.

> Fix Azure CI timeout for UT FT other modules
> 
>
> Key: HUDI-6820
> URL: https://issues.apache.org/jira/browse/HUDI-6820
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Priority: Major
>






[jira] [Created] (HUDI-6820) Fix Azure CI timeout for UT FT other modules

2023-09-06 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6820:
-

 Summary: Fix Azure CI timeout for UT FT other modules
 Key: HUDI-6820
 URL: https://issues.apache.org/jira/browse/HUDI-6820
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain








[jira] [Assigned] (HUDI-6753) Fix parquet inline reading flaky test

2023-09-05 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain reassigned HUDI-6753:
-

Assignee: Lokesh Jain

> Fix parquet inline reading flaky test
> -
>
> Key: HUDI-6753
> URL: https://issues.apache.org/jira/browse/HUDI-6753
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Major
>
> Sometimes we see some flakiness around parquet inline reading. 
>  
> Ref: 
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19457/logs/8]
>  
>  
> {code:java}
> 2023-08-25T05:00:14.1359469Z 1389627 [Executor task launch worker for task 
> 1.0 in stage 4124.0 (TID 5621)] ERROR 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Got 
> exception when reading log file
> 2023-08-25T05:00:14.1360427Z org.apache.hudi.exception.HoodieException: 
> unable to read next record from parquet file 
> 2023-08-25T05:00:14.1361525Z  at 
> org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1362403Z  at 
> org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1363340Z  at 
> org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1364854Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:625)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1365985Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:667)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1367473Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:362)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1368371Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1369127Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:201)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1369901Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:117)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1370633Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:76)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1371380Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:466)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1372312Z  at 
> org.apache.hudi.LogFileIterator$.scanLog(Iterators.scala:371) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1372915Z  at 
> org.apache.hudi.LogFileIterator.<init>(Iterators.scala:110) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1373549Z  at 
> org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:201) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1374172Z  at 
> org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:212) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1374809Z  at 
> org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:217) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1375480Z  at 
> org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:109) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1376156Z  at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) 
> ~[spark-core_2.12-3.2.3.jar:3.2.3]
> 2023-08-25T05:00:14.1376653Z  at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) 
> ~[spark-core_2.12-3.2.3.jar:3.2.3]
> 2023-08-25T05:00:14.1377283Z  at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> ~[spark-core_2.12-3.2.3.jar:3.2.3]
> 2023-08-25T05:00:14.1377837Z  at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) 
> ~[spark-core_2.12-3.2.3.jar:3.2.3]
> 2023-08-25T05:00:14.1378323Z  at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) 
> 

[jira] [Commented] (HUDI-6753) Fix parquet inline reading flaky test

2023-09-01 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761435#comment-17761435
 ] 

Lokesh Jain commented on HUDI-6753:
---

{{org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport#init}} 
creates a new Parquet schema after converting the struct type fields for 
DECIMAL(10,6), but 
{{org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport#init}} 
does not do a similar conversion while reading, which leads to the read error.

> Fix parquet inline reading flaky test
> -
>
> Key: HUDI-6753
> URL: https://issues.apache.org/jira/browse/HUDI-6753
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>
> Sometimes we see some flakiness around parquet inline reading. 
>  
> Ref: 
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19457/logs/8]
>  
>  
> {code:java}
> 2023-08-25T05:00:14.1359469Z 1389627 [Executor task launch worker for task 
> 1.0 in stage 4124.0 (TID 5621)] ERROR 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Got 
> exception when reading log file
> 2023-08-25T05:00:14.1360427Z org.apache.hudi.exception.HoodieException: 
> unable to read next record from parquet file 
> 2023-08-25T05:00:14.1361525Z  at 
> org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1362403Z  at 
> org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1363340Z  at 
> org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1364854Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:625)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1365985Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:667)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1367473Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:362)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1368371Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1369127Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:201)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1369901Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:117)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1370633Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:76)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1371380Z  at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:466)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1372312Z  at 
> org.apache.hudi.LogFileIterator$.scanLog(Iterators.scala:371) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1372915Z  at 
> org.apache.hudi.LogFileIterator.<init>(Iterators.scala:110) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1373549Z  at 
> org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:201) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1374172Z  at 
> org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:212) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1374809Z  at 
> org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:217) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1375480Z  at 
> org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:109) 
> ~[hudi-spark-common_2.12-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1376156Z  at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) 
> ~[spark-core_2.12-3.2.3.jar:3.2.3]
> 2023-08-25T05:00:14.1376653Z  at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) 
> ~[spark-core_2.12-3.2.3.jar:3.2.3]
> 2023-08-25T05:00:14.1377283Z  at 
> 

[jira] [Commented] (HUDI-6753) Fix parquet inline reading flaky test

2023-09-01 Thread Lokesh Jain (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761294#comment-17761294
 ] 

Lokesh Jain commented on HUDI-6753:
---

The stack trace is always logged for this test, even when the test passes.

It shows that the file schema and the requested schema differ while reading 
Parquet. The file schema is the schema stored in the Parquet file, and the 
requested schema is the schema used for the read.

File schema:

 
{code:java}
message spark_schema {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required int64 timestamp;
  required binary _row_key (STRING);
  required binary partition_path (STRING);
  required binary trip_type (STRING);
  required binary rider (STRING);
  required binary driver (STRING);
  required double begin_lat;
  required double begin_lon;
  required double end_lat;
  required double end_lon;
  required int32 distance_in_meters;
  required int64 seconds_since_epoch;
  required float weight;
  required binary nation;
  required int32 current_date (DATE);
  required int64 current_ts;
  required int64 height (DECIMAL(10,6));
  required group city_to_state (MAP) {
    repeated group key_value {
      required binary key (STRING);
      required binary value (STRING);
    }
  }
  required group fare {
    required double amount;
    required binary currency (STRING);
  }
  required group tip_history (LIST) {
    repeated group list {
      required group element {
        required double amount;
        required binary currency (STRING);
      }
    }
  }
  required boolean _hoodie_is_deleted;
  required double haversine_distance;
}
{code}
 


Requested schema:
{code:java}
message triprec {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required int64 timestamp;
  required binary _row_key (STRING);
  required binary partition_path (STRING);
  required binary trip_type (STRING);
  required binary rider (STRING);
  required binary driver (STRING);
  required double begin_lat;
  required double begin_lon;
  required double end_lat;
  required double end_lon;
  required int32 distance_in_meters;
  required int64 seconds_since_epoch;
  required float weight;
  required binary nation;
  required int32 current_date (DATE);
  required int64 current_ts;
  required fixed_len_byte_array(5) height (DECIMAL(10,6));
  required group city_to_state (MAP) {
    repeated group key_value {
      required binary key (STRING);
      required binary value (STRING);
    }
  }
  required group fare {
    required double amount;
    required binary currency (STRING);
  }
  required group tip_history (LIST) {
    repeated group list {
      required group element {
        required double amount;
        required binary currency (STRING);
      }
    }
  }
  required boolean _hoodie_is_deleted;
  required double haversine_distance;
}
{code}
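
Apart from the message name, the two schemas above differ only in the physical type of {{height}}: int64 in the file schema and fixed_len_byte_array(5) in the requested schema, both annotated DECIMAL(10,6). The sketch below shows where those two encodings could come from. The Parquet format allows a DECIMAL to be stored as int32 (precision up to 9), int64 (precision up to 18) or a fixed-length byte array; the function names are made up for illustration, and the sketch assumes the writer picked the compact integer encoding while the reader side derived a fixed-length encoding.

```python
def min_fixed_len_bytes(precision: int) -> int:
    """Smallest byte count whose signed two's-complement range can hold
    the largest unscaled value of the given precision (10**precision - 1)."""
    max_unscaled = 10 ** precision - 1
    num_bytes = 1
    while 2 ** (8 * num_bytes - 1) - 1 < max_unscaled:
        num_bytes += 1
    return num_bytes


def compact_physical_type(precision: int) -> str:
    """Physical type when integers are preferred for small precisions
    (thresholds of 9 and 18 digits, as permitted by the Parquet format)."""
    if precision <= 9:
        return "int32"
    if precision <= 18:
        return "int64"
    return f"fixed_len_byte_array({min_fixed_len_bytes(precision)})"


# DECIMAL(10,6): the scale does not affect the storage width, only the precision does.
print(compact_physical_type(10))
print(f"fixed_len_byte_array({min_fixed_len_bytes(10)})")
```

For precision 10 this yields int64 on one side and a 5-byte fixed-length array on the other, which matches the two {{height}} declarations above.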
 

> Fix parquet inline reading flaky test
> -
>
> Key: HUDI-6753
> URL: https://issues.apache.org/jira/browse/HUDI-6753
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>
> Sometimes we see some flakiness around parquet inline reading. 
>  
> Ref: 
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19457/logs/8]
>  
>  
> {code:java}
> 2023-08-25T05:00:14.1359469Z 1389627 [Executor task launch worker for task 
> 1.0 in stage 4124.0 (TID 5621)] ERROR 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Got 
> exception when reading log file
> 2023-08-25T05:00:14.1360427Z org.apache.hudi.exception.HoodieException: 
> unable to read next record from parquet file 
> 2023-08-25T05:00:14.1361525Z  at 
> org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1362403Z  at 
> org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1363340Z  at 
> org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1364854Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:625)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1365985Z  at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:667)
>  ~[hudi-common-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
> 2023-08-25T05:00:14.1367473Z  at 
> 

[jira] [Updated] (HUDI-6621) Add a downgrade step from 6 to 5 to detect new delete blocks

2023-08-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6621:
--
Description: 
In table version 6, we introduce a new delete block format (v3) with Avro serde 
(HUDI-5760).  For downgrading a table from v6 to v5, we need to perform 
compaction to handle v3 delete blocks created using the new format.
Also with the addition of record index field in Metadata table schema, the 
downgrade needs to delete the metadata table to avoid column drop errors after 
downgrade.

  was:In table version 6, we introduce a new delete block format (v3) with Avro 
serde (HUDI-5760).  For downgrading a table from v6 to v5, we need to perform 
compaction to handle v3 delete blocks created using the new format.


> Add a downgrade step from 6 to 5 to detect new delete blocks
> 
>
> Key: HUDI-6621
> URL: https://issues.apache.org/jira/browse/HUDI-6621
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> In table version 6, we introduce a new delete block format (v3) with Avro 
> serde (HUDI-5760).  For downgrading a table from v6 to v5, we need to perform 
> compaction to handle v3 delete blocks created using the new format.
> Also with the addition of record index field in Metadata table schema, the 
> downgrade needs to delete the metadata table to avoid column drop errors 
> after downgrade.





[jira] [Updated] (HUDI-6621) Add a downgrade step from 6 to 5 to detect new delete blocks

2023-08-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6621:
--
Description: In table version 6, we introduce a new delete block format 
(v3) with Avro serde (HUDI-5760).  For downgrading a table from v6 to v5, we 
need to perform compaction to handle v3 delete blocks created using the new 
format.  (was: In table version 6, we introduce a new delete block format (v3) 
with Avro serde (HUDI-5760).  For downgrading a table from v6 to v5, we need to 
check any v3 delete blocks using the new format and ask user to manually 
restore to a commit before any file slice with a v3 delete block.)

> Add a downgrade step from 6 to 5 to detect new delete blocks
> 
>
> Key: HUDI-6621
> URL: https://issues.apache.org/jira/browse/HUDI-6621
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.14.0
>
>
> In table version 6, we introduce a new delete block format (v3) with Avro 
> serde (HUDI-5760).  For downgrading a table from v6 to v5, we need to perform 
> compaction to handle v3 delete blocks created using the new format.





[jira] [Closed] (HUDI-6717) Fix downgrade handler for 0.14.0

2023-08-19 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain closed HUDI-6717.
-
Resolution: Duplicate

> Fix downgrade handler for 0.14.0
> 
>
> Key: HUDI-6717
> URL: https://issues.apache.org/jira/browse/HUDI-6717
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Since the log block version was upgraded in 0.14.0 (due to the delete block 
> change), the delete blocks cannot be read in 0.13.0 or earlier.
> Similarly, the addition of the record level index field in the metadata table 
> leads to a column drop error on downgrade. This Jira aims to fix the downgrade 
> handler to trigger compaction and delete the metadata table if the user wishes 
> to downgrade from table version 6 (0.14.0) to table version 5 (0.13.0).





[jira] [Created] (HUDI-6726) Fix connection leaks related to file reader and iterator close

2023-08-18 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6726:
-

 Summary: Fix connection leaks related to file reader and iterator 
close
 Key: HUDI-6726
 URL: https://issues.apache.org/jira/browse/HUDI-6726
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata, reader-core
Reporter: Lokesh Jain


This Jira aims to fix connection leaks caused by not closing file readers and 
the iterators used to iterate over the records.
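
A minimal sketch of the pattern this is about, not Hudi code: {{FakeFileReader}} and {{FakeRecordIterator}} are hypothetical stand-ins, and the point is only that both the reader and the iterator obtained from it get closed on every path, including an early return. In Java the equivalent would be try-with-resources.

```python
from contextlib import closing


class FakeRecordIterator:
    """Hypothetical iterator that must be closed to release its resources."""

    def __init__(self, records):
        self._records = iter(records)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._records)

    def close(self):
        self.closed = True


class FakeFileReader:
    """Hypothetical stand-in for a file reader that holds a connection."""

    def __init__(self, records):
        self._records = list(records)
        self.iterators = []
        self.closed = False

    def record_iterator(self):
        iterator = FakeRecordIterator(self._records)
        self.iterators.append(iterator)
        return iterator

    def close(self):
        self.closed = True


def read_first_record(reader):
    # closing() calls close() on exit, even on an early return or an exception.
    # Exit order is the reverse of entry: the iterator closes first, then the reader.
    with closing(reader), closing(reader.record_iterator()) as records:
        return next(records, None)


reader = FakeFileReader(["r1", "r2", "r3"])
first = read_first_record(reader)
print(first, reader.closed, reader.iterators[0].closed)
```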





[jira] [Created] (HUDI-6717) Fix downgrade handler for 0.14.0

2023-08-17 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6717:
-

 Summary: Fix downgrade handler for 0.14.0
 Key: HUDI-6717
 URL: https://issues.apache.org/jira/browse/HUDI-6717
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain


Since the log block version was upgraded in 0.14.0 (due to the delete block 
change), the delete blocks cannot be read in 0.13.0 or earlier.
Similarly, the addition of the record level index field in the metadata table 
leads to a column drop error on downgrade. This Jira aims to fix the downgrade 
handler to trigger compaction and delete the metadata table if the user wishes 
to downgrade from table version 6 (0.14.0) to table version 5 (0.13.0).





[jira] [Closed] (HUDI-6677) Make HoodieRecordIndexInfo schema compatible with older versions

2023-08-10 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain closed HUDI-6677.
-
Resolution: Not A Problem

> Make HoodieRecordIndexInfo schema compatible with older versions
> 
>
> Key: HUDI-6677
> URL: https://issues.apache.org/jira/browse/HUDI-6677
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently the metadata payload schema for the record index can cause schema 
> evolution issues for existing Hudi tables. This Jira aims to fix these issues. 
> There are two schema evolution issues:
> 1. The field name has changed from partition to partitionName.
> 2. A new field, fileId, has been added in the middle of a nested schema.





[jira] [Created] (HUDI-6677) Make HoodieRecordIndexInfo schema compatible with older versions

2023-08-10 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6677:
-

 Summary: Make HoodieRecordIndexInfo schema compatible with older 
versions
 Key: HUDI-6677
 URL: https://issues.apache.org/jira/browse/HUDI-6677
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: Lokesh Jain


Currently the metadata payload schema for the record index can cause schema 
evolution issues for existing Hudi tables. This Jira aims to fix these issues. 
There are two schema evolution issues:
1. The field name has changed from partition to partitionName.
2. A new field, fileId, has been added in the middle of a nested schema.
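
To illustrate why a rename and an added field are the two cases to worry about, here is a simplified sketch of the Avro record resolution rule: reader fields are matched to writer fields by name (or by a reader-side alias), and a reader field with no match needs a default. This is not Hudi's or Avro's actual code; {{unresolved_reader_fields}} and {{existingField}} are made up for the example, and only partition, partitionName and fileId come from the description above.

```python
def unresolved_reader_fields(writer_fields, reader_fields):
    """Return the reader fields that cannot be resolved against the writer.

    writer_fields: field names present in data written with the old schema.
    reader_fields: dicts with "name" and optionally "aliases" and "default".
    """
    written = set(writer_fields)
    unresolved = []
    for field in reader_fields:
        candidate_names = [field["name"], *field.get("aliases", [])]
        if any(name in written for name in candidate_names):
            continue  # matched by name or alias
        if "default" in field:
            continue  # missing in old data, but the default fills it in
        unresolved.append(field["name"])
    return unresolved


# Old data: "partition" plus one hypothetical unchanged field.
old_writer = ["partition", "existingField"]

# New schema without aliases or defaults: both changes break old data.
strict_reader = [
    {"name": "partitionName"},
    {"name": "fileId"},
    {"name": "existingField"},
]

# New schema with an alias for the rename and a default for the new field.
tolerant_reader = [
    {"name": "partitionName", "aliases": ["partition"]},
    {"name": "fileId", "default": ""},
    {"name": "existingField"},
]

print(unresolved_reader_fields(old_writer, strict_reader))
print(unresolved_reader_fields(old_writer, tolerant_reader))
```

Because matching is by name, the position of the new field inside the nested record does not matter for this check; what matters is whether it has a default.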





[jira] [Resolved] (HUDI-6459) Add Rollback and other tests for Record Level Index

2023-08-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HUDI-6459.
---

> Add Rollback and other tests for Record Level Index
> ---
>
> Key: HUDI-6459
> URL: https://issues.apache.org/jira/browse/HUDI-6459
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> The Jira aims to add validation for rollback with record level index. The 
> validation is added in TestRecordLevelIndex test.





[jira] [Updated] (HUDI-6393) Add more RLI tests and fix HoodieTestTable accordingly

2023-08-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6393:
--
Fix Version/s: 0.14.0

> Add more RLI tests and fix HoodieTestTable accordingly
> --
>
> Key: HUDI-6393
> URL: https://issues.apache.org/jira/browse/HUDI-6393
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> https://github.com/apache/hudi/pull/8758#discussion_r1213866286





[jira] [Resolved] (HUDI-6393) Add more RLI tests and fix HoodieTestTable accordingly

2023-08-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HUDI-6393.
---

> Add more RLI tests and fix HoodieTestTable accordingly
> --
>
> Key: HUDI-6393
> URL: https://issues.apache.org/jira/browse/HUDI-6393
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/pull/8758#discussion_r1213866286





[jira] [Updated] (HUDI-6459) Add Rollback and other tests for Record Level Index

2023-08-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6459:
--
Fix Version/s: 0.14.0

> Add Rollback and other tests for Record Level Index
> ---
>
> Key: HUDI-6459
> URL: https://issues.apache.org/jira/browse/HUDI-6459
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> This Jira aims to add validation for rollback with the record level index.
> The validation is added in the TestRecordLevelIndex test.





[jira] [Resolved] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled

2023-08-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HUDI-6660.
---

> For merge into use primary key constraint when optimized writes are enabled
> ---
>
> Key: HUDI-6660
> URL: https://issues.apache.org/jira/browse/HUDI-6660
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently merge into fails when the primary key is a complex key and the join
> conditions do not include all the primary key columns. The Jira aims to
> apply the constraint only when optimized writes are enabled. With
> optimized writes disabled, merge into doesn't update the records if the
> primary key does not match completely.





[jira] [Updated] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled

2023-08-08 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6660:
--
Fix Version/s: 0.14.0

> For merge into use primary key constraint when optimized writes are enabled
> ---
>
> Key: HUDI-6660
> URL: https://issues.apache.org/jira/browse/HUDI-6660
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently merge into fails when the primary key is a complex key and the join
> conditions do not include all the primary key columns. The Jira aims to
> apply the constraint only when optimized writes are enabled. With
> optimized writes disabled, merge into doesn't update the records if the
> primary key does not match completely.





[jira] [Updated] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled

2023-08-07 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6660:
--
Description: Currently merge into fails when the primary key is a complex key
and the join conditions do not include all the primary key columns. The Jira
aims to apply the constraint only when optimized writes are enabled. With
optimized writes disabled, merge into doesn't update the records if the primary
key does not match completely.  (was: Currently merge into fails when the
primary key is a complex key and the join conditions do not include all the
primary key columns. The Jira aims to relax this requirement to allow joins
even on a subset of the primary key columns.)

> For merge into use primary key constraint when optimized writes are enabled
> ---
>
> Key: HUDI-6660
> URL: https://issues.apache.org/jira/browse/HUDI-6660
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently merge into fails when the primary key is a complex key and the join
> conditions do not include all the primary key columns. The Jira aims to
> apply the constraint only when optimized writes are enabled. With
> optimized writes disabled, merge into doesn't update the records if the
> primary key does not match completely.





[jira] [Updated] (HUDI-6660) For merge into use primary key constraint when optimized writes are enabled

2023-08-07 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6660:
--
Summary: For merge into use primary key constraint when optimized writes 
are enabled  (was: Primary key constraint should be applicable when optimized 
writes are enabled)

> For merge into use primary key constraint when optimized writes are enabled
> ---
>
> Key: HUDI-6660
> URL: https://issues.apache.org/jira/browse/HUDI-6660
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently merge into fails when the primary key is a complex key and the join
> conditions do not include all the primary key columns. The Jira aims to relax
> this requirement to allow joins even on a subset of the primary key columns.





[jira] [Updated] (HUDI-6660) Primary key constraint should be applicable when optimized writes are enabled

2023-08-07 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6660:
--
Summary: Primary key constraint should be applicable when optimized writes 
are enabled  (was: Relax primary key constraint for merge into join condition)

> Primary key constraint should be applicable when optimized writes are enabled
> -
>
> Key: HUDI-6660
> URL: https://issues.apache.org/jira/browse/HUDI-6660
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently merge into fails when the primary key is a complex key and the join
> conditions do not include all the primary key columns. The Jira aims to relax
> this requirement to allow joins even on a subset of the primary key columns.





[jira] [Created] (HUDI-6660) Relax primary key constraint for merge into join condition

2023-08-07 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6660:
-

 Summary: Relax primary key constraint for merge into join condition
 Key: HUDI-6660
 URL: https://issues.apache.org/jira/browse/HUDI-6660
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Lokesh Jain


Currently merge into fails when the primary key is a complex key and the join
conditions do not include all the primary key columns. The Jira aims to relax
this requirement to allow joins even on a subset of the primary key columns.





[jira] [Resolved] (HUDI-6606) Use record level index with SQL equality queries

2023-08-07 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HUDI-6606.
---

> Use record level index with SQL equality queries
> 
>
> Key: HUDI-6606
> URL: https://issues.apache.org/jira/browse/HUDI-6606
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> With record level index support in Hudi, SQL queries on record keys can
> leverage the Record Index. The Jira aims to add support for equality matches
> on record keys.





[jira] [Updated] (HUDI-6606) Use record level index with SQL equality queries

2023-08-07 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6606:
--
Fix Version/s: 0.14.0

> Use record level index with SQL equality queries
> 
>
> Key: HUDI-6606
> URL: https://issues.apache.org/jira/browse/HUDI-6606
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> With record level index support in Hudi, SQL queries on record keys can
> leverage the Record Index. The Jira aims to add support for equality matches
> on record keys.





[jira] [Resolved] (HUDI-6651) Support IN SQL query with Record Index

2023-08-07 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain resolved HUDI-6651.
---

> Support IN SQL query with Record Index
> --
>
> Key: HUDI-6651
> URL: https://issues.apache.org/jira/browse/HUDI-6651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently Record Index based pruning is only supported for EqualTo queries on 
> a record key. This Jira aims to add support for IN query as well.





[jira] [Updated] (HUDI-6651) Support IN SQL query with Record Index

2023-08-07 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6651:
--
Fix Version/s: 0.14.0

> Support IN SQL query with Record Index
> --
>
> Key: HUDI-6651
> URL: https://issues.apache.org/jira/browse/HUDI-6651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently Record Index based pruning is only supported for EqualTo queries on 
> a record key. This Jira aims to add support for IN query as well.





[jira] [Created] (HUDI-6657) Investigate records returned when full table scan is enabled for incremental query

2023-08-06 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6657:
-

 Summary: Investigate records returned when full table scan is 
enabled for incremental query
 Key: HUDI-6657
 URL: https://issues.apache.org/jira/browse/HUDI-6657
 Project: Apache Hudi
  Issue Type: Bug
  Components: incremental-query
Reporter: Lokesh Jain


HUDI-6649 adds assertions for SQL queries with the column stats index enabled.
One of the assertions in the test class
`org.apache.hudi.functional.TestColumnStatsIndexWithSQL#verifyFileIndexAndSQLQueries`
fails when full table scan is enabled. If full table scan is disabled, the
incremental query returns the correct results.

The Jira aims to investigate and fix the query results.





[jira] [Created] (HUDI-6656) SQL query should also consider partition filters while querying column stats

2023-08-06 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6656:
-

 Summary: SQL query should also consider partition filters while 
querying column stats
 Key: HUDI-6656
 URL: https://issues.apache.org/jira/browse/HUDI-6656
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain


Currently only data filters are used while querying the required file slices
from column stats. The Jira aims to also include partition filters while
querying column stats.
Currently, files from pruned partitions are also included after querying
column stats, although those files are filtered out in later steps.





[jira] [Created] (HUDI-6651) Support IN SQL query with Record Index

2023-08-05 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6651:
-

 Summary: Support IN SQL query with Record Index
 Key: HUDI-6651
 URL: https://issues.apache.org/jira/browse/HUDI-6651
 Project: Apache Hudi
  Issue Type: Bug
  Components: index
Reporter: Lokesh Jain
Assignee: Lokesh Jain


Currently Record Index based pruning is only supported for EqualTo queries on a 
record key. This Jira aims to add support for IN query as well.





[jira] [Updated] (HUDI-6649) Fix column stat based data filtering for MOR

2023-08-04 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6649:
--
Description: 
Currently the MOR snapshot relation does not use the column stats index for
pruning the files in its queries. The Jira aims to add support for pruning the
file slices based on column stats in case of MOR.

  was: Currently the MOR snapshot and incremental relations do not use the
column stats index for pruning the files in their queries. The Jira aims to add
support for pruning the file slices based on column stats in case of MOR.


> Fix column stat based data filtering for MOR
> 
>
> Key: HUDI-6649
> URL: https://issues.apache.org/jira/browse/HUDI-6649
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index, writer-core
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> Currently the MOR snapshot relation does not use the column stats index for
> pruning the files in its queries. The Jira aims to add support for pruning
> the file slices based on column stats in case of MOR.





[jira] [Created] (HUDI-6649) Fix column stat based data filtering for MOR

2023-08-04 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6649:
-

 Summary: Fix column stat based data filtering for MOR
 Key: HUDI-6649
 URL: https://issues.apache.org/jira/browse/HUDI-6649
 Project: Apache Hudi
  Issue Type: Bug
  Components: index, writer-core
Reporter: Lokesh Jain
Assignee: Lokesh Jain


Currently the MOR snapshot and incremental relations do not use the column
stats index for pruning the files in their queries. The Jira aims to add
support for pruning the file slices based on column stats.





[jira] [Updated] (HUDI-6649) Fix column stat based data filtering for MOR

2023-08-04 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6649:
--
Description: Currently the MOR snapshot and incremental relations do not use
the column stats index for pruning the files in their queries. The Jira aims to
add support for pruning the file slices based on column stats in case of MOR.
(was: Currently the MOR snapshot and incremental relations do not use the
column stats index for pruning the files in their queries. The Jira aims to add
support for pruning the file slices based on column stats.)

> Fix column stat based data filtering for MOR
> 
>
> Key: HUDI-6649
> URL: https://issues.apache.org/jira/browse/HUDI-6649
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index, writer-core
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>
> Currently the MOR snapshot and incremental relations do not use the column
> stats index for pruning the files in their queries. The Jira aims to add
> support for pruning the file slices based on column stats in case of MOR.





[jira] [Created] (HUDI-6631) RLI integration with SQL queries followup

2023-08-02 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6631:
-

 Summary: RLI integration with SQL queries followup
 Key: HUDI-6631
 URL: https://issues.apache.org/jira/browse/HUDI-6631
 Project: Apache Hudi
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Lokesh Jain
Assignee: Lokesh Jain


HUDI-6606 adds support for EqualTo queries using the record index with simple
record keys. This Jira aims to add further support for the following use cases:
1. Queries on multiple columns, or queries that combine different indices such
as column stats and RLI.
2. Support other key generator types for RLI. HUDI-6606 is limited to simple
record keys.
3. Support other query types like IN, >, < etc.





[jira] [Updated] (HUDI-6606) Use record level index with SQL equality queries

2023-07-28 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-6606:
--
Issue Type: Improvement  (was: Bug)

> Use record level index with SQL equality queries
> 
>
> Key: HUDI-6606
> URL: https://issues.apache.org/jira/browse/HUDI-6606
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> With record level index support in Hudi, SQL queries on record keys can
> leverage the Record Index. The Jira aims to add support for equality matches
> on record keys.




