[jira] [Resolved] (HUDI-7474) Functional index creation fails for an existing table as reported by community user

2024-04-10 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat resolved HUDI-7474.
---

> Functional index creation fails for an existing table as reported by 
> community user
> ---
>
> Key: HUDI-7474
> URL: https://issues.apache.org/jira/browse/HUDI-7474
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> Investigate issue reported with functional index here - 
> https://github.com/apache/hudi/issues/10110



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-10 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835624#comment-17835624
 ] 

Vinaykumar Bhat edited comment on HUDI-7580 at 4/10/24 7:29 AM:


I think the problem is that Spark [rewrites the 
schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet]
 (in the catalog) for partitioned tables: the partitioning column is moved to 
the end of the schema. But 
{{[InsertIntoHoodieTableCommand::run(...)|https://github.com/apache/hudi/blob/984a248de4c783fb0d3728dff28f472fe863c9f2/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala#L57]}}
 has no special logic to account for this: it reads the schema from the catalog 
and maps the array of {{GenericInternalRow}} to the read schema.
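A minimal plain-Python sketch of the suspected mismatch (a simulation with hypothetical names, not Spark's or Hudi's actual code): binding the inserted values by position against the reordered catalog schema swaps price and ts, while name-based mapping does not.

```python
# The INSERT supplies values in the order (id, name, price, ts), but the
# catalog schema of a partitioned table has the partition column 'price'
# moved to the end: (id, name, ts, price).
write_order = ["id", "name", "price", "ts"]
catalog_schema = ["id", "name", "ts", "price"]

row = (1, "a1", 10, 1000)  # values(1, 'a1', 10, 1000)

# Suspected bug: positional mapping of the row onto the catalog schema.
buggy = dict(zip(catalog_schema, row))
assert buggy["ts"] == 10 and buggy["price"] == 1000  # price and ts swapped

# Name-based mapping avoids the mix-up: reorder values to the catalog schema.
by_name = dict(zip(write_order, row))
fixed = tuple(by_name[col] for col in catalog_schema)
assert fixed == (1, "a1", 1000, 10)  # ts=1000, price=10, as intended
```

The fix direction this suggests is to reconcile the write-side column order with the catalog order by name before materializing rows, rather than trusting positions.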


was (Author: JIRAUSER303569):
I think the problem is that spark [rewrites the 
schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet]
 (in the catalog) for partitioned table - the partitioning column is moved to 
the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no 
special logic to understand this - it reads the schema from the catalog and 
maps the array of {{GenericInternalRow}} to the read schema.

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1
>Reporter: Vinaykumar Bhat
>Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that the column ordering gets 
> messed up while inserting records into a hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
>   test("Test Create Functional Index") {
>     if (HoodieSparkUtils.gteqSpark3_2) {
>       withTempDir { tmp =>
>         val tableType = "cow"
>         val tableName = "rides"
>         val basePath = s"${tmp.getCanonicalPath}/$tableName"
>         spark.sql("set hoodie.metadata.enable=true")
>         spark.sql(
>           s"""
>              |create table $tableName (
>              |  id int,
>              |  name string,
>              |  price int,
>              |  ts long
>              |) using hudi
>              | options (
>              |  primaryKey ='id',
>              |  type = '$tableType',
>              |  preCombineField = 'ts',
>              |  hoodie.metadata.record.index.enable = 'true',
>              |  hoodie.datasource.write.recordkey.field = 'id'
>              | )
>              | partitioned by(price)
>              | location '$basePath'
>              """.stripMargin)
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>         spark.sql(s"select id, name, price, ts from $tableName").show(false)
>       }
>     }
>   } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---+----+------+----+
> |id |name|price |ts  |
> +---+----+------+----+
> |3  |a3  |20    |1000|
> |2  |a2  |20    |100 |
> |1  |a1  |1000  |10  |
> +---+----+------+----+
>  {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.





[jira] [Comment Edited] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-10 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835624#comment-17835624
 ] 

Vinaykumar Bhat edited comment on HUDI-7580 at 4/10/24 7:27 AM:


I think the problem is that Spark [rewrites the 
schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet]
 (in the catalog) for partitioned tables: the partitioning column is moved to 
the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no 
special logic to account for this: it reads the schema from the catalog and 
maps the array of {{GenericInternalRow}} to the read schema.


was (Author: JIRAUSER303569):
I think the problem is that spark [rewrites the 
schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet]
 (in the catalog) for partitioned table - the partitioning column is moved to 
the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no 
special logic to understand this - it reads the schema from the catalog and 
maps the array of {{GenericInternalRow}} to the read schema.

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1
>Reporter: Vinaykumar Bhat
>Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that the column ordering gets 
> messed up while inserting records into a hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
>   test("Test Create Functional Index") {
>     if (HoodieSparkUtils.gteqSpark3_2) {
>       withTempDir { tmp =>
>         val tableType = "cow"
>         val tableName = "rides"
>         val basePath = s"${tmp.getCanonicalPath}/$tableName"
>         spark.sql("set hoodie.metadata.enable=true")
>         spark.sql(
>           s"""
>              |create table $tableName (
>              |  id int,
>              |  name string,
>              |  price int,
>              |  ts long
>              |) using hudi
>              | options (
>              |  primaryKey ='id',
>              |  type = '$tableType',
>              |  preCombineField = 'ts',
>              |  hoodie.metadata.record.index.enable = 'true',
>              |  hoodie.datasource.write.recordkey.field = 'id'
>              | )
>              | partitioned by(price)
>              | location '$basePath'
>              """.stripMargin)
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>         spark.sql(s"select id, name, price, ts from $tableName").show(false)
>       }
>     }
>   } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---+----+------+----+
> |id |name|price |ts  |
> +---+----+------+----+
> |3  |a3  |20    |1000|
> |2  |a2  |20    |100 |
> |1  |a1  |1000  |10  |
> +---+----+------+----+
>  {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.





[jira] [Commented] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-10 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835624#comment-17835624
 ] 

Vinaykumar Bhat commented on HUDI-7580:
---

I think the problem is that spark [rewrites the 
schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet]
 (in the catalog) for partitioned table - the partitioning column is moved to 
the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no 
special logic to understand this - it reads the schema from the catalog and 
maps the array of {{GenericInternalRow}} to the read schema.

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1
>Reporter: Vinaykumar Bhat
>Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that the column ordering gets 
> messed up while inserting records into a hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
>   test("Test Create Functional Index") {
>     if (HoodieSparkUtils.gteqSpark3_2) {
>       withTempDir { tmp =>
>         val tableType = "cow"
>         val tableName = "rides"
>         val basePath = s"${tmp.getCanonicalPath}/$tableName"
>         spark.sql("set hoodie.metadata.enable=true")
>         spark.sql(
>           s"""
>              |create table $tableName (
>              |  id int,
>              |  name string,
>              |  price int,
>              |  ts long
>              |) using hudi
>              | options (
>              |  primaryKey ='id',
>              |  type = '$tableType',
>              |  preCombineField = 'ts',
>              |  hoodie.metadata.record.index.enable = 'true',
>              |  hoodie.datasource.write.recordkey.field = 'id'
>              | )
>              | partitioned by(price)
>              | location '$basePath'
>              """.stripMargin)
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>         spark.sql(s"select id, name, price, ts from $tableName").show(false)
>       }
>     }
>   } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---+----+------+----+
> |id |name|price |ts  |
> +---+----+------+----+
> |3  |a3  |20    |1000|
> |2  |a2  |20    |100 |
> |1  |a1  |1000  |10  |
> +---+----+------+----+
>  {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.





[jira] [Updated] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7582:
--
Labels: hudi-1.0.0-beta2  (was: )

> Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
> -
>
> Key: HUDI-7582
> URL: https://issues.apache.org/jira/browse/HUDI-7582
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
>
> lookupCandidateFilesInMetadataTable(...) calls 
> FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string 
> for indexPartition, which results in an NPE because 
> loadFunctionalIndexDataFrame() tries to look up and dereference the 
> index definition using this empty string. 
>  
> This part of the code should never have worked, which suggests the functional 
> index (based on col-stats) is not tested on the query path. Getting the 
> index-partition to use on the query side seems more involved: the incoming 
> query predicates need to be parsed to extract the (column-name, function-name) 
> pairs, and the corresponding index-partition then fetched by walking the 
> index-defs maintained in the index-metadata. 





[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Labels: hudi-1.0.0-beta2  (was: )

> Functional index (on col stats) creation fails to process all files/partitions
> --
>
> Key: HUDI-7579
> URL: https://issues.apache.org/jira/browse/HUDI-7579
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Creating a functional index on an existing table fails to process all files 
> and partitions of the table. The col-stats MDT partition ends up having an 
> entry only for a subset of the files that belong to the table. An example follows.
>  
> The following create-table and inserts should create a table with 3 
> partitions (with each partition having one slice):
> {code:java}
> spark.sql(
>   s"""
>      |create table test_table(
>      |  id int,
>      |  name string,
>      |  ts long,
>      |  price int
>      |) using hudi
>      | options (
>      |  primaryKey ='id',
>      |  type = 'cow',
>      |  preCombineField = 'ts',
>      |  hoodie.metadata.record.index.enable = 'true',
>      |  hoodie.datasource.write.recordkey.field = 'id'
>      | )
>      | partitioned by(price)
>      | location '$basePath'
>      """.stripMargin)
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code}
> Now create a functional index (using col stats) on this table. The col-stats 
> partition in the MDT should have three entries (column-level stats for the 3 
> files). However, it has only a single entry (for one of the files).
>  
> {code:java}
> var createIndexSql = s"create index idx_datestr on test_table using 
> column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from 
> hudi_metadata('test_table') where type = 3").show(false) {code}
> As seen below, col-stats has only one entry, for one of the files (and is 
> missing statistics for the two other files): 
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
> ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, 1, 0, 434874, 869748, false}*
>  
> {noformat}
> +------------------------------------------------+----+-------------------+
> |key                                             |type|ColumnStatsMetadata|
> +------------------------------------------------+----+-------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------+
> {noformat}
>  
>  





[jira] [Created] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()

2024-04-09 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7582:
-

 Summary: Fix NPE in 
FunctionalIndexSupport::loadFunctionalIndexDataFrame()
 Key: HUDI-7582
 URL: https://issues.apache.org/jira/browse/HUDI-7582
 Project: Apache Hudi
  Issue Type: Bug
  Components: index
Reporter: Vinaykumar Bhat


lookupCandidateFilesInMetadataTable(...) calls 
FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string for 
indexPartition, which results in an NPE because loadFunctionalIndexDataFrame() 
tries to look up and dereference the index definition using this empty string. 
 
This part of the code should never have worked, which suggests the functional 
index (based on col-stats) is not tested on the query path. Getting the 
index-partition to use on the query side seems more involved: the incoming 
query predicates need to be parsed to extract the (column-name, function-name) 
pairs, and the corresponding index-partition then fetched by walking the 
index-defs maintained in the index-metadata. 
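A small plain-Python sketch of the failure mode described above (hypothetical names, not Hudi's actual API): the empty indexPartition key misses in the index-definition map, the missing result is dereferenced, and the call crashes; a guard would surface an actionable error instead.

```python
# Hypothetical stand-in for the index definitions kept in index-metadata.
index_defs = {
    "func_index_idx_datestr": {"column": "ts", "function": "from_unixtime"},
}

def load_functional_index_dataframe(index_partition: str):
    # Suspected bug: "" is not a valid key, so get() returns None, and the
    # dereference below fails (Python's TypeError, the NPE equivalent here).
    index_def = index_defs.get(index_partition)
    return index_def["column"]

def load_functional_index_dataframe_guarded(index_partition: str):
    # A guard turns the crash into a clear, actionable error.
    index_def = index_defs.get(index_partition)
    if index_def is None:
        raise ValueError(f"no index definition for partition '{index_partition}'")
    return index_def["column"]
```

As the ticket notes, the real fix is earlier: the query side must resolve the predicate's (column, function) pair to a concrete index-partition key before this lookup, rather than passing an empty string.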





[jira] [Comment Edited] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-04-09 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835348#comment-17835348
 ] 

Vinaykumar Bhat edited comment on HUDI-7007 at 4/9/24 11:45 AM:


I am not sure if creating functional indexes (using col stats) works correctly. 
For example, creating a functional index on a table with three existing files 
fails to process all the files. The col-stats in the MDT is created for only 
one of the files. HUDI-7579 has more details.


was (Author: JIRAUSER303569):
I am not sure if creating functional indexes (using col stats) works correctly. 
For example, creating a functional index on a table with three existing files 
fails to process all the files. The col-stats in the MDT is created for only 
one of the files. HUDI-7579 has more details.

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the one created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks the support for 
> using bloom filters on functional index in the reader path.





[jira] [Commented] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-04-09 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835348#comment-17835348
 ] 

Vinaykumar Bhat commented on HUDI-7007:
---

I am not sure if creating functional indexes (using col stats) works correctly. 
For example, creating a functional index on a table with three existing files 
fails to process all the files. The col-stats in the MDT is created for only 
one of the files. HUDI-7579 has more details.

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the one created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks the support for 
> using bloom filters on functional index in the reader path.





[jira] [Updated] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7580:
--
Description: 
Came across this behaviour of partitioned tables when trying to debug some 
other issue with functional-index. It seems that the column ordering gets 
messed up while inserting records into a hudi table. Hence, a subsequent query 
returns wrong results. An example follows:

 

The following is a scala test:
{code:java}
  test("Test Create Functional Index") {
    if (HoodieSparkUtils.gteqSpark3_2) {
      withTempDir { tmp =>
        val tableType = "cow"
        val tableName = "rides"
        val basePath = s"${tmp.getCanonicalPath}/$tableName"
        spark.sql("set hoodie.metadata.enable=true")
        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  price int,
             |  ts long
             |) using hudi
             | options (
             |  primaryKey ='id',
             |  type = '$tableType',
             |  preCombineField = 'ts',
             |  hoodie.metadata.record.index.enable = 'true',
             |  hoodie.datasource.write.recordkey.field = 'id'
             | )
             | partitioned by(price)
             | location '$basePath'
             """.stripMargin)
        spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")

        spark.sql(s"select id, name, price, ts from $tableName").show(false)
      }
    }
  } {code}
 

The query returns the following result (note how *price* and *ts* columns are 
mixed up). 
{code:java}
+---+----+------+----+
|id |name|price |ts  |
+---+----+------+----+
|3  |a3  |20    |1000|
|2  |a2  |20    |100 |
|1  |a1  |1000  |10  |
+---+----+------+----+
 {code}
 

Having the partition column as the last column in the schema does not cause 
this problem. If the mixed-up columns are of incompatible datatypes, then the 
insert fails with an error.

  was:
Came across this behaviour of partitioned tables when trying to debug some 
other issue with functional-index. It seems that the column ordering gets 
messed up while inserting records into a hudi table. Hence, a subsequent query 
returns wrong results. An example follows:

 

The following is a scala test:
{code:java}
  test("Test Create Functional Index") {
if (HoodieSparkUtils.gteqSpark3_2) {
  withTempDir { tmp =>
val tableType = "cow"
  val tableName = "rides"
  val basePath = s"${tmp.getCanonicalPath}/$tableName"
  spark.sql("set hoodie.metadata.enable=true")
  spark.sql(
s"""
   |create table $tableName (
   |  id int,
   |  name string,
   |  price int,
   |  ts long
   |) using hudi
   | options (
   |  primaryKey ='id',
   |  type = '$tableType',
   |  preCombineField = 'ts',
   |  hoodie.metadata.record.index.enable = 'true',
   |  hoodie.datasource.write.recordkey.field = 'id'
   | )
   | partitioned by(price)
   | location '$basePath'
   """.stripMargin)
  spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 
'a1', 10, 1000)")
  spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 
'a2', 100, 20)")
  spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 
'a3', 1000, 20)")

  spark.sql(s"select id, name, price, ts from $tableName").show(false)
  }
}
  } {code}
 

The query returns the following result (note how price and ts columns are 
mixed up). 
{code:java}
+---+----+------+----+
|id |name|price |ts  |
+---+----+------+----+
|3  |a3  |20    |1000|
|2  |a2  |20    |100 |
|1  |a1  |1000  |10  |
+---+----+------+----+
 {code}
 

Having the partition column as the last column in the schema does not cause 
this problem. If the mixed-up columns are of incompatible datatypes, then the 
insert fails with an error.


> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1
>Reporter: Vinaykumar Bhat
>Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that 

[jira] [Created] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-09 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7580:
-

 Summary: Inserting rows into partitioned table leads to data 
sanity issues
 Key: HUDI-7580
 URL: https://issues.apache.org/jira/browse/HUDI-7580
 Project: Apache Hudi
  Issue Type: Bug
Affects Versions: 1.0.0-beta1
Reporter: Vinaykumar Bhat


Came across this behaviour of partitioned tables when trying to debug some 
other issue with functional-index. It seems that the column ordering gets 
messed up while inserting records into a hudi table. Hence, a subsequent query 
returns wrong results. An example follows:

 

The following is a scala test:
{code:java}
  test("Test Create Functional Index") {
    if (HoodieSparkUtils.gteqSpark3_2) {
      withTempDir { tmp =>
        val tableType = "cow"
        val tableName = "rides"
        val basePath = s"${tmp.getCanonicalPath}/$tableName"
        spark.sql("set hoodie.metadata.enable=true")
        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  price int,
             |  ts long
             |) using hudi
             | options (
             |  primaryKey ='id',
             |  type = '$tableType',
             |  preCombineField = 'ts',
             |  hoodie.metadata.record.index.enable = 'true',
             |  hoodie.datasource.write.recordkey.field = 'id'
             | )
             | partitioned by(price)
             | location '$basePath'
             """.stripMargin)
        spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")

        spark.sql(s"select id, name, price, ts from $tableName").show(false)
      }
    }
  } {code}
 

The query returns the following result (note how price and ts columns are 
mixed up). 
{code:java}
+---+----+------+----+
|id |name|price |ts  |
+---+----+------+----+
|3  |a3  |20    |1000|
|2  |a2  |20    |100 |
|1  |a1  |1000  |10  |
+---+----+------+----+
 {code}
 

Having the partition column as the last column in the schema does not cause 
this problem. If the mixed-up columns are of incompatible datatypes, then the 
insert fails with an error.





[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
  Component/s: index
Epic Link: HUDI-512
Fix Version/s: 1.0.0

> Functional index (on col stats) creation fails to process all files/partitions
> --
>
> Key: HUDI-7579
> URL: https://issues.apache.org/jira/browse/HUDI-7579
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Priority: Major
> Fix For: 1.0.0
>
>
> Creating a functional index on an existing table fails to process all files 
> and partitions of the table. The col-stats MDT partition ends up having an 
> entry only for a subset of the files that belong to the table. An example follows.
>  
> The following create-table and inserts should create a table with 3 
> partitions (with each partition having one slice):
> {code:java}
> spark.sql(
>   s"""
>      |create table test_table(
>      |  id int,
>      |  name string,
>      |  ts long,
>      |  price int
>      |) using hudi
>      | options (
>      |  primaryKey ='id',
>      |  type = 'cow',
>      |  preCombineField = 'ts',
>      |  hoodie.metadata.record.index.enable = 'true',
>      |  hoodie.datasource.write.recordkey.field = 'id'
>      | )
>      | partitioned by(price)
>      | location '$basePath'
>      """.stripMargin)
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code}
> Now create a functional index (using col stats) on this table. The col-stats 
> partition in the MDT should have three entries (column-level stats for the 3 
> files). However, it has only a single entry (for one of the files).
>  
> {code:java}
> var createIndexSql = s"create index idx_datestr on test_table using 
> column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from 
> hudi_metadata('test_table') where type = 3").show(false) {code}
> As seen below, col-stats has only one entry, for one of the files (and is 
> missing statistics for the two other files): 
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
> ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, 1, 0, 434874, 869748, false}*
>  
> {noformat}
> +------------------------------------------------+----+-------------------+
> |key                                             |type|ColumnStatsMetadata|
> +------------------------------------------------+----+-------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------+
> {noformat}
>  
>  





[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
Creating a functional index on an existing table fails to process all files and 
partitions of the table. The col-stats MDT partition ends up having an entry 
only for a subset of the files that belong to the table. An example follows.

 

The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
     """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code}
Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (column-level stats for the 3 
files). However, it has only a single entry (for one of the files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('test_table') where type = 3").show(false) {code}
As seen below, col-stats has only one entry, for one of the files (and is 
missing statistics for the two other files): 
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}*

 
{noformat}
 
+++---+
|key                                             |type|ColumnStatsMetadata      
                                                                                
                                                                                
                                                                          |
+++---+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
|{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}|
+++---+
{noformat}
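Conceptually, a complete index build would produce one stats entry per base file. The following minimal Python sketch models that expectation under stated assumptions (file names and structures are hypothetical placeholders, not Hudi's actual implementation):

```python
from datetime import datetime, timezone

def from_unixtime(ts, fmt="%Y-%m-%d"):
    # Spark's from_unixtime takes seconds since the epoch; ts=1000 -> 1970-01-01.
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)

# One file slice per partition, matching the three inserts above
# (file names are hypothetical placeholders).
files = {
    "price=10/slice-1.parquet":   [1000],
    "price=100/slice-2.parquet":  [20],
    "price=1000/slice-3.parquet": [20],
}

def build_col_stats(files):
    # A correct index build must visit every file of every partition and
    # record min/max of the indexed expression per file.
    return {name: (min(map(from_unixtime, ts_values)),
                   max(map(from_unixtime, ts_values)))
            for name, ts_values in files.items()}

stats = build_col_stats(files)
assert len(stats) == 3  # the bug report above shows only 1 entry instead of 3
```

With three files, a correct build yields three entries, each with min/max of `1970-01-01`; the reported behavior produced only one.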
 
 

  was:
Creating a functional index on an existing table fails to process all files and 
partitions of the table. The col-stats MDT partition ends up having an entry 
only for a subset of the files that belong to the table. An example follows.

 

The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
   spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
1000, 10)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
20, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) 

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
Creating a functional index on an existing table fails to process all files and 
partitions of the table. The col-stats MDT partition ends up having an entry 
only for a subset of the files that belong to the table. An example follows.

 

The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
   spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
1000, 10)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
20, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
20, 1000)"){code}
Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (representing column-level stats 
for the 3 files). However, col stats has only a single entry (for one of the 
files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('test_table') where type = 3").show(false) {code}
As seen below, col-stats has only one entry for one of the files (and is missing 
statistics for the two other files): 
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}*
{code:java}
+++---+
|key                                             |type|ColumnStatsMetadata      
                                                                                
                                                                                
                                                                          |
+++---+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
|{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}|
+++---+
 {code}
 
 

  was:
The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
   spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
1000, 10)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
20, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
20, 1000)"){code}
Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (representing column-level stats 
for the 3 files). However, col stats only has one single 

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
   spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
1000, 10)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
20, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
20, 1000)"){code}
Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (representing column-level stats 
for the 3 files). However, col stats has only a single entry (for one of the 
files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('test_table') where type = 3").show(false) {code}
As seen below, col-stats has only one entry for one of the files (and is missing 
statistics for the two other files): 
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}*
{code:java}
+++---+
|key                                             |type|ColumnStatsMetadata      
                                                                                
                                                                                
                                                                          |
+++---+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
|{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}|
+++---+
 {code}
 
 

  was:
The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = '$tableType',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
   spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
1000, 10)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
20, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
20, 1000)"){code}

Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (representing column-level stats 
for the 3 files). However, col stats has only a single entry (for one of the 
files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on $tableName using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, 

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = '$tableType',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
   spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
1000, 10)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
20, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
20, 1000)"){code}

Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (representing column-level stats 
for the 3 files). However, col stats has only a single entry (for one of the 
files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on $tableName using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('$tableName') where type = 3").show(false) {code}
As seen below, col-stats has only one entry for one of the files (and is missing 
statistics for the two other files): 
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}*
{code:java}
+++---+
|key                                             |type|ColumnStatsMetadata      
                                                                                
                                                                                
                                                                          |
+++---+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
|{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}|
+++---+
 {code}

 
 

  was:
The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):

```
spark.sql(s"""create table test_table (id int, name string, ts long, price int) 
using hudi
| options (
| primaryKey ='id',
| type = '$tableType',
| preCombineField = 'ts',
| hoodie.metadata.record.index.enable = 'true',
| hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)

spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 
10)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 
20, 100)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 
20, 1000)")
```
 
Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (representing column-level stats 
for the 3 files). However, col stats has only a single entry (for one of the 
files).
 
```
var createIndexSql = s"create index idx_datestr on test_table using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)

spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('test_table') where type = 3").show(false)
```
 
As seen below, col-stats has only one entry for one of the files (and 

[jira] [Created] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7579:
-

 Summary: Functional index (on col stats) creation fails to process 
all files/partitions
 Key: HUDI-7579
 URL: https://issues.apache.org/jira/browse/HUDI-7579
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinaykumar Bhat


The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice):

```
spark.sql(s"""create table test_table (id int, name string, ts long, price int) 
using hudi
| options (
| primaryKey ='id',
| type = '$tableType',
| preCombineField = 'ts',
| hoodie.metadata.record.index.enable = 'true',
| hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)

spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 
10)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 
20, 100)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 
20, 1000)")
```
 
Now create a functional index (using col stats) on this table. The col-stats 
partition in the MDT should have three entries (representing column-level stats 
for the 3 files). However, col stats has only a single entry (for one of the 
files).
 
```
var createIndexSql = s"create index idx_datestr on test_table using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)

spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('test_table') where type = 3").show(false)
```
 
As seen below, col-stats has only one entry for one of the files (and is missing 
statistics for the two other files): 
`\{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}`
{{+++---+}}
{{|key                                             |type|ColumnStatsMetadata    
                                                                                
                                                                                
                                                                            |}}
{{+++---+}}
{{|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
|\{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}|}}
{{+++---+}}
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7569:
--
Sprint: Sprint 2024-03-25

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting a valid `In` expression/filter on the 
> record-key has bugs: it prunes files assuming that an `In` expression/filter 
> can reference only the record-key column, even when the `In` predicate is 
> based on other columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.





[jira] [Updated] (HUDI-7559) Fix functional index (on column stats): Handle NPE in filterQueriesWithRecordKey(...)

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7559:
--
Story Points: 4

> Fix functional index (on column stats): Handle NPE in 
> filterQueriesWithRecordKey(...)
> -
>
> Key: HUDI-7559
> URL: https://issues.apache.org/jira/browse/HUDI-7559
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws an NPE which 
> is then subsequently ignored by `lookupCandidateFilesInMetadataTable()`, 
> preventing every other index (like FunctionalIndex, ColStat Index) from being 
> used for data skipping (i.e. pruning files)





[jira] [Assigned] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7569:
-

Story Points: 4
Assignee: Vinaykumar Bhat

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting a valid `In` expression/filter on the 
> record-key has bugs: it prunes files assuming that an `In` expression/filter 
> can reference only the record-key column, even when the `In` predicate is 
> based on other columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.





[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7569:
--
Epic Link: HUDI-512

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting a valid `In` expression/filter on the 
> record-key has bugs: it prunes files assuming that an `In` expression/filter 
> can reference only the record-key column, even when the `In` predicate is 
> based on other columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.





[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7569:
--
Fix Version/s: 1.0.0

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Priority: Major
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting a valid `In` expression/filter on the 
> record-key has bugs: it prunes files assuming that an `In` expression/filter 
> can reference only the record-key column, even when the `In` predicate is 
> based on other columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.





[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7569:
--
Labels: hudi-1.0.0-beta2  (was: )

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting a valid `In` expression/filter on the 
> record-key has bugs: it prunes files assuming that an `In` expression/filter 
> can reference only the record-key column, even when the `In` predicate is 
> based on other columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.





[jira] [Created] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-02 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7569:
-

 Summary: Fix wrong result while using RLI for pruning files
 Key: HUDI-7569
 URL: https://issues.apache.org/jira/browse/HUDI-7569
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinaykumar Bhat


Data skipping (pruning files) for RLI is supported only when the query 
predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
However, the logic for detecting a valid `In` expression/filter on the 
record-key has bugs: it prunes files assuming that an `In` expression/filter 
can reference only the record-key column, even when the `In` predicate is based 
on other columns.

 

For example, a query of the form `select * from trips_table where driver in 
('abc', 'xyz')` can return wrong results if the record-key for this table also 
takes the values 'abc' or 'xyz' for some rows of the table.
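The required guard can be sketched in a few lines of Python (expression shapes and the `uuid` record-key name are simplified stand-ins, not Hudi classes):

```python
from dataclasses import dataclass

@dataclass
class In:
    # Simplified stand-in for an SQL In-filter: column IN (values...)
    column: str
    values: list

RECORD_KEY = "uuid"  # hypothetical record-key column

def keys_for_rli_lookup(expr):
    # Buggy behavior assumed any In filter targeted the record key.
    # Correct behavior: only return candidate keys when the filter
    # actually references the record-key column; otherwise no pruning.
    if isinstance(expr, In) and expr.column == RECORD_KEY:
        return expr.values
    return None  # cannot prune via RLI; fall back to scanning all files

assert keys_for_rli_lookup(In("uuid", ["k1", "k2"])) == ["k1", "k2"]
# `driver` is not the record key, so RLI must not prune here:
assert keys_for_rli_lookup(In("driver", ["abc", "xyz"])) is None
```

Without the column check, the `driver in ('abc', 'xyz')` filter would be treated as a record-key lookup and files could be pruned incorrectly.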





[jira] [Updated] (HUDI-7563) Implement DROP INDEX support

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7563:
--
Fix Version/s: 1.0.0

> Implement DROP INDEX support
> 
>
> Key: HUDI-7563
> URL: https://issues.apache.org/jira/browse/HUDI-7563
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> DROP INDEX is not supported for functional index and secondary index





[jira] [Created] (HUDI-7563) Implement DROP INDEX support

2024-04-02 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7563:
-

 Summary: Implement DROP INDEX support
 Key: HUDI-7563
 URL: https://issues.apache.org/jira/browse/HUDI-7563
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Vinaykumar Bhat


DROP INDEX is not supported for functional index and secondary index





[jira] [Updated] (HUDI-7563) Implement DROP INDEX support

2024-04-02 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7563:
--
Labels: hudi-1.0.0-beta2  (was: )

> Implement DROP INDEX support
> 
>
> Key: HUDI-7563
> URL: https://issues.apache.org/jira/browse/HUDI-7563
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
>
> DROP INDEX is not supported for functional index and secondary index





[jira] [Updated] (HUDI-7559) Fix functional index (on column stats): Handle NPE in filterQueriesWithRecordKey(...)

2024-04-01 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7559:
--
Description: `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` 
throws an NPE which is then subsequently ignored by 
`lookupCandidateFilesInMetadataTable()`, preventing every other index (like 
FunctionalIndex, ColStat Index) from being used for data skipping (i.e. pruning 
files)  (was: `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws 
NPE which is then subsequently `lookupCandidateFilesInMetadataTable()` 
rendering every other index (like FunctionalIndex, ColStat Index) to not be 
used for data skipping (i.e pruning files))

> Fix functional index (on column stats): Handle NPE in 
> filterQueriesWithRecordKey(...)
> -
>
> Key: HUDI-7559
> URL: https://issues.apache.org/jira/browse/HUDI-7559
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws an NPE which 
> is then subsequently ignored by `lookupCandidateFilesInMetadataTable()`, 
> preventing every other index (like FunctionalIndex, ColStat Index) from being 
> used for data skipping (i.e. pruning files)
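One way to avoid the NPE is to return an empty result for missing inputs instead of dereferencing null, so the caller can distinguish "no usable record-key filter" from an error. A minimal Python sketch under assumed, hypothetical shapes (not Hudi code):

```python
from typing import List, Optional

def filter_queries_with_record_key(filters: Optional[List[str]],
                                   record_key: Optional[str]) -> List[str]:
    # Guard against missing inputs rather than raising (the Java NPE case);
    # an empty list means "no record-key filters found" and lets the caller
    # continue evaluating other indexes for data skipping.
    if not filters or record_key is None:
        return []
    return [f for f in filters if record_key in f]

assert filter_queries_with_record_key(None, "uuid") == []
assert filter_queries_with_record_key(["uuid = 'a'", "price > 1"],
                                      "uuid") == ["uuid = 'a'"]
```

Returning an empty list keeps the lookup path alive so functional and col-stats indexes can still be consulted.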





[jira] [Updated] (HUDI-7559) Fix functional index (on column stats): Handle NPE in filterQueriesWithRecordKey(...)

2024-04-01 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7559:
--
Description: `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` 
throws NPE which is then subsequently `lookupCandidateFilesInMetadataTable()` 
rendering every other index (like FunctionalIndex, ColStat Index) to not be 
used for data skipping (i.e pruning files)
Summary: Fix functional index (on column stats): Handle NPE in 
filterQueriesWithRecordKey(...)  (was: Fix issues with functional index (on 
column stats) based pruning)

> Fix functional index (on column stats): Handle NPE in 
> filterQueriesWithRecordKey(...)
> -
>
> Key: HUDI-7559
> URL: https://issues.apache.org/jira/browse/HUDI-7559
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws NPE which is 
> then subsequently `lookupCandidateFilesInMetadataTable()` rendering every 
> other index (like FunctionalIndex, ColStat Index) to not be used for data 
> skipping (i.e pruning files)





[jira] [Updated] (HUDI-7559) Fix issues with functional index (on column stats) based pruning

2024-04-01 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7559:
--
Status: In Progress  (was: Open)

> Fix issues with functional index (on column stats) based pruning
> 
>
> Key: HUDI-7559
> URL: https://issues.apache.org/jira/browse/HUDI-7559
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7559) Fix issues with functional index (on column stats) based pruning

2024-04-01 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7559:
--
Epic Link: HUDI-512

> Fix issues with functional index (on column stats) based pruning
> 
>
> Key: HUDI-7559
> URL: https://issues.apache.org/jira/browse/HUDI-7559
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7559) Fix issues with functional index (on column stats) based pruning

2024-04-01 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7559:
--
Fix Version/s: 1.0.0

> Fix issues with functional index (on column stats) based pruning
> 
>
> Key: HUDI-7559
> URL: https://issues.apache.org/jira/browse/HUDI-7559
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7559) Fix issues with functional index (on column stats) based pruning

2024-04-01 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7559:
-

 Summary: Fix issues with functional index (on column stats) based 
pruning
 Key: HUDI-7559
 URL: https://issues.apache.org/jira/browse/HUDI-7559
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinaykumar Bhat






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7559) Fix issues with functional index (on column stats) based pruning

2024-04-01 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7559:
-

Assignee: Vinaykumar Bhat

> Fix issues with functional index (on column stats) based pruning
> 
>
> Key: HUDI-7559
> URL: https://issues.apache.org/jira/browse/HUDI-7559
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-03-28 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831645#comment-17831645
 ] 

Vinaykumar Bhat edited comment on HUDI-7007 at 3/28/24 6:46 AM:


Seems like {_}FunctionalIndexSupport::loadFunctionalIndexDataFrame(...){_} is 
always called (from 
{_}HoodieFileIndex::lookupCandidateFilesInMetadataTable(...){_}) with an empty 
{_}indexPartition{_} string. So, it is likely that file pruning based on a 
functional index is not supported.


was (Author: JIRAUSER303569):
Seems like `FunctionalIndexSupport::loadFunctionalIndexDataFrame(...)` is 
always called (from 
`HoodieFileIndex::lookupCandidateFilesInMetadataTable(...)`) with an empty 
`indexPartition` string. So, it is likely that file pruning based on a 
functional index is not supported.

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the one created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks the support for 
> using bloom filters on a functional index in the reader path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-03-28 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831645#comment-17831645
 ] 

Vinaykumar Bhat commented on HUDI-7007:
---

Seems like `FunctionalIndexSupport::loadFunctionalIndexDataFrame(...)` is 
always called (from 
`HoodieFileIndex::lookupCandidateFilesInMetadataTable(...)`) with an empty 
`indexPartition` string. So, it is likely that file pruning based on a 
functional index is not supported.
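If that reading is right, the gap can be made explicit with a guard of the 
following shape. This is a minimal sketch with hypothetical names and 
signatures, not the actual Hudi code:

```scala
// Illustrative sketch only: candidateFilesViaFunctionalIndex and
// loadPrunedFileNames are hypothetical names, not the real Hudi API.
// The point: an empty index-partition name cannot select a functional-index
// partition in the metadata table, so pruning must fall back to no-op.
def candidateFilesViaFunctionalIndex(indexPartition: String): Option[Set[String]] =
  if (indexPartition.isEmpty)
    None // no functional-index partition resolved => no pruning possible
  else
    Some(loadPrunedFileNames(indexPartition)) // hypothetical loader

def loadPrunedFileNames(indexPartition: String): Set[String] =
  Set.empty // placeholder for FunctionalIndexSupport::loadFunctionalIndexDataFrame(...)
```

Returning `None` rather than an empty candidate set keeps "index not usable" 
distinct from "index pruned everything".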

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the one created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks the support for 
> using bloom filters on a functional index in the reader path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-03-27 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831243#comment-17831243
 ] 

Vinaykumar Bhat commented on HUDI-7007:
---

[~codope] - need some pointers on this. Are there any tests that execute a 
query resulting in pruning files based on a functional index? I saw 
`TestFunctionalIndex.scala`, but none of the tests there seem to have such a 
query.

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the one created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks the support for 
> using bloom filters on a functional index in the reader path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7484) Fix partitioning style when partition is inferred from partitionBy

2024-03-26 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831167#comment-17831167
 ] 

Vinaykumar Bhat commented on HUDI-7484:
---

[~codope] Do you have some pointers or a test case? I want to understand what 
this is about and how to proceed.

> Fix partitioning style when partition is inferred from partitionBy
> --
>
> Key: HUDI-7484
> URL: https://issues.apache.org/jira/browse/HUDI-7484
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vinaykumar Bhat
>Priority: Major
> Fix For: 1.0.0
>
>
> When inferring the partition from partitionBy() arguments with hive-style 
> partitioning enabled, we observe that the partitioning style is not uniform 
> for multi-level partitions. The directory structure is as follows:
> partition=2015
>                        |- 03
>                              |- 15
>                              |- 16



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7117) Functional index creation not working when table is created using datasource writer

2024-03-26 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830973#comment-17830973
 ] 

Vinaykumar Bhat edited comment on HUDI-7117 at 3/26/24 2:54 PM:


This is likely not an issue, but a gap in understanding the feature.

 

The issue is that 
{{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} 
creates a temporary view (similar to the one created using {{CREATE TEMPORARY 
VIEW ...}}), which is neither a table nor a hudi-managed table. Hence the 
following {{CREATE INDEX ...}} statement to create a functional index fails, 
as the object on which the index is being created is not a hudi-managed table.

Instead of creating a temporary view, one can use the {{saveAsTable(...)}} 
method on the DataFrameWriter object to create a hudi-managed table and then 
create a functional index on it. An example follows:
{code:java}
val columns = Seq("ts", "transaction_id", "rider", "driver", "price", 
"location")

val data = Seq(
(1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 
19.10, "san_francisco"),
(1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 
27.70, "san_francisco"),
(1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 
33.90, "san_francisco"),
(1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 
34.15, "sao_paulo"),
(169511511L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 
17.85, "chennai"));
var inserts = spark.createDataFrame(data).toDF(columns: _*)
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", 
HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")

spark.catalog.listTables().show(false)
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON temp_table USING 
column_stats(ts) options(func='from_unixtime', format='-MM-dd')")
{code}
 


was (Author: JIRAUSER303569):
This is likely not an issue, but a gap in understanding the feature.

 

The issue is that 
{{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} 
creates a temporary view (similar to the one created using {{CREATE TEMPORARY 
VIEW ...}}), which is neither a table nor a hudi-managed table. Hence the 
following {{CREATE INDEX ...}} statement to create a functional index fails, 
as the object on which the index is being created is not a hudi-managed table.

Instead of creating a temporary view, one can use the {{saveAsTable(...)}} 
method on the DataFrameWriter object to create a hudi-managed table and then 
create a functional index on it. An example follows:
val columns = Seq("ts", "transaction_id", "rider", "driver", "price", 
"location")
val data =
  Seq((1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", 
"driver-K", 19.10, "san_francisco"),
(1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", 
"driver-M", 27.70, "san_francisco"),
(1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", 
"driver-L", 33.90, "san_francisco"),
(1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", 
"driver-P", 34.15, "sao_paulo"),
(169511511L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", 
"driver-T", 17.85, "chennai"));

var inserts = spark.createDataFrame(data).toDF(columns: _*)
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", 
HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")
spark.catalog.listTables().show(false)
spark.sql(s"select from_unixtime(ts, '-MM-dd') as datestr FROM 
temp_table").show()
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON temp_table USING 
column_stats(ts) options(func='from_unixtime', format='-MM-dd')")

> Functional index creation not working when table is created using datasource 
> writer
> ---

[jira] [Commented] (HUDI-7117) Functional index creation not working when table is created using datasource writer

2024-03-26 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830973#comment-17830973
 ] 

Vinaykumar Bhat commented on HUDI-7117:
---

This is likely not an issue, but a gap in understanding the feature.

 

The issue is that 
{{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} 
creates a temporary view (similar to the one created using {{CREATE TEMPORARY 
VIEW ...}}), which is neither a table nor a hudi-managed table. Hence the 
following {{CREATE INDEX ...}} statement to create a functional index fails, 
as the object on which the index is being created is not a hudi-managed table.

Instead of creating a temporary view, one can use the {{saveAsTable(...)}} 
method on the DataFrameWriter object to create a hudi-managed table and then 
create a functional index on it. An example follows:
val columns = Seq("ts", "transaction_id", "rider", "driver", "price", 
"location")
val data =
  Seq((1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", 
"driver-K", 19.10, "san_francisco"),
(1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", 
"driver-M", 27.70, "san_francisco"),
(1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", 
"driver-L", 33.90, "san_francisco"),
(1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", 
"driver-P", 34.15, "sao_paulo"),
(169511511L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", 
"driver-T", 17.85, "chennai"));

var inserts = spark.createDataFrame(data).toDF(columns: _*)
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", 
HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")
spark.catalog.listTables().show(false)
spark.sql(s"select from_unixtime(ts, '-MM-dd') as datestr FROM 
temp_table").show()
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON temp_table USING 
column_stats(ts) options(func='from_unixtime', format='-MM-dd')")

> Functional index creation not working when table is created using datasource 
> writer
> ---
>
> Key: HUDI-7117
> URL: https://issues.apache.org/jira/browse/HUDI-7117
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Aditya Goenka
>Assignee: Vinaykumar Bhat
>Priority: Blocker
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Details and reproducible code under GitHub issue - 
> [https://github.com/apache/hudi/issues/10110]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7117) Functional index creation not working when table is created using datasource writer

2024-03-26 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7117:
--
Status: In Progress  (was: Open)

> Functional index creation not working when table is created using datasource 
> writer
> ---
>
> Key: HUDI-7117
> URL: https://issues.apache.org/jira/browse/HUDI-7117
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Aditya Goenka
>Assignee: Vinaykumar Bhat
>Priority: Blocker
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Details and reproducible code under GitHub issue - 
> [https://github.com/apache/hudi/issues/10110]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times

2024-03-25 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7480:
--
Status: Patch Available  (was: In Progress)

> initializeFunctionalIndexPartition is called multiple times
> ---
>
> Key: HUDI-7480
> URL: https://issues.apache.org/jira/browse/HUDI-7480
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This is due to an issue in 
> initializeFromFilesystem(), which tries to check whether an MDT partition 
> needs to be initialized based on the absence of its partition-type. But for a 
> functional index, the partition-type actually stores the prefix 
> (func_index_), so the check always fails and we try to reinitialize the same 
> functional index partition again.
>  
> Simple test:
> {quote}spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long
> |) using hudi
> | options (
> | primaryKey ='id',
> | type = '$tableType',
> | preCombineField = 'ts',
> | hoodie.metadata.record.index.enable = 'true',
> | hoodie.datasource.write.recordkey.field = 'id'
> | )
> | partitioned by(ts)
> | location '$basePath'
> """.stripMargin)
> spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
> spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)")
> spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)")
>  
> var createIndexSql = s"create index idx_datestr on $tableName using 
> column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
>  
> -- This insert throws null-pointer exception
> spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times

2024-03-25 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reopened HUDI-7480:
---

> initializeFunctionalIndexPartition is called multiple times
> ---
>
> Key: HUDI-7480
> URL: https://issues.apache.org/jira/browse/HUDI-7480
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This is due to an issue in 
> initializeFromFilesystem(), which tries to check whether an MDT partition 
> needs to be initialized based on the absence of its partition-type. But for a 
> functional index, the partition-type actually stores the prefix 
> (func_index_), so the check always fails and we try to reinitialize the same 
> functional index partition again.
>  
> Simple test:
> {quote}spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long
> |) using hudi
> | options (
> | primaryKey ='id',
> | type = '$tableType',
> | preCombineField = 'ts',
> | hoodie.metadata.record.index.enable = 'true',
> | hoodie.datasource.write.recordkey.field = 'id'
> | )
> | partitioned by(ts)
> | location '$basePath'
> """.stripMargin)
> spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
> spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)")
> spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)")
>  
> var createIndexSql = s"create index idx_datestr on $tableName using 
> column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
>  
> -- This insert throws null-pointer exception
> spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times

2024-03-25 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7480:
--
Status: In Progress  (was: Reopened)

> initializeFunctionalIndexPartition is called multiple times
> ---
>
> Key: HUDI-7480
> URL: https://issues.apache.org/jira/browse/HUDI-7480
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This is due to an issue in 
> initializeFromFilesystem(), which tries to check whether an MDT partition 
> needs to be initialized based on the absence of its partition-type. But for a 
> functional index, the partition-type actually stores the prefix 
> (func_index_), so the check always fails and we try to reinitialize the same 
> functional index partition again.
>  
> Simple test:
> {quote}spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long
> |) using hudi
> | options (
> | primaryKey ='id',
> | type = '$tableType',
> | preCombineField = 'ts',
> | hoodie.metadata.record.index.enable = 'true',
> | hoodie.datasource.write.recordkey.field = 'id'
> | )
> | partitioned by(ts)
> | location '$basePath'
> """.stripMargin)
> spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
> spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)")
> spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)")
>  
> var createIndexSql = s"create index idx_datestr on $tableName using 
> column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
>  
> -- This insert throws null-pointer exception
> spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times

2024-03-25 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat resolved HUDI-7480.
---

> initializeFunctionalIndexPartition is called multiple times
> ---
>
> Key: HUDI-7480
> URL: https://issues.apache.org/jira/browse/HUDI-7480
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This is due to an issue in 
> initializeFromFilesystem(), which tries to check whether an MDT partition 
> needs to be initialized based on the absence of its partition-type. But for a 
> functional index, the partition-type actually stores the prefix 
> (func_index_), so the check always fails and we try to reinitialize the same 
> functional index partition again.
>  
> Simple test:
> {quote}spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long
> |) using hudi
> | options (
> | primaryKey ='id',
> | type = '$tableType',
> | preCombineField = 'ts',
> | hoodie.metadata.record.index.enable = 'true',
> | hoodie.datasource.write.recordkey.field = 'id'
> | )
> | partitioned by(ts)
> | location '$basePath'
> """.stripMargin)
> spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
> spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)")
> spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)")
>  
> var createIndexSql = s"create index idx_datestr on $tableName using 
> column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
>  
> -- This insert throws null-pointer exception
> spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7504) Replace expensive file existence check (in object store) with Spark options

2024-03-20 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829355#comment-17829355
 ] 

Vinaykumar Bhat commented on HUDI-7504:
---

Support for these configs (as part of DataSourceOptions) in Spark was added 
only in 3.4.0. Will hold on to the PR for now.

> Replace expensive file existence check (in object store) with Spark options
> ---
>
> Key: HUDI-7504
> URL: https://issues.apache.org/jira/browse/HUDI-7504
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
>
> The incremental loading from S3 and GCS performs an existence check for each 
> object. This is expensive. This happens 
> [here|https://github.com/apache/hudi/blob/130498708bb1cd5da1d0e725971b3d721eeef231/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java#L161]
>  
> This can be replaced with Spark-provided options:
> spark.sql.files.ignoreMissingFiles
> spark.sql.files.ignoreCorruptFiles
>  
> Ref for these options: 
> [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7512) Support sorting of input records in insert operation

2024-03-18 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7512:
-

Assignee: Vinaykumar Bhat

> Support sorting of input records in insert operation
> 
>
> Key: HUDI-7512
> URL: https://issues.apache.org/jira/browse/HUDI-7512
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7512) Support sorting of input records in insert operation

2024-03-18 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7512:
-

 Summary: Support sorting of input records in insert operation
 Key: HUDI-7512
 URL: https://issues.apache.org/jira/browse/HUDI-7512
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Vinaykumar Bhat






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7504) Replace expensive file existence check (in object store) with Spark options

2024-03-14 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7504:
-

 Summary: Replace expensive file existence check (in object store) 
with Spark options
 Key: HUDI-7504
 URL: https://issues.apache.org/jira/browse/HUDI-7504
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Vinaykumar Bhat


The incremental loading from S3 and GCS performs an existence check for each 
object. This is expensive. This happens 
[here|https://github.com/apache/hudi/blob/130498708bb1cd5da1d0e725971b3d721eeef231/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java#L161]
 
This can be replaced with Spark-provided options:
spark.sql.files.ignoreMissingFiles
spark.sql.files.ignoreCorruptFiles
 
Ref for these options: 
[https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files]
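For illustration, these options could replace the per-object check roughly as 
follows. This is a sketch, not the actual Hudi change; the source format and 
path below are placeholders:

```scala
// Sketch under assumptions: rely on Spark to skip files that disappeared
// between listing and reading, instead of checking existence per object.
// "ignoreMissingFiles"/"ignoreCorruptFiles" as per-read DataSource options
// require Spark >= 3.4; the session-wide configs exist in earlier versions.
val df = spark.read
  .format("parquet")                        // placeholder source format
  .option("ignoreMissingFiles", "true")
  .option("ignoreCorruptFiles", "true")
  .load("s3a://bucket/prefix/")             // placeholder path

// Session-wide equivalents, applied to all file-based sources:
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
```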
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7504) Replace expensive file existence check (in object store) with Spark options

2024-03-14 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7504:
-

Assignee: Vinaykumar Bhat

> Replace expensive file existence check (in object store) with Spark options
> ---
>
> Key: HUDI-7504
> URL: https://issues.apache.org/jira/browse/HUDI-7504
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> The incremental loading from S3 and GCS performs an existence check for each 
> object. This is expensive. This happens 
> [here|https://github.com/apache/hudi/blob/130498708bb1cd5da1d0e725971b3d721eeef231/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java#L161]
>  
> This can be replaced with Spark-provided options:
> spark.sql.files.ignoreMissingFiles
> spark.sql.files.ignoreCorruptFiles
>  
> Ref for these options: 
> [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times

2024-03-05 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7480:
-

Assignee: Sagar Sumit

> initializeFunctionalIndexPartition is called multiple times
> ---
>
> Key: HUDI-7480
> URL: https://issues.apache.org/jira/browse/HUDI-7480
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>
> This is due to an issue in 
> initializeFromFilesystem(), which tries to check whether an MDT partition 
> needs to be initialized based on the absence of its partition-type. But for a 
> functional index, the partition-type actually stores the prefix 
> (func_index_), so the check always fails and we try to reinitialize the same 
> functional index partition again.
>  
> Simple test:
> {quote}spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long
> |) using hudi
> | options (
> | primaryKey ='id',
> | type = '$tableType',
> | preCombineField = 'ts',
> | hoodie.metadata.record.index.enable = 'true',
> | hoodie.datasource.write.recordkey.field = 'id'
> | )
> | partitioned by(ts)
> | location '$basePath'
> """.stripMargin)
> spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
> spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)")
> spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)")
>  
> var createIndexSql = s"create index idx_datestr on $tableName using 
> column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
>  
> -- This insert throws null-pointer exception
> spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times

2024-03-05 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7480:
-

 Summary: initializeFunctionalIndexPartition is called multiple 
times
 Key: HUDI-7480
 URL: https://issues.apache.org/jira/browse/HUDI-7480
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinaykumar Bhat


This is due to an issue in 
initializeFromFilesystem(), which tries to check whether an MDT partition 
needs to be initialized based on the absence of its partition-type. But for a 
functional index, the partition-type actually stores the prefix (func_index_), 
so the check always fails and we try to reinitialize the same functional index 
partition again.
 
Simple test:
{quote}spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long
|) using hudi
| options (
| primaryKey ='id',
| type = '$tableType',
| preCombineField = 'ts',
| hoodie.metadata.record.index.enable = 'true',
| hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(ts)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)")
spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)")
 
var createIndexSql = s"create index idx_datestr on $tableName using 
column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
 
-- This insert throws null-pointer exception
spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote}
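The prefix mismatch described in this report can be sketched as follows; the 
names here are illustrative, not the actual metadata-writer code:

```scala
// Hypothetical sketch of the described behavior: when deciding whether an
// MDT partition is already initialized, a functional-index entry has to be
// matched by its "func_index_" prefix; exact partition-type equality never
// matches, which is what triggers the repeated re-initialization.
val FuncIndexPrefix = "func_index_" // prefix mentioned in the report

def isAlreadyInitialized(existingPartitions: Set[String], partitionType: String): Boolean =
  if (partitionType.startsWith(FuncIndexPrefix))
    existingPartitions.exists(_.startsWith(FuncIndexPrefix)) // any func_index_* counts
  else
    existingPartitions.contains(partitionType) // exact match for other partition types
```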



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7474) Functional index creation fails for an existing table as reported by community user

2024-03-03 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7474:
-

Assignee: Vinaykumar Bhat

> Functional index creation fails for an existing table as reported by 
> community user
> ---
>
> Key: HUDI-7474
> URL: https://issues.apache.org/jira/browse/HUDI-7474
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> Investigate issue reported with functional index here - 
> https://github.com/apache/hudi/issues/10110



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7474) Functional index creation fails for an existing table as reported by community user

2024-03-03 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7474:
-

 Summary: Functional index creation fails for an existing table as 
reported by community user
 Key: HUDI-7474
 URL: https://issues.apache.org/jira/browse/HUDI-7474
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinaykumar Bhat


Investigate issue reported with functional index here - 
https://github.com/apache/hudi/issues/10110





[jira] [Updated] (HUDI-7472) Creating a functional index implicitly drops metadata RLI partition

2024-03-03 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7472:
--
Affects Version/s: 1.0.0

> Creating a functional index implicitly drops metadata RLI partition
> ---
>
> Key: HUDI-7472
> URL: https://issues.apache.org/jira/browse/HUDI-7472
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> This is because of a bug in generating the write-config for index creation, 
> which does not set the relevant fields for enabling RLI. The metadata-writer 
> creation code path in `HoodieTable` ends up dropping the metadata partitions 
> for RLI, bloom and col-stats because it assumes the current write-config 
> has disabled them.





[jira] [Commented] (HUDI-7146) Implement secondary index

2024-03-03 Thread Vinaykumar Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822959#comment-17822959
 ] 

Vinaykumar Bhat commented on HUDI-7146:
---

Issue found while testing functional-index related configs.

 

> Implement secondary index
> -
>
> Key: HUDI-7146
> URL: https://issues.apache.org/jira/browse/HUDI-7146
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> # Secondary index schema should be flexible enough to accommodate various 
> kinds of secondary indexes.
>  # Reuse the existing indexing framework as much as possible.
>  # Merge with the existing index config and introduce as few configs as possible.





[jira] [Created] (HUDI-7472) Creating a functional index implicitly drops metadata RLI partition

2024-03-03 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7472:
-

 Summary: Creating a functional index implicitly drops metadata RLI 
partition
 Key: HUDI-7472
 URL: https://issues.apache.org/jira/browse/HUDI-7472
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinaykumar Bhat


This is because of a bug in generating the write-config for index creation, which 
does not set the relevant fields for enabling RLI. The metadata-writer creation 
code path in `HoodieTable` ends up dropping the metadata partitions for RLI, bloom 
and col-stats because it assumes the current write-config has disabled them.
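A minimal sketch of this bug pattern, under assumed names (only `hoodie.metadata.record.index.enable` appears in this thread; the other keys and all functions are illustrative): if the index-creation write-config is built without carrying over the enable flags, the metadata writer sees those partitions as disabled and drops them.

```python
# Illustrative sketch, not Hudi's actual classes or exact config keys.
def build_index_write_config(base_config: dict) -> dict:
    # Bug pattern: only index-creation options are copied; the enable
    # flags for existing metadata partitions are silently lost.
    return {"hoodie.metadata.enable": base_config.get("hoodie.metadata.enable", False)}

def partitions_to_drop(write_config: dict) -> list:
    # The metadata writer drops any partition whose flag is absent or false.
    flags = {
        "record_index": "hoodie.metadata.record.index.enable",
        "column_stats": "hoodie.metadata.index.column.stats.enable",
        "bloom_filters": "hoodie.metadata.index.bloom.filter.enable",
    }
    return sorted(p for p, key in flags.items() if not write_config.get(key, False))

base = {"hoodie.metadata.enable": True,
        "hoodie.metadata.record.index.enable": True}

# RLI was enabled on the table, yet the derived config drops it:
assert "record_index" in partitions_to_drop(build_index_write_config(base))
```

The fix direction implied by the ticket is to propagate the existing enable flags into the derived write-config instead of rebuilding it from scratch.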





[jira] [Assigned] (HUDI-7472) Creating a functional index implicitly drops metadata RLI partition

2024-03-03 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7472:
-

Assignee: Vinaykumar Bhat

> Creating a functional index implicitly drops metadata RLI partition
> ---
>
> Key: HUDI-7472
> URL: https://issues.apache.org/jira/browse/HUDI-7472
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> This is because of a bug in generating the write-config for index creation, 
> which does not set the relevant fields for enabling RLI. The metadata-writer 
> creation code path in `HoodieTable` ends up dropping the metadata partitions 
> for RLI, bloom and col-stats because it assumes the current write-config 
> has disabled them.





[jira] [Updated] (HUDI-7458) Creating multiple functional index fails

2024-03-01 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7458:
--
Reviewers: Sagar Sumit

> Creating multiple functional index fails
> 
>
> Key: HUDI-7458
> URL: https://issues.apache.org/jira/browse/HUDI-7458
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
>
> Looks like an issue in `HoodieSparkFunctionalIndexClient::create(...)`





[jira] [Created] (HUDI-7458) Creating multiple functional index fails

2024-02-29 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7458:
-

 Summary: Creating multiple functional index fails
 Key: HUDI-7458
 URL: https://issues.apache.org/jira/browse/HUDI-7458
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinaykumar Bhat


Looks like an issue in `HoodieSparkFunctionalIndexClient::create(...)`





[jira] [Assigned] (HUDI-7458) Creating multiple functional index fails

2024-02-29 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7458:
-

Assignee: Vinaykumar Bhat

> Creating multiple functional index fails
> 
>
> Key: HUDI-7458
> URL: https://issues.apache.org/jira/browse/HUDI-7458
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> Looks like an issue in `HoodieSparkFunctionalIndexClient::create(...)`





[jira] [Assigned] (HUDI-7405) Implement reader path support for secondary index

2024-02-12 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7405:
-

Assignee: Vinaykumar Bhat

> Implement reader path support for secondary index
> -
>
> Key: HUDI-7405
> URL: https://issues.apache.org/jira/browse/HUDI-7405
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>






[jira] [Created] (HUDI-7405) Implement reader path support for secondary index

2024-02-12 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7405:
-

 Summary: Implement reader path support for secondary index
 Key: HUDI-7405
 URL: https://issues.apache.org/jira/browse/HUDI-7405
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinaykumar Bhat








[jira] [Created] (HUDI-7384) Implement writer path support for secondary index

2024-02-05 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7384:
-

 Summary: Implement writer path support for secondary index
 Key: HUDI-7384
 URL: https://issues.apache.org/jira/browse/HUDI-7384
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinaykumar Bhat


# Basic initialization on an existing table
 # Handle inserts/upserts





[jira] [Assigned] (HUDI-7384) Implement writer path support for secondary index

2024-02-05 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7384:
-

Assignee: Vinaykumar Bhat

> Implement writer path support for secondary index
> -
>
> Key: HUDI-7384
> URL: https://issues.apache.org/jira/browse/HUDI-7384
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> # Basic initialization on an existing table
>  # Handle inserts/upserts





[jira] [Assigned] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF

2024-01-16 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7301:
-

Assignee: Vinaykumar Bhat

> Update hudi docs/websites with documentation for the new spark TVF
> --
>
> Key: HUDI-7301
> URL: https://issues.apache.org/jira/browse/HUDI-7301
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> Hudi documentation and the website need to be updated to reflect the support 
> for the new spark-sql related table-valued functions.





[jira] [Created] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF

2024-01-16 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7301:
-

 Summary: Update hudi docs/websites with documentation for the new 
spark TVF
 Key: HUDI-7301
 URL: https://issues.apache.org/jira/browse/HUDI-7301
 Project: Apache Hudi
  Issue Type: Task
Reporter: Vinaykumar Bhat


Hudi documentation and the website need to be updated to reflect the support for 
the new spark-sql related table-valued functions.





[jira] [Created] (HUDI-7294) Add TVF to query hudi metadata

2024-01-11 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7294:
-

 Summary: Add TVF to query hudi metadata
 Key: HUDI-7294
 URL: https://issues.apache.org/jira/browse/HUDI-7294
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Vinaykumar Bhat
Assignee: Vinaykumar Bhat


Having a table-valued function to query Hudi metadata for a given table through 
spark-sql will help in debugging.
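For illustration, such a TVF might be used like this (the function name and argument here are assumptions made for the sketch, not confirmed by this ticket):

```sql
-- Hypothetical usage; the TVF name and signature are illustrative only.
SELECT * FROM hudi_metadata('/path/to/hudi_table');
```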





[jira] [Closed] (HUDI-7243) Add TVF to query hudi timeline through spark-sql

2024-01-03 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat closed HUDI-7243.
-
Resolution: Fixed

> Add TVF to query hudi timeline through spark-sql
> 
>
> Key: HUDI-7243
> URL: https://issues.apache.org/jira/browse/HUDI-7243
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Having a table-valued function to query the Hudi timeline for a given table 
> through spark-sql will help in debugging.





[jira] [Resolved] (HUDI-7243) Add TVF to query hudi timeline through spark-sql

2024-01-03 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat resolved HUDI-7243.
---

> Add TVF to query hudi timeline through spark-sql
> 
>
> Key: HUDI-7243
> URL: https://issues.apache.org/jira/browse/HUDI-7243
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Having a table-valued function to query the Hudi timeline for a given table 
> through spark-sql will help in debugging.





[jira] [Created] (HUDI-7261) Add TVF to query hudi file system view through spark-sql

2023-12-26 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7261:
-

 Summary: Add TVF to query hudi file system view through spark-sql
 Key: HUDI-7261
 URL: https://issues.apache.org/jira/browse/HUDI-7261
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Vinaykumar Bhat
Assignee: Vinaykumar Bhat


Having a table-valued function to query a Hudi table's file system view through 
spark-sql will help in debugging.





[jira] [Assigned] (HUDI-7243) Add TVF to query hudi timeline through spark-sql

2023-12-20 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7243:
-

Assignee: Vinaykumar Bhat

> Add TVF to query hudi timeline through spark-sql
> 
>
> Key: HUDI-7243
> URL: https://issues.apache.org/jira/browse/HUDI-7243
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
> Fix For: 1.0.0
>
>
> Having a table-valued function to query the Hudi timeline for a given table 
> through spark-sql will help in debugging.





[jira] [Created] (HUDI-7243) Add TVF to query hudi timeline through spark-sql

2023-12-20 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7243:
-

 Summary: Add TVF to query hudi timeline through spark-sql
 Key: HUDI-7243
 URL: https://issues.apache.org/jira/browse/HUDI-7243
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Vinaykumar Bhat
 Fix For: 1.0.0


Having a table-valued function to query the Hudi timeline for a given table through 
spark-sql will help in debugging.


