[jira] [Assigned] (HUDI-7494) multi writer sync partition to glue will missing some partitions

2024-03-08 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-7494:
---

Assignee: nicolas paris

> multi writer sync partition to glue will missing some partitions
> 
>
> Key: HUDI-7494
> URL: https://issues.apache.org/jira/browse/HUDI-7494
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
>
> Glue is still affected when multiple writers are used



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7494) multi writer sync partition to glue will missing some partitions

2024-03-08 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-7494:

Issue Type: Bug  (was: Test)

> multi writer sync partition to glue will missing some partitions
> 
>
> Key: HUDI-7494
> URL: https://issues.apache.org/jira/browse/HUDI-7494
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Priority: Major
>
> Glue is still affected when multiple writers are used





[jira] [Created] (HUDI-7494) multi writer sync partition to glue will missing some partitions

2024-03-08 Thread nicolas paris (Jira)
nicolas paris created HUDI-7494:
---

 Summary: multi writer sync partition to glue will missing some 
partitions
 Key: HUDI-7494
 URL: https://issues.apache.org/jira/browse/HUDI-7494
 Project: Apache Hudi
  Issue Type: Test
Reporter: nicolas paris


Glue is still affected when multiple writers are used





[jira] [Created] (HUDI-7399) Add IT tests for hudi-aws sync

2024-02-09 Thread nicolas paris (Jira)
nicolas paris created HUDI-7399:
---

 Summary: Add IT tests for hudi-aws sync
 Key: HUDI-7399
 URL: https://issues.apache.org/jira/browse/HUDI-7399
 Project: Apache Hudi
  Issue Type: Test
Reporter: nicolas paris
Assignee: nicolas paris


Currently, test coverage for hudi-aws sync is poor due to the lack of an AWS Glue 
binding.





[jira] [Created] (HUDI-7362) Athena does not support s3a partition scheme anymore leading to missing data

2024-01-31 Thread nicolas paris (Jira)
nicolas paris created HUDI-7362:
---

 Summary:  Athena does not support s3a partition scheme anymore 
leading to missing data
 Key: HUDI-7362
 URL: https://issues.apache.org/jira/browse/HUDI-7362
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


see https://github.com/apache/hudi/issues/10595





[jira] [Updated] (HUDI-7351) Hive-sync partition pushdown does not work with glue

2024-01-26 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-7351:

Priority: Minor  (was: Major)

> Hive-sync partition pushdown does not work with glue
> 
>
> Key: HUDI-7351
> URL: https://issues.apache.org/jira/browse/HUDI-7351
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Minor
> Fix For: 1.0.0
>
>
> https://github.com/apache/hudi/issues/10569





[jira] [Updated] (HUDI-7351) Hive-sync partition pushdown does not work with glue

2024-01-26 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-7351:

Fix Version/s: 1.0.0

> Hive-sync partition pushdown does not work with glue
> 
>
> Key: HUDI-7351
> URL: https://issues.apache.org/jira/browse/HUDI-7351
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
> Fix For: 1.0.0
>
>
> https://github.com/apache/hudi/issues/10569





[jira] [Created] (HUDI-7351) Hive-sync partition pushdown does not work with glue

2024-01-26 Thread nicolas paris (Jira)
nicolas paris created HUDI-7351:
---

 Summary: Hive-sync partition pushdown does not work with glue
 Key: HUDI-7351
 URL: https://issues.apache.org/jira/browse/HUDI-7351
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris
Assignee: nicolas paris


https://github.com/apache/hudi/issues/10569





[jira] [Created] (HUDI-7258) Fix dynamodb protocol endpoint

2023-12-25 Thread nicolas paris (Jira)
nicolas paris created HUDI-7258:
---

 Summary: Fix dynamodb protocol endpoint
 Key: HUDI-7258
 URL: https://issues.apache.org/jira/browse/HUDI-7258
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


cf https://github.com/apache/hudi/pull/10397





[jira] [Created] (HUDI-7257) missing datadog configuration metrics on mdt

2023-12-25 Thread nicolas paris (Jira)
nicolas paris created HUDI-7257:
---

 Summary: missing datadog configuration metrics on mdt
 Key: HUDI-7257
 URL: https://issues.apache.org/jira/browse/HUDI-7257
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


cf https://github.com/apache/hudi/issues/10403





[jira] [Assigned] (HUDI-6369) Spacial curve with sample strategy fails when 0 or 1 rows only is incoming

2023-06-26 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6369:
---

Assignee: nicolas paris

> Spacial curve with sample strategy fails when 0 or 1 rows only is incoming
> --
>
> Key: HUDI-6369
> URL: https://issues.apache.org/jira/browse/HUDI-6369
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/8934]





[jira] [Created] (HUDI-6400) Upsert merger should fail user configured class not found

2023-06-16 Thread nicolas paris (Jira)
nicolas paris created HUDI-6400:
---

 Summary: Upsert merger should fail user configured class not found
 Key: HUDI-6400
 URL: https://issues.apache.org/jira/browse/HUDI-6400
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


Currently, when the user-specified class does not exist, Hudi silently 
falls back to the default merger. This can silently corrupt the data by applying 
the wrong logic; it should fail instead.
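
The fail-fast behaviour requested here can be sketched as follows. This is a minimal Python illustration of the idea, not Hudi's actual (Java) merger loading code; `load_merger_class` and `MergerClassNotFoundError` are hypothetical names introduced for the example.

```python
import importlib


class MergerClassNotFoundError(Exception):
    """Raised when the configured merger class cannot be loaded."""


def load_merger_class(qualified_name):
    """Resolve 'package.module.ClassName', raising instead of falling back."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise MergerClassNotFoundError(
            f"not a qualified class name: {qualified_name!r}"
        )
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        # No silent fallback to a default merger: surface the misconfiguration.
        raise MergerClassNotFoundError(
            f"configured merger class {qualified_name!r} not found"
        ) from exc
```

The point of the design is that a typo in the configured class name stops the write, rather than letting a different merge logic run unnoticed.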





[jira] [Created] (HUDI-6399) Datadog metric reporter should not hard fail when api key is invalid

2023-06-16 Thread nicolas paris (Jira)
nicolas paris created HUDI-6399:
---

 Summary: Datadog metric reporter should not hard fail when api key 
is invalid
 Key: HUDI-6399
 URL: https://issues.apache.org/jira/browse/HUDI-6399
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris








[jira] [Created] (HUDI-6365) Duplicate hive sync tool process when custom class is specified

2023-06-12 Thread nicolas paris (Jira)
nicolas paris created HUDI-6365:
---

 Summary: Duplicate hive sync tool process when custom class is 
specified
 Key: HUDI-6365
 URL: https://issues.apache.org/jira/browse/HUDI-6365
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


https://github.com/apache/hudi/issues/8942





[jira] [Assigned] (HUDI-6365) Duplicate hive sync tool process when custom class is specified

2023-06-12 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6365:
---

Assignee: nicolas paris

> Duplicate hive sync tool process when custom class is specified
> ---
>
> Key: HUDI-6365
> URL: https://issues.apache.org/jira/browse/HUDI-6365
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>
> https://github.com/apache/hudi/issues/8942





[jira] [Assigned] (HUDI-6354) MergedReadHandle breaks with ExpressionPayload

2023-06-12 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6354:
---

Assignee: (was: nicolas paris)

> MergedReadHandle breaks with ExpressionPayload
> --
>
> Key: HUDI-6354
> URL: https://issues.apache.org/jira/browse/HUDI-6354
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.14.0
>
>






[jira] [Assigned] (HUDI-6354) MergedReadHandle breaks with ExpressionPayload

2023-06-12 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6354:
---

Assignee: nicolas paris

> MergedReadHandle breaks with ExpressionPayload
> --
>
> Key: HUDI-6354
> URL: https://issues.apache.org/jira/browse/HUDI-6354
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Raymond Xu
>Assignee: nicolas paris
>Priority: Blocker
> Fix For: 0.14.0
>
>






[jira] [Updated] (HUDI-6362) Hive sync update property and serde when no schema change

2023-06-12 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-6362:

Fix Version/s: 0.14.0

> Hive sync update property and serde when no schema change
> -
>
> Key: HUDI-6362
> URL: https://issues.apache.org/jira/browse/HUDI-6362
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
> Fix For: 0.14.0
>
>
> Currently, hive sync updates the table properties only when there is a 
> schema change.
> When users want to modify the table properties, they have two options:
>  # recreate the table from scratch
>  # wait until the schema changes
> It would be convenient to let users update the schema whenever hive sync runs





[jira] [Closed] (HUDI-6362) Hive sync update property and serde when no schema change

2023-06-12 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris closed HUDI-6362.
---
Resolution: Fixed

> Hive sync update property and serde when no schema change
> -
>
> Key: HUDI-6362
> URL: https://issues.apache.org/jira/browse/HUDI-6362
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
> Fix For: 0.14.0
>
>
> Currently, hive sync updates the table properties only when there is a 
> schema change.
> When users want to modify the table properties, they have two options:
>  # recreate the table from scratch
>  # wait until the schema changes
> It would be convenient to let users update the schema whenever hive sync runs





[jira] [Assigned] (HUDI-6362) Hive sync update property and serde when no schema change

2023-06-12 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6362:
---

Assignee: nicolas paris

> Hive sync update property and serde when no schema change
> -
>
> Key: HUDI-6362
> URL: https://issues.apache.org/jira/browse/HUDI-6362
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>
> Currently, hive sync updates the table properties only when there is a 
> schema change.
> When users want to modify the table properties, they have two options:
>  # recreate the table from scratch
>  # wait until the schema changes
> It would be convenient to let users update the schema whenever hive sync runs





[jira] [Created] (HUDI-6362) Hive sync update property and serde when no schema change

2023-06-12 Thread nicolas paris (Jira)
nicolas paris created HUDI-6362:
---

 Summary: Hive sync update property and serde when no schema change
 Key: HUDI-6362
 URL: https://issues.apache.org/jira/browse/HUDI-6362
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: nicolas paris


Currently, hive sync updates the table properties only when there is a 
schema change.

When users want to modify the table properties, they have two options:
 # recreate the table from scratch
 # wait until the schema changes

It would be convenient to let users update the schema whenever hive sync runs
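
The requested behaviour amounts to deciding the property update independently of the schema check. A minimal sketch, with hypothetical function and property names (this is not the hive sync implementation):

```python
def properties_to_update(current_props, desired_props):
    """Return the desired properties that are missing or different in the catalog."""
    return {
        key: value
        for key, value in desired_props.items()
        if current_props.get(key) != value
    }


def plan_sync(schema_changed, current_props, desired_props):
    """Decide what a sync run should touch.

    Property updates are computed on their own, so they are applied even
    when the schema is unchanged.
    """
    return {
        "update_schema": schema_changed,
        "update_properties": properties_to_update(current_props, desired_props),
    }
```

With this split, a run with an unchanged schema still reports the properties that need to be pushed to the catalog.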





[jira] [Assigned] (HUDI-6072) Fix NPE when upsert merger and null map or array

2023-06-10 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6072:
---

Assignee: nicolas paris

> Fix NPE when upsert merger and null map or array
> 
>
> Key: HUDI-6072
> URL: https://issues.apache.org/jira/browse/HUDI-6072
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Danny Chen
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>






[jira] [Assigned] (HUDI-6349) Merger fails when nested type changes nullability support

2023-06-10 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6349:
---

Assignee: nicolas paris

> Merger fails when nested type changes nullability support
> -
>
> Key: HUDI-6349
> URL: https://issues.apache.org/jira/browse/HUDI-6349
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/8920





[jira] [Created] (HUDI-6350) AWS Hive sync: allow to enable/disable MDT on athena

2023-06-10 Thread nicolas paris (Jira)
nicolas paris created HUDI-6350:
---

 Summary: AWS Hive sync: allow to enable/disable MDT on athena 
 Key: HUDI-6350
 URL: https://issues.apache.org/jira/browse/HUDI-6350
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: nicolas paris


Athena has a nice (but hidden) feature that leverages the Hudi metadata table 
instead of listing files on S3. In theory, this reduces S3 slow-down trouble 
(too much listing) and speeds up query planning.

This can easily be achieved by adding the table property:

'hudi.metadata-listing-enabled'='TRUE'

While this feature really helps on Athena v2, on Athena v3, at the time of 
writing, something goes very wrong and queries can be 100x slower. 

see https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html
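
As a sketch, the property can be toggled on an existing table with an `ALTER TABLE ... SET TBLPROPERTIES` statement. The helper below only builds the statement text; the function name, database, and table names are hypothetical, and using 'FALSE' to switch the feature off again is an assumption.

```python
def metadata_listing_ddl(database, table, enabled=True):
    """Build the DDL that sets 'hudi.metadata-listing-enabled' on a table."""
    # Assumption: 'FALSE' turns the feature off again.
    value = "TRUE" if enabled else "FALSE"
    return (
        f"ALTER TABLE {database}.{table} "
        f"SET TBLPROPERTIES ('hudi.metadata-listing-enabled'='{value}')"
    )


# Hypothetical database and table names, for illustration only.
print(metadata_listing_ddl("my_db", "my_hudi_table"))
```

Generating the statement from a boolean is what an enable/disable option in the sync tool would boil down to.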





[jira] [Created] (HUDI-6349) Merger fails when nested type changes nullability support

2023-06-10 Thread nicolas paris (Jira)
nicolas paris created HUDI-6349:
---

 Summary: Merger fails when nested type changes nullability support
 Key: HUDI-6349
 URL: https://issues.apache.org/jira/browse/HUDI-6349
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


https://github.com/apache/hudi/issues/8920





[jira] [Assigned] (HUDI-6231) Hive sync aws to support comments on columns

2023-05-17 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6231:
---

Assignee: nicolas paris

> Hive sync aws to support comments on columns
> 
>
> Key: HUDI-6231
> URL: https://issues.apache.org/jira/browse/HUDI-6231
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>
> So far, only hive sync with the vanilla metastore supports column comments





[jira] [Assigned] (HUDI-6230) Make hive sync aws support partition indexes

2023-05-17 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6230:
---

Assignee: nicolas paris

> Make hive sync aws support partition indexes
> 
>
> Key: HUDI-6230
> URL: https://issues.apache.org/jira/browse/HUDI-6230
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>
> Glue provides partition indexing, which greatly speeds up partition retrieval. 
> So far it is not supported. A new hive-sync configuration to activate 
> the feature, and optionally specify which partition columns to index, would 
> be helpful.
> Also, this is an operation that should not be done at table creation time, but 
> could be activated/deactivated at will
>  
> https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index





[jira] [Created] (HUDI-6231) Hive sync aws to support comments on columns

2023-05-17 Thread nicolas paris (Jira)
nicolas paris created HUDI-6231:
---

 Summary: Hive sync aws to support comments on columns
 Key: HUDI-6231
 URL: https://issues.apache.org/jira/browse/HUDI-6231
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


So far, only hive sync with the vanilla metastore supports column comments





[jira] [Created] (HUDI-6230) Make hive sync aws support partition indexes

2023-05-17 Thread nicolas paris (Jira)
nicolas paris created HUDI-6230:
---

 Summary: Make hive sync aws support partition indexes
 Key: HUDI-6230
 URL: https://issues.apache.org/jira/browse/HUDI-6230
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


Glue provides partition indexing, which greatly speeds up partition retrieval. 

So far it is not supported. A new hive-sync configuration to activate 
the feature, and optionally specify which partition columns to index, would be 
helpful.

Also, this is an operation that should not be done at table creation time, but 
could be activated/deactivated at will

 
https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index
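
For illustration, Glue exposes partition indexes through its CreatePartitionIndex API. The sketch below only builds the request arguments and does not call AWS; the helper name, database, table, index, and column names are all hypothetical.

```python
def partition_index_request(database, table, index_name, keys):
    """Build keyword arguments for Glue's CreatePartitionIndex call."""
    if not keys:
        raise ValueError("at least one partition column is required")
    return {
        "DatabaseName": database,
        "TableName": table,
        "PartitionIndex": {"Keys": list(keys), "IndexName": index_name},
    }


# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("glue").create_partition_index(**request)
request = partition_index_request("my_db", "my_hudi_table", "idx_dt", ["dt"])
```

Keeping the index columns as an explicit list mirrors the idea of an optional setting naming which partition columns to index.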





[jira] [Created] (HUDI-6226) Leverage parquet bloom filter feature

2023-05-16 Thread nicolas paris (Jira)
nicolas paris created HUDI-6226:
---

 Summary: Leverage parquet bloom filter feature
 Key: HUDI-6226
 URL: https://issues.apache.org/jira/browse/HUDI-6226
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


Hudi should support vanilla Parquet bloom filters, because they are a standard 
optimization method supported by every query engine using Parquet 1.12 and 
above. Moreover, Hudi does not provide such an optimization method: Hudi blooms 
are not used for select queries and are only useful for update operations. 
Providing vanilla Parquet bloom support in Hudi would add another optimization 
(alongside ones such as z-order and Parquet stats) almost for free.

see [https://github.com/apache/hudi/issues/7117]
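
For reference, parquet-hadoop (1.12+) enables bloom filters per column through writer properties of the form `parquet.bloom.filter.enabled#<column>`. The sketch below only builds such a property map; whether a given Hudi version passes these properties through to its Parquet writer is an assumption to verify, and the helper name and column name are hypothetical.

```python
def parquet_bloom_options(columns_to_ndv):
    """Build parquet-hadoop bloom filter settings for the given columns.

    columns_to_ndv maps a column name to its expected number of distinct values.
    """
    options = {}
    for column, expected_ndv in columns_to_ndv.items():
        options[f"parquet.bloom.filter.enabled#{column}"] = "true"
        options[f"parquet.bloom.filter.expected.ndv#{column}"] = str(expected_ndv)
    return options


# Hypothetical column name, for illustration only.
options = parquet_bloom_options({"user_id": 1000000})
```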





[jira] [Assigned] (HUDI-5533) Table comments not showing up on spark-sql describe

2023-05-11 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-5533:
---

Assignee: nicolas paris

> Table comments not showing up on spark-sql describe
> ---
>
> Key: HUDI-5533
> URL: https://issues.apache.org/jira/browse/HUDI-5533
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Jonathan Vexler
>Assignee: nicolas paris
>Priority: Minor
>  Labels: pull-request-available
>
> If you add a comment to the schema and write to a hudi table, the comment 
> will show as null when using spark-sql describe on the table.
>  
> User reported issue [https://github.com/apache/hudi/issues/7531] with a very 
> good reproducible example. The issue presented when I tried the example.
>  
>  





[jira] [Assigned] (HUDI-6150) Make hive sync to provide bucketing metadata when index=bucket

2023-05-06 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-6150:
---

Assignee: nicolas paris

> Make hive sync to provide bucketing metadata when index=bucket
> --
>
> Key: HUDI-6150
> URL: https://issues.apache.org/jira/browse/HUDI-6150
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Minor
> Fix For: 0.14.0
>
>
> So far, hive tables are only bucketed when the strategy used is jdbc. 
> Bucketing information is used by query engines to improve the plan. We 
> could provide bucketing for other strategies:
>  * hms
>  * sql
>  * AWS glue





[jira] [Created] (HUDI-6150) Make hive sync to provide bucketing metadata when index=bucket

2023-04-28 Thread nicolas paris (Jira)
nicolas paris created HUDI-6150:
---

 Summary: Make hive sync to provide bucketing metadata when 
index=bucket
 Key: HUDI-6150
 URL: https://issues.apache.org/jira/browse/HUDI-6150
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


So far, hive tables are only bucketed when the strategy used is jdbc. Bucketing 
information is used by query engines to improve the plan. We could 
provide bucketing for other strategies:
 * hms
 * sql
 * AWS glue





[jira] [Created] (HUDI-6061) NPE with nullable MapType and new hudi merger

2023-04-11 Thread nicolas paris (Jira)
nicolas paris created HUDI-6061:
---

 Summary: NPE with nullable MapType and new hudi merger
 Key: HUDI-6061
 URL: https://issues.apache.org/jira/browse/HUDI-6061
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Reporter: nicolas paris
 Fix For: 0.13.1


In 0.13.0, when dealing with null map values during an upsert with the new Hudi 
merger API, a null pointer exception is raised. AFAIK, it happens when both 
MapTypes contain nulls in different ways.

See the issue (https://github.com/apache/hudi/issues/8431) for details

See the PR (https://github.com/apache/hudi/pull/8432) for details





[jira] [Updated] (HUDI-4995) Depency conflicts on apache http with other projects

2022-10-07 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4995:

Priority: Minor  (was: Major)

> Depency conflicts on apache http with other projects
> 
>
> Key: HUDI-4995
> URL: https://issues.apache.org/jira/browse/HUDI-4995
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Priority: Minor
> Fix For: 0.12.1
>
>
> Hudi imports org.apache.http, which can collide with other libs such as the 
> Elasticsearch client. This makes the spark-bundle create conflicts when using 
> both libs. 





[jira] [Updated] (HUDI-4995) Depency conflicts on apache http with other projects

2022-10-07 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4995:

Fix Version/s: 0.12.1

> Depency conflicts on apache http with other projects
> 
>
> Key: HUDI-4995
> URL: https://issues.apache.org/jira/browse/HUDI-4995
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Priority: Major
> Fix For: 0.12.1
>
>
> Hudi imports org.apache.http, which can collide with other libs such as the 
> Elasticsearch client. This makes the spark-bundle create conflicts when using 
> both libs. 





[jira] [Created] (HUDI-4995) Depency conflicts on apache http with other projects

2022-10-07 Thread nicolas paris (Jira)
nicolas paris created HUDI-4995:
---

 Summary: Depency conflicts on apache http with other projects
 Key: HUDI-4995
 URL: https://issues.apache.org/jira/browse/HUDI-4995
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


Hudi imports org.apache.http, which can collide with other libs such as the 
Elasticsearch client. This makes the spark-bundle create conflicts when using 
both libs. 





[jira] [Updated] (HUDI-4781) Allow omit metadata fields for hive sync

2022-09-22 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4781:

Fix Version/s: 0.13.0
   (was: 0.12.1)

> Allow omit metadata fields for hive sync
> 
>
> Key: HUDI-4781
> URL: https://issues.apache.org/jira/browse/HUDI-4781
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: nicolas paris
>Priority: Minor
>  Labels: hudi-on-call, pull-request-available
> Fix For: 0.13.0
>
>
> When true, this won't create the metadata fields in the hive table, and hides 
> them from end users





[jira] [Updated] (HUDI-4781) Allow omit metadata fields for hive sync

2022-09-22 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4781:

Fix Version/s: 0.12.1
   (was: 0.13.0)

> Allow omit metadata fields for hive sync
> 
>
> Key: HUDI-4781
> URL: https://issues.apache.org/jira/browse/HUDI-4781
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: nicolas paris
>Priority: Minor
>  Labels: hudi-on-call, pull-request-available
> Fix For: 0.12.1
>
>
> When true, this won't create the metadata fields in the hive table, and hides 
> them from end users





[jira] [Updated] (HUDI-4764) AwsglueSync turn already exist error into warning

2022-09-22 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4764:

Fix Version/s: 0.12.1

> AwsglueSync turn already exist error into warning
> -
>
> Key: HUDI-4764
> URL: https://issues.apache.org/jira/browse/HUDI-4764
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Priority: Major
> Fix For: 0.12.1
>
>
> Under some conditions (OCC?), the AWSGlueCatalogSyncClient fails with an 
> "already exists" exception for a partition. In any case, if a given partition 
> already exists, this should not fail the sync, but only raise a warning.
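
The proposed handling can be sketched as below. This is a toy Python illustration, not the AWSGlueCatalogSyncClient code; `PartitionAlreadyExistsError`, `add_partitions`, and the `create_partition` callback are hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("glue_sync_sketch")


class PartitionAlreadyExistsError(Exception):
    """Hypothetical stand-in for the catalog's 'already exists' error."""


def add_partitions(create_partition, partitions):
    """Create each partition, treating 'already exists' as a warning.

    Returns the partitions that were already present.
    """
    already_present = []
    for partition in partitions:
        try:
            create_partition(partition)
        except PartitionAlreadyExistsError:
            # Another writer created it first: log and keep going.
            logger.warning("partition %s already exists, skipping", partition)
            already_present.append(partition)
    return already_present
```

Any other exception still propagates, so only the benign "already exists" case is downgraded.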





[jira] [Assigned] (HUDI-4792) Speed up cleaning with metadata table enabled

2022-09-06 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris reassigned HUDI-4792:
---

Assignee: nicolas paris

> Speed up cleaning with metadata table enabled 
> --
>
> Key: HUDI-4792
> URL: https://issues.apache.org/jira/browse/HUDI-4792
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
>
> Currently, fetching the file groups to be deleted is parallelized over each 
> partition. As a result, with many partitions, many calls are made to the 
> metadata. While this is OK for the file system view, it is highly inefficient 
> with the metadata table (MDT) view. Each call likely triggers a merge-on-read 
> (MoR) on the MDT, and with thousands of partitions the process is 
> incredibly slow. 
> I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on 
> a 40k-partition Hudi table:
>  * w/ MDT: 5 hours
>  * w/o MDT: 5 minutes
> This slowness makes the use of MDT unreasonable with many 
> partitions, because cleaning is a must-have.
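
To picture why the per-partition pattern scales poorly, the toy sketch below counts metadata calls for a per-partition fetch versus a single batched fetch. All names are hypothetical, and batching is shown only as a contrast; the issue does not state how the fix is implemented.

```python
class CountingMetadataView:
    """Toy stand-in for a metadata view that counts how often it is queried."""

    def __init__(self, file_groups_by_partition):
        self._data = file_groups_by_partition
        self.calls = 0

    def file_groups(self, partitions):
        """Return the file groups of the given partitions in one call."""
        self.calls += 1
        return {p: self._data[p] for p in partitions}


def fetch_per_partition(view, partitions):
    """One metadata call per partition (the pattern described above)."""
    result = {}
    for partition in partitions:
        result.update(view.file_groups([partition]))
    return result


def fetch_batched(view, partitions):
    """A single metadata call covering all partitions."""
    return view.file_groups(list(partitions))
```

If each call carries a fixed cost (such as a merge on the MDT), the total cost of the first pattern grows with the number of partitions, while the second pays it once.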





[jira] [Updated] (HUDI-4792) Speed up cleaning with metadata table enabled

2022-09-06 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4792:

Description: 
Currently, fetching the file groups to be deleted is parallelized over each 
partition. As a result, with many partitions, many calls are made to the 
metadata. While this is OK for the file system view, it is highly inefficient 
with the metadata table (MDT) view. Each call likely triggers a merge-on-read 
(MoR) on the MDT, and with thousands of partitions the process is 
incredibly slow. 

I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on a 
40k-partition Hudi table:
 * w/ MDT: 5 hours
 * w/o MDT: 5 minutes

This slowness makes the use of MDT unreasonable with many 
partitions, because cleaning is a must-have.

  was:
Currently, fetching the file groups to be deleted is parallelized over each 
partition. As a result, with many partitions, many calls are made to the 
metadata. While this is OK for the file system view, it is highly inefficient 
with the metadata table (MDT) view. Each call likely triggers a merge-on-read 
(MoR) on the MDT, and with thousands of partitions the process is 
incredibly slow. 

I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on a 
40k-partition Hudi table:
 * w/ MDT: 5 hours
 * w/o MDT: 5 minutes


> Speed up cleaning with metadata table enabled 
> --
>
> Key: HUDI-4792
> URL: https://issues.apache.org/jira/browse/HUDI-4792
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Priority: Major
>  Labels: pull-request-available
>
> Currently, fetching the file groups to be deleted is parallelized over each 
> partition. As a result, with many partitions, many calls are made to the 
> metadata. While this is OK for the file system view, it is highly inefficient 
> with the metadata table (MDT) view. Each call likely triggers a merge-on-read 
> (MoR) on the MDT, and with thousands of partitions the process is 
> incredibly slow. 
> I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on 
> a 40k-partition Hudi table:
>  * w/ MDT: 5 hours
>  * w/o MDT: 5 minutes
> This slowness makes the use of MDT unreasonable with many 
> partitions, because cleaning is a must-have.





[jira] [Updated] (HUDI-4792) Speed up cleaning with metadata table enabled

2022-09-06 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4792:

Description: 
Currently, fetching the file groups to be deleted is parallelized over each
partition. As a result, with many partitions, many calls are made to the
metadata. While this is fine for the file system view, it is highly
inefficient with the metadata table (MDT) view. Each call likely triggers a
merge-on-read (MoR) on the MDT, and with thousands of partitions the process
is incredibly slow.

I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on a
Hudi table with 40k partitions:
 * w/ MDT: 5 hours
 * w/o MDT: 5 minutes

  was:
Currently, fetching the file groups to be deleted is parallelized over each
partition. As a result, with many partitions, many calls are made to the
metadata. While this is fine for the file system view, it is highly
inefficient with the metadata table (MDT) view. Each call likely triggers a
merge-on-read (MoR) on the MDT, and with thousands of partitions the process
is incredibly slow.

I benchmarked cleaning on the same table w/ and w/o MDT on a Hudi table with
40k partitions:
 * w/ MDT: 5 hours
 * w/o MDT: 5 minutes


> Speed up cleaning with metadata table enabled 
> --
>
> Key: HUDI-4792
> URL: https://issues.apache.org/jira/browse/HUDI-4792
> Project: Apache Hudi
>  Issue Type: Improvement
> Reporter: nicolas paris
> Priority: Major
>  Labels: pull-request-available
>
> Currently, fetching the file groups to be deleted is parallelized over each
> partition. As a result, with many partitions, many calls are made to the
> metadata. While this is fine for the file system view, it is highly
> inefficient with the metadata table (MDT) view. Each call likely triggers a
> merge-on-read (MoR) on the MDT, and with thousands of partitions the process
> is incredibly slow.
> I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on
> a Hudi table with 40k partitions:
>  * w/ MDT: 5 hours
>  * w/o MDT: 5 minutes





[jira] [Created] (HUDI-4792) Speed up cleaning with metadata table enabled

2022-09-06 Thread nicolas paris (Jira)
nicolas paris created HUDI-4792:
---

 Summary: Speed up cleaning with metadata table enabled 
 Key: HUDI-4792
 URL: https://issues.apache.org/jira/browse/HUDI-4792
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


Currently, fetching the file groups to be deleted is parallelized over each
partition. As a result, with many partitions, many calls are made to the
metadata. While this is fine for the file system view, it is highly
inefficient with the metadata table (MDT) view. Each call likely triggers a
merge-on-read (MoR) on the MDT, and with thousands of partitions the process
is incredibly slow.

I benchmarked cleaning on the same table w/ and w/o MDT on a Hudi table with
40k partitions:
 * w/ MDT: 5 hours
 * w/o MDT: 5 minutes
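
To illustrate the call pattern described above, here is a minimal Python
sketch. It is not Hudi code: FakeMetadataView, file_groups_for and
file_groups_for_all are hypothetical names, and the batched lookup is only one
possible direction, not necessarily what a fix would do. The sketch just
counts how many lookups each approach makes against the metadata.

```python
from typing import Dict, Iterable, List


class FakeMetadataView:
    """Hypothetical stand-in for a metadata view; it only counts lookups."""

    def __init__(self, listing: Dict[str, List[str]]) -> None:
        self._listing = listing
        self.calls = 0  # number of lookups made against the metadata

    def file_groups_for(self, partition: str) -> List[str]:
        # One lookup per partition: the pattern described in the issue.
        self.calls += 1
        return list(self._listing.get(partition, []))

    def file_groups_for_all(self, partitions: Iterable[str]) -> Dict[str, List[str]]:
        # One lookup for a whole batch of partitions (assumed alternative).
        self.calls += 1
        return {p: list(self._listing.get(p, [])) for p in partitions}


def per_partition(view: FakeMetadataView, partitions: List[str]) -> Dict[str, List[str]]:
    # Issues as many lookups as there are partitions.
    return {p: view.file_groups_for(p) for p in partitions}


def batched(view: FakeMetadataView, partitions: List[str]) -> Dict[str, List[str]]:
    # Issues a single lookup regardless of the number of partitions.
    return view.file_groups_for_all(partitions)


# 40k partitions, as in the benchmark above; one made-up file group each.
partitions = [f"dt={i:05d}" for i in range(40_000)]
listing = {p: [f"{p}/fg-0"] for p in partitions}

view_a = FakeMetadataView(listing)
result_a = per_partition(view_a, partitions)

view_b = FakeMetadataView(listing)
result_b = batched(view_b, partitions)

print(view_a.calls, view_b.calls)
```

Both approaches return the same file groups; only the number of lookups
differs. If each lookup carries a fixed cost on the MDT, as suspected above,
that cost is multiplied by the number of partitions in the first approach.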





[jira] [Created] (HUDI-4764) AwsglueSync turn already exist error into warning

2022-09-01 Thread nicolas paris (Jira)
nicolas paris created HUDI-4764:
---

 Summary: AwsglueSync turn already exist error into warning
 Key: HUDI-4764
 URL: https://issues.apache.org/jira/browse/HUDI-4764
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


In some conditions (OCC?), the AWSGlueCatalogSyncClient fails with an
"already exists" exception for a partition. In any case, if a given partition
already exists, this should not fail the sync but only raise a warning.
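
The requested behaviour can be sketched as follows. This is a minimal Python
illustration, not the AWSGlueCatalogSyncClient code: AlreadyExistsError,
add_partitions and the in-memory catalog are hypothetical stand-ins for the
Glue error and the sync loop.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("glue_sync_sketch")


class AlreadyExistsError(Exception):
    """Hypothetical error raised when a partition is already in the catalog."""


def add_partitions(create: Callable[[str], None], partitions: Iterable[str]) -> List[str]:
    """Create each partition; warn and continue when one already exists.

    Returns the partitions that were skipped because they already existed.
    """
    skipped: List[str] = []
    for partition in partitions:
        try:
            create(partition)
        except AlreadyExistsError:
            # Another writer got there first: not fatal for the sync.
            logger.warning("Partition %s already exists, skipping", partition)
            skipped.append(partition)
    return skipped


# In-memory catalog standing in for Glue (assumption for the example).
catalog = {"dt=2022-09-01"}


def create(partition: str) -> None:
    if partition in catalog:
        raise AlreadyExistsError(partition)
    catalog.add(partition)


skipped = add_partitions(create, ["dt=2022-09-01", "dt=2022-09-02"])
print(skipped, sorted(catalog))
```

Only the "already exists" case is downgraded to a warning; any other error
still propagates and fails the sync.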





[jira] [Created] (HUDI-4763) Allow hoodie read client to choose index

2022-09-01 Thread nicolas paris (Jira)
nicolas paris created HUDI-4763:
---

 Summary: Allow hoodie read client to choose index
 Key: HUDI-4763
 URL: https://issues.apache.org/jira/browse/HUDI-4763
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


Currently the HoodieReadClient has a hardcoded bloom index. We should allow
choosing another one, e.g. GLOBAL_BLOOM.

 





[jira] [Created] (HUDI-4762) Hive sync update schema removes columns

2022-09-01 Thread nicolas paris (Jira)
nicolas paris created HUDI-4762:
---

 Summary: Hive sync update schema removes columns 
 Key: HUDI-4762
 URL: https://issues.apache.org/jira/browse/HUDI-4762
 Project: Apache Hudi
  Issue Type: Bug
Reporter: nicolas paris


Currently, when a Hudi table is moved from schema1 to schema2 and data is then
inserted with the old schema1, schema2 is kept for the whole table.

This is not consistent with the Hive metastore, whose schema gets updated back
to the old schema1.

The sync should avoid updating the metastore schema when the input is only
missing columns.
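
A minimal Python sketch of the proposed check, assuming schemas are
represented as plain column-name to type mappings.
should_update_metastore_schema is a hypothetical helper, not part of the Hive
sync code.

```python
from typing import Dict


def should_update_metastore_schema(metastore: Dict[str, str], incoming: Dict[str, str]) -> bool:
    """Return False when the incoming schema only lacks columns the metastore has."""
    for name, col_type in incoming.items():
        if name not in metastore:
            return True  # the input brings a new column
        if metastore[name] != col_type:
            return True  # the input changes a column type
    return False  # the input is a subset of the metastore schema


schema1 = {"id": "int", "name": "string"}
schema2 = {"id": "int", "name": "string", "price": "double"}

# Writing old schema1 data into a table already synced with schema2:
print(should_update_metastore_schema(metastore=schema2, incoming=schema1))
# Moving the table from schema1 to schema2:
print(should_update_metastore_schema(metastore=schema1, incoming=schema2))
```

With such a check, inserting data that uses the old schema1 would leave the
metastore at schema2, matching what the table itself keeps.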





[jira] [Created] (HUDI-2427) SQL stmt broken with spark 3.1.x

2021-09-13 Thread nicolas paris (Jira)
nicolas paris created HUDI-2427:
---

 Summary: SQL stmt broken with spark 3.1.x
 Key: HUDI-2427
 URL: https://issues.apache.org/jira/browse/HUDI-2427
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core
Reporter: nicolas paris


In my experiments, the new SQL statement features of Hudi 0.9 do not work with
Spark 3.1.x, only with Spark 3.0.x.

Steps to reproduce:
 {{spark-3.1.2-bin-hadoop2.7/bin/spark-shell \
--packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.1.2 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

spark.sql("""
create table h3 using hudi
as
select 1 as id, 'a1' as name, 10 as price
""")

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.$anonfun$alignOutputFields$6(InsertIntoHoodieTableCommand.scala:152)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignOutputFields(InsertIntoHoodieTableCommand.scala:148)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:95)
  at org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:84)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
  ... 60 elided}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2426) spark sql extensions breaks read.table from metastore

2021-09-13 Thread nicolas paris (Jira)
nicolas paris created HUDI-2426:
---

 Summary: spark sql extensions breaks read.table from metastore
 Key: HUDI-2426
 URL: https://issues.apache.org/jira/browse/HUDI-2426
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: nicolas paris


Enabling the Hudi Spark SQL extensions breaks the ability to read a Hudi table
from the metastore in Spark:

 bash-4.2$ ./spark3.0.2/bin/spark-shell \
 --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.1.2 \
 --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
 --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

 

scala> spark.table("default.test_hudi_table").show
java.lang.UnsupportedOperationException: Unsupported parseMultipartIdentifier method
 at org.apache.spark.sql.parser.HoodieCommonSqlParser.parseMultipartIdentifier(HoodieCommonSqlParser.scala:65)
 at org.apache.spark.sql.SparkSession.table(SparkSession.scala:581)
 ... 47 elided

 

Removing the config makes the Hive table readable again from Spark.

This affects at least Spark 3.0.x and 3.1.x.


