[jira] [Assigned] (HUDI-7494) multi writer sync partition to glue will missing some partitions
[ https://issues.apache.org/jira/browse/HUDI-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-7494: --- Assignee: nicolas paris > multi writer sync partition to glue will missing some partitions > > > Key: HUDI-7494 > URL: https://issues.apache.org/jira/browse/HUDI-7494 > Project: Apache Hudi > Issue Type: Bug >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > > Glue still is affected during multi writers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7494) multi writer sync partition to glue will missing some partitions
[ https://issues.apache.org/jira/browse/HUDI-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-7494: Issue Type: Bug (was: Test) > multi writer sync partition to glue will missing some partitions > > > Key: HUDI-7494 > URL: https://issues.apache.org/jira/browse/HUDI-7494 > Project: Apache Hudi > Issue Type: Bug >Reporter: nicolas paris >Priority: Major > > Glue still is affected during multi writers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7494) multi writer sync partition to glue will missing some partitions
nicolas paris created HUDI-7494: --- Summary: multi writer sync partition to glue will missing some partitions Key: HUDI-7494 URL: https://issues.apache.org/jira/browse/HUDI-7494 Project: Apache Hudi Issue Type: Test Reporter: nicolas paris Glue still is affected during multi writers -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7399) Add IT tests for hudi-aws sync
nicolas paris created HUDI-7399: --- Summary: Add IT tests for hudi-aws sync Key: HUDI-7399 URL: https://issues.apache.org/jira/browse/HUDI-7399 Project: Apache Hudi Issue Type: Test Reporter: nicolas paris Assignee: nicolas paris Currently, test coverage for hudi-aws sync is poor due to the lack of an AWS Glue binding. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7362) Athena does not support s3a partition scheme anymore leading to missing data
nicolas paris created HUDI-7362: --- Summary: Athena does not support s3a partition scheme anymore leading to missing data Key: HUDI-7362 URL: https://issues.apache.org/jira/browse/HUDI-7362 Project: Apache Hudi Issue Type: Bug Reporter: nicolas paris see https://github.com/apache/hudi/issues/10595 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7351) Hive-sync partition pushdown does not work with glue
[ https://issues.apache.org/jira/browse/HUDI-7351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-7351: Priority: Minor (was: Major) > Hive-sync partition pushdown does not work with glue > > > Key: HUDI-7351 > URL: https://issues.apache.org/jira/browse/HUDI-7351 > Project: Apache Hudi > Issue Type: Bug >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Minor > Fix For: 1.0.0 > > > https://github.com/apache/hudi/issues/10569 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7351) Hive-sync partition pushdown does not work with glue
[ https://issues.apache.org/jira/browse/HUDI-7351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-7351: Fix Version/s: 1.0.0 > Hive-sync partition pushdown does not work with glue > > > Key: HUDI-7351 > URL: https://issues.apache.org/jira/browse/HUDI-7351 > Project: Apache Hudi > Issue Type: Bug >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Fix For: 1.0.0 > > > https://github.com/apache/hudi/issues/10569 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7351) Hive-sync partition pushdown does not work with glue
nicolas paris created HUDI-7351: --- Summary: Hive-sync partition pushdown does not work with glue Key: HUDI-7351 URL: https://issues.apache.org/jira/browse/HUDI-7351 Project: Apache Hudi Issue Type: Bug Reporter: nicolas paris Assignee: nicolas paris https://github.com/apache/hudi/issues/10569 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7258) Fix dynamodb protocol endpoint
nicolas paris created HUDI-7258: --- Summary: Fix dynamodb protocol endpoint Key: HUDI-7258 URL: https://issues.apache.org/jira/browse/HUDI-7258 Project: Apache Hudi Issue Type: Bug Reporter: nicolas paris cf https://github.com/apache/hudi/pull/10397 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7257) missing datadog configuration metrics on mdt
nicolas paris created HUDI-7257: --- Summary: missing datadog configuration metrics on mdt Key: HUDI-7257 URL: https://issues.apache.org/jira/browse/HUDI-7257 Project: Apache Hudi Issue Type: Bug Reporter: nicolas paris cf https://github.com/apache/hudi/issues/10403 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6369) Spacial curve with sample strategy fails when 0 or 1 rows only is incoming
[ https://issues.apache.org/jira/browse/HUDI-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6369: --- Assignee: nicolas paris > Spacial curve with sample strategy fails when 0 or 1 rows only is incoming > -- > > Key: HUDI-6369 > URL: https://issues.apache.org/jira/browse/HUDI-6369 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Github Issue - [https://github.com/apache/hudi/issues/8934] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6400) Upsert merger should fail user configured class not found
nicolas paris created HUDI-6400: --- Summary: Upsert merger should fail user configured class not found Key: HUDI-6400 URL: https://issues.apache.org/jira/browse/HUDI-6400 Project: Apache Hudi Issue Type: Improvement Reporter: nicolas paris Currently, when the user-specified merger class does not exist, Hudi silently falls back to the default merger. This can silently corrupt the data by applying the wrong merge logic; the write should fail instead. -- This message was sent by Atlassian Jira (v8.20.10#820010)
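The fail-fast behaviour requested in HUDI-6400 could look roughly like the sketch below. It is written in Python for brevity and is not Hudi code: `load_merger_class` and `MergerClassNotFoundError` are hypothetical names, and the only point is that a missing class raises instead of falling back to a default.

```python
import importlib


class MergerClassNotFoundError(Exception):
    """Hypothetical error raised when a configured merger class cannot be loaded."""


def load_merger_class(qualified_name):
    """Resolve 'package.module.ClassName' to a class, failing loudly if it is missing."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise MergerClassNotFoundError(
            f"not a fully qualified class name: {qualified_name!r}"
        )
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        # No silent fallback to a default merger: surface the misconfiguration.
        raise MergerClassNotFoundError(
            f"cannot load merger class {qualified_name!r}"
        ) from exc
```

A caller would pass the configured class name once at writer start-up, so a typo in the configuration stops the job before any record is merged.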
[jira] [Created] (HUDI-6399) Datadog metric reporter should not hard fail when api key is invalid
nicolas paris created HUDI-6399: --- Summary: Datadog metric reporter should not hard fail when api key is invalid Key: HUDI-6399 URL: https://issues.apache.org/jira/browse/HUDI-6399 Project: Apache Hudi Issue Type: Bug Reporter: nicolas paris -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6365) Duplicate hive sync tool process when custom class is specified
nicolas paris created HUDI-6365: --- Summary: Duplicate hive sync tool process when custom class is specified Key: HUDI-6365 URL: https://issues.apache.org/jira/browse/HUDI-6365 Project: Apache Hudi Issue Type: Bug Reporter: nicolas paris https://github.com/apache/hudi/issues/8942 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6365) Duplicate hive sync tool process when custom class is specified
[ https://issues.apache.org/jira/browse/HUDI-6365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6365: --- Assignee: nicolas paris > Duplicate hive sync tool process when custom class is specified > --- > > Key: HUDI-6365 > URL: https://issues.apache.org/jira/browse/HUDI-6365 > Project: Apache Hudi > Issue Type: Bug >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > > https://github.com/apache/hudi/issues/8942 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6354) MergedReadHandle breaks with ExpressionPayload
[ https://issues.apache.org/jira/browse/HUDI-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6354: --- Assignee: (was: nicolas paris) > MergedReadHandle breaks with ExpressionPayload > -- > > Key: HUDI-6354 > URL: https://issues.apache.org/jira/browse/HUDI-6354 > Project: Apache Hudi > Issue Type: Bug > Components: index >Reporter: Raymond Xu >Priority: Blocker > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6354) MergedReadHandle breaks with ExpressionPayload
[ https://issues.apache.org/jira/browse/HUDI-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6354: --- Assignee: nicolas paris > MergedReadHandle breaks with ExpressionPayload > -- > > Key: HUDI-6354 > URL: https://issues.apache.org/jira/browse/HUDI-6354 > Project: Apache Hudi > Issue Type: Bug > Components: index >Reporter: Raymond Xu >Assignee: nicolas paris >Priority: Blocker > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6362) Hive sync update property and serde when no schema change
[ https://issues.apache.org/jira/browse/HUDI-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-6362: Fix Version/s: 0.14.0 > Hive sync update property and serde when no schema change > - > > Key: HUDI-6362 > URL: https://issues.apache.org/jira/browse/HUDI-6362 > Project: Apache Hudi > Issue Type: New Feature >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Fix For: 0.14.0 > > > Currently, hive sync updates the table properties only when there is a > schema change. > A user who wants to modify the table properties has two options: > # recreate the table from scratch > # wait until the schema changes > It would be convenient to let users update the schema whenever hive sync runs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6362) Hive sync update property and serde when no schema change
[ https://issues.apache.org/jira/browse/HUDI-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris closed HUDI-6362. --- Resolution: Fixed > Hive sync update property and serde when no schema change > - > > Key: HUDI-6362 > URL: https://issues.apache.org/jira/browse/HUDI-6362 > Project: Apache Hudi > Issue Type: New Feature >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Fix For: 0.14.0 > > > Currently, hive sync updates the table properties only when there is a > schema change. > A user who wants to modify the table properties has two options: > # recreate the table from scratch > # wait until the schema changes > It would be convenient to let users update the schema whenever hive sync runs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6362) Hive sync update property and serde when no schema change
[ https://issues.apache.org/jira/browse/HUDI-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6362: --- Assignee: nicolas paris > Hive sync update property and serde when no schema change > - > > Key: HUDI-6362 > URL: https://issues.apache.org/jira/browse/HUDI-6362 > Project: Apache Hudi > Issue Type: New Feature >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > > Currently, hive sync updates the table properties only when there is a > schema change. > A user who wants to modify the table properties has two options: > # recreate the table from scratch > # wait until the schema changes > It would be convenient to let users update the schema whenever hive sync runs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6362) Hive sync update property and serde when no schema change
nicolas paris created HUDI-6362: --- Summary: Hive sync update property and serde when no schema change Key: HUDI-6362 URL: https://issues.apache.org/jira/browse/HUDI-6362 Project: Apache Hudi Issue Type: New Feature Reporter: nicolas paris Currently, hive sync updates the table properties only when there is a schema change. A user who wants to modify the table properties has two options: # recreate the table from scratch # wait until the schema changes It would be convenient to let users update the schema whenever hive sync runs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6072) Fix NPE when upsert merger and null map or array
[ https://issues.apache.org/jira/browse/HUDI-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6072: --- Assignee: nicolas paris > Fix NPE when upsert merger and null map or array > > > Key: HUDI-6072 > URL: https://issues.apache.org/jira/browse/HUDI-6072 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: Danny Chen >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1, 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6349) Merger fails when nested type changes nullability support
[ https://issues.apache.org/jira/browse/HUDI-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6349: --- Assignee: nicolas paris > Merger fails when nested type changes nullability support > - > > Key: HUDI-6349 > URL: https://issues.apache.org/jira/browse/HUDI-6349 > Project: Apache Hudi > Issue Type: Bug >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > > https://github.com/apache/hudi/issues/8920 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6350) AWS Hive sync: allow to enable/disable MDT on athena
nicolas paris created HUDI-6350: --- Summary: AWS Hive sync: allow to enable/disable MDT on athena Key: HUDI-6350 URL: https://issues.apache.org/jira/browse/HUDI-6350 Project: Apache Hudi Issue Type: New Feature Reporter: nicolas paris Athena has a nice (but hidden) feature that leverages the Hudi metadata table instead of listing files on S3. In theory this reduces S3 slow-down trouble (too much listing) and speeds up query planning. It can easily be enabled by adding the table property 'hudi.metadata-listing-enabled'='TRUE'. While this feature really helps on Athena v2, on Athena v3, at the time of writing, something goes very wrong and queries can be 100x slower. See https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
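As an illustration of the table property mentioned in HUDI-6350, the sketch below only builds the DDL string; it does not talk to Athena. The helper name `metadata_listing_statement` is hypothetical, and the `ALTER TABLE ... SET TBLPROPERTIES` form and the `FALSE` value are assumptions to check against the Athena page linked in the issue.

```python
def metadata_listing_statement(table_name, enabled):
    """Build a statement that toggles metadata-table listing for one table."""
    # The property name comes from the issue text; 'FALSE' as the off value is assumed.
    value = "TRUE" if enabled else "FALSE"
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('hudi.metadata-listing-enabled'='{value}')"
    )


print(metadata_listing_statement("my_hudi_table", True))
```

Keeping the toggle as an explicit on/off value matches the request in the issue: the feature helps on one engine version and hurts on another, so users need to be able to switch it either way.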
[jira] [Created] (HUDI-6349) Merger fails when nested type changes nullability support
nicolas paris created HUDI-6349: --- Summary: Merger fails when nested type changes nullability support Key: HUDI-6349 URL: https://issues.apache.org/jira/browse/HUDI-6349 Project: Apache Hudi Issue Type: Bug Reporter: nicolas paris https://github.com/apache/hudi/issues/8920 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6231) Hive sync aws to support comments on columns
[ https://issues.apache.org/jira/browse/HUDI-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6231: --- Assignee: nicolas paris > Hive sync aws to support comments on columns > > > Key: HUDI-6231 > URL: https://issues.apache.org/jira/browse/HUDI-6231 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > > So far, only hive sync with the vanilla metastore supports column comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6230) Make hive sync aws support partition indexes
[ https://issues.apache.org/jira/browse/HUDI-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6230: --- Assignee: nicolas paris > Make hive sync aws support partition indexes > > > Key: HUDI-6230 > URL: https://issues.apache.org/jira/browse/HUDI-6230 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > > Glue provides indexing features that greatly speed up partition retrieval. > So far they are not supported. A new hive-sync configuration to activate > the feature, and optionally to specify which partition columns to index, would > be helpful. > Also, this operation should not be tied to table creation time; it > should be possible to activate or deactivate it at will. > > https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6231) Hive sync aws to support comments on columns
nicolas paris created HUDI-6231: --- Summary: Hive sync aws to support comments on columns Key: HUDI-6231 URL: https://issues.apache.org/jira/browse/HUDI-6231 Project: Apache Hudi Issue Type: Improvement Reporter: nicolas paris So far, only hive sync with the vanilla metastore supports column comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6230) Make hive sync aws support partition indexes
nicolas paris created HUDI-6230: --- Summary: Make hive sync aws support partition indexes Key: HUDI-6230 URL: https://issues.apache.org/jira/browse/HUDI-6230 Project: Apache Hudi Issue Type: Improvement Reporter: nicolas paris Glue provides indexing features that greatly speed up partition retrieval. So far they are not supported. A new hive-sync configuration to activate the feature, and optionally to specify which partition columns to index, would be helpful. Also, this operation should not be tied to table creation time; it should be possible to activate or deactivate it at will. https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6226) Leverage parquet bloom filter feature
nicolas paris created HUDI-6226: --- Summary: Leverage parquet bloom filter feature Key: HUDI-6226 URL: https://issues.apache.org/jira/browse/HUDI-6226 Project: Apache Hudi Issue Type: Improvement Reporter: nicolas paris Hudi should support vanilla Parquet bloom filters, because this is a standard optimization supported by every query engine using Parquet 1.12 and above. Moreover, Hudi does not provide such an optimization itself: Hudi bloom filters are not used for select queries and are only useful for update operations. Providing vanilla Parquet bloom filter support in Hudi would add another kind of optimization (alongside z-order and Parquet stats) almost for free. See [https://github.com/apache/hudi/issues/7117] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5533) Table comments not showing up on spark-sql describe
[ https://issues.apache.org/jira/browse/HUDI-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-5533: --- Assignee: nicolas paris > Table comments not showing up on spark-sql describe > --- > > Key: HUDI-5533 > URL: https://issues.apache.org/jira/browse/HUDI-5533 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: Jonathan Vexler >Assignee: nicolas paris >Priority: Minor > Labels: pull-request-available > > If you add a comment to the schema and write to a hudi table, the comment > will show as null when using spark-sql describe on the table. > > User reported issue [https://github.com/apache/hudi/issues/7531] with a very > good reproducible example. The issue presented when I tried the example. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6150) Make hive sync to provide bucketing metadata when index=bucket
[ https://issues.apache.org/jira/browse/HUDI-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-6150: --- Assignee: nicolas paris > Make hive sync to provide bucketing metadata when index=bucket > -- > > Key: HUDI-6150 > URL: https://issues.apache.org/jira/browse/HUDI-6150 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Minor > Fix For: 0.14.0 > > > So far, hive tables are only bucketed when the strategy used is jdbc. > Query engines use bucketing information to improve the query plan. We > could provide bucketing for the other strategies: > * hms > * sql > * AWS glue -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6150) Make hive sync to provide bucketing metadata when index=bucket
nicolas paris created HUDI-6150: --- Summary: Make hive sync to provide bucketing metadata when index=bucket Key: HUDI-6150 URL: https://issues.apache.org/jira/browse/HUDI-6150 Project: Apache Hudi Issue Type: Improvement Reporter: nicolas paris So far, hive tables are only bucketed when the strategy used is jdbc. Query engines use bucketing information to improve the query plan. We could provide bucketing for the other strategies: * hms * sql * AWS glue -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6061) NPE with nullable MapType and new hudi merger
nicolas paris created HUDI-6061: --- Summary: NPE with nullable MapType and new hudi merger Key: HUDI-6061 URL: https://issues.apache.org/jira/browse/HUDI-6061 Project: Apache Hudi Issue Type: Bug Components: core Reporter: nicolas paris Fix For: 0.13.1 In 0.13.0, an upsert with the new Hudi merger API raises a null pointer exception when it deals with null map values. AFAIK, it happens when both MapTypes contain nulls in different ways. See the issue (https://github.com/apache/hudi/issues/8431) and the PR (https://github.com/apache/hudi/pull/8432) for details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4995) Depency conflicts on apache http with other projects
[ https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4995: Priority: Minor (was: Major) > Depency conflicts on apache http with other projects > > > Key: HUDI-4995 > URL: https://issues.apache.org/jira/browse/HUDI-4995 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Priority: Minor > Fix For: 0.12.1 > > > Hudi imports org.apache.http, which can collide with other libraries such as the > elasticsearch client. As a result, the spark-bundle causes conflicts when > both libraries are used together. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4995) Depency conflicts on apache http with other projects
[ https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4995: Fix Version/s: 0.12.1 > Depency conflicts on apache http with other projects > > > Key: HUDI-4995 > URL: https://issues.apache.org/jira/browse/HUDI-4995 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Priority: Major > Fix For: 0.12.1 > > > Hudi imports org.apache.http, which can collide with other libraries such as the > elasticsearch client. As a result, the spark-bundle causes conflicts when > both libraries are used together. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4995) Depency conflicts on apache http with other projects
nicolas paris created HUDI-4995: --- Summary: Depency conflicts on apache http with other projects Key: HUDI-4995 URL: https://issues.apache.org/jira/browse/HUDI-4995 Project: Apache Hudi Issue Type: Improvement Reporter: nicolas paris Hudi imports org.apache.http, which can collide with other libraries such as the elasticsearch client. As a result, the spark-bundle causes conflicts when both libraries are used together. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4781) Allow omit metadata fields for hive sync
[ https://issues.apache.org/jira/browse/HUDI-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4781: Fix Version/s: 0.13.0 (was: 0.12.1) > Allow omit metadata fields for hive sync > > > Key: HUDI-4781 > URL: https://issues.apache.org/jira/browse/HUDI-4781 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: nicolas paris >Priority: Minor > Labels: hudi-on-call, pull-request-available > Fix For: 0.13.0 > > > When true, this won't create the metadata fields in the hive table, and hides > them from end users -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4781) Allow omit metadata fields for hive sync
[ https://issues.apache.org/jira/browse/HUDI-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4781: Fix Version/s: 0.12.1 (was: 0.13.0) > Allow omit metadata fields for hive sync > > > Key: HUDI-4781 > URL: https://issues.apache.org/jira/browse/HUDI-4781 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: nicolas paris >Priority: Minor > Labels: hudi-on-call, pull-request-available > Fix For: 0.12.1 > > > When true, this won't create the metadata fields in the hive table, and hides > them from end users -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4764) AwsglueSync turn already exist error into warning
[ https://issues.apache.org/jira/browse/HUDI-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4764: Fix Version/s: 0.12.1 > AwsglueSync turn already exist error into warning > - > > Key: HUDI-4764 > URL: https://issues.apache.org/jira/browse/HUDI-4764 > Project: Apache Hudi > Issue Type: Bug >Reporter: nicolas paris >Priority: Major > Fix For: 0.12.1 > > > Under some conditions (OCC?), the AWSGlueCatalogSyncClient fails with an already-exists > exception for a partition. In any case, an existing partition > should not fail the sync; it should only raise a warning. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-4792) Speed up cleaning with metadata table enabled
[ https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris reassigned HUDI-4792: --- Assignee: nicolas paris > Speed up cleaning with metadata table enabled > -- > > Key: HUDI-4792 > URL: https://issues.apache.org/jira/browse/HUDI-4792 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > > Currently, fetching the file groups to be deleted is parallelized over each > partition. As a result, with many partitions, many calls are made to the > metadata. While this is fine for the file system view, it is highly inefficient > with the metadata table (MDT) view. Each call likely triggers a MoR merge > on the MDT, and with thousands of partitions the process is > incredibly slow. > I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on > a 40k-partition Hudi table: > * w/ MDT: 5 hours > * w/o MDT: 5 minutes > This slowness makes the MDT impractical with many > partitions, because cleaning is a must-have. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4792) Speed up cleaning with metadata table enabled
[ https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4792: Description: Currently, fetching the file groups to be deleted is parallelized over each partition. As a result, with many partitions, many calls are made to the metadata. While this is fine for the file system view, it is highly inefficient with the metadata table (MDT) view. Each call likely triggers a MoR merge on the MDT, and with thousands of partitions the process is incredibly slow. I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on a 40k-partition Hudi table: * w/ MDT: 5 hours * w/o MDT: 5 minutes This slowness makes the MDT impractical with many partitions, because cleaning is a must-have. was: Currently, fetching the file groups to be deleted is parallelized over each partition. As a result, with many partitions, many calls are made to the metadata. While this is fine for the file system view, it is highly inefficient with the metadata table (MDT) view. Each call likely triggers a MoR merge on the MDT, and with thousands of partitions the process is incredibly slow. I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on a 40k-partition Hudi table: * w/ MDT: 5 hours * w/o MDT: 5 minutes > Speed up cleaning with metadata table enabled > -- > > Key: HUDI-4792 > URL: https://issues.apache.org/jira/browse/HUDI-4792 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Priority: Major > Labels: pull-request-available > > Currently, fetching the file groups to be deleted is parallelized over each > partition. As a result, with many partitions, many calls are made to the > metadata. While this is fine for the file system view, it is highly inefficient > with the metadata table (MDT) view. Each call likely triggers a MoR merge > on the MDT, and with thousands of partitions the process is > incredibly slow. > I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on > a 40k-partition Hudi table: > * w/ MDT: 5 hours > * w/o MDT: 5 minutes > This slowness makes the MDT impractical with many > partitions, because cleaning is a must-have. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4792) Speed up cleaning with metadata table enabled
[ https://issues.apache.org/jira/browse/HUDI-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4792: Description: Currently, fetching the file groups to be deleted is parallelized over each partition. As a result, with many partitions, many calls are made to the metadata. While this is fine for the file system view, it is highly inefficient with the metadata table (MDT) view. Each call likely triggers a MoR merge on the MDT, and with thousands of partitions the process is incredibly slow. I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on a 40k-partition Hudi table: * w/ MDT: 5 hours * w/o MDT: 5 minutes was: Currently, fetching the file groups to be deleted is parallelized over each partition. As a result, with many partitions, many calls are made to the metadata. While this is fine for the file system view, it is highly inefficient with the metadata table (MDT) view. Each call likely triggers a MoR merge on the MDT, and with thousands of partitions the process is incredibly slow. I benchmarked cleaning on the same table w/ and w/o MDT on a 40k-partition Hudi table: * w/ MDT: 5 hours * w/o MDT: 5 minutes > Speed up cleaning with metadata table enabled > -- > > Key: HUDI-4792 > URL: https://issues.apache.org/jira/browse/HUDI-4792 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Priority: Major > Labels: pull-request-available > > Currently, fetching the file groups to be deleted is parallelized over each > partition. As a result, with many partitions, many calls are made to the > metadata. While this is fine for the file system view, it is highly inefficient > with the metadata table (MDT) view. Each call likely triggers a MoR merge > on the MDT, and with thousands of partitions the process is > incredibly slow. > I benchmarked (non-incremental) cleaning on the same table w/ and w/o MDT on > a 40k-partition Hudi table: > * w/ MDT: 5 hours > * w/o MDT: 5 minutes -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4792) Speed up cleaning with metadata table enabled
nicolas paris created HUDI-4792:
-----------------------------------

             Summary: Speed up cleaning with metadata table enabled
                 Key: HUDI-4792
                 URL: https://issues.apache.org/jira/browse/HUDI-4792
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: nicolas paris

Currently, fetching the file groups to be deleted is parallelized over partitions. As a result, with many partitions, many calls are made on the metadata. While this is OK for the file system view, it is highly inefficient with the metadata table (MDT) view. Likely each call triggers a merge-on-read (MoR) on the MDT, and with thousands of partitions the process is incredibly slow.

I benchmarked cleaning on the same table w/ and w/o MDT on a Hudi table with 40k partitions:
 * w/ MDT: 5 hours
 * w/o MDT: 5 minutes
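To make the cost described in HUDI-4792 concrete, here is a minimal sketch that counts metadata lookups under two strategies: one lookup per partition versus one lookup per batch of partitions. Everything in it is an assumption for illustration: `MetadataView`, `fileGroupsFor`, `fileGroupsForAll` and the batch size are hypothetical names, not Hudi APIs, and batching is only one possible direction, not necessarily what the linked pull request does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CleanLookupSketch {

    // Hypothetical stand-in for a metadata-backed view; NOT a Hudi class.
    static class MetadataView {
        private final Map<String, List<String>> fileGroupsByPartition;
        private int lookups = 0; // number of round trips made to the metadata

        MetadataView(Map<String, List<String>> fileGroupsByPartition) {
            this.fileGroupsByPartition = fileGroupsByPartition;
        }

        // One metadata round trip for a single partition.
        List<String> fileGroupsFor(String partition) {
            lookups++;
            return fileGroupsByPartition.getOrDefault(partition, Collections.emptyList());
        }

        // One metadata round trip for a whole batch of partitions.
        Map<String, List<String>> fileGroupsForAll(List<String> partitions) {
            lookups++;
            Map<String, List<String>> result = new HashMap<>();
            for (String partition : partitions) {
                result.put(partition,
                    fileGroupsByPartition.getOrDefault(partition, Collections.emptyList()));
            }
            return result;
        }

        int lookups() {
            return lookups;
        }
    }

    // Strategy described in the issue: one call per partition.
    static int lookupsPerPartition(Map<String, List<String>> table, List<String> partitions) {
        MetadataView view = new MetadataView(table);
        for (String partition : partitions) {
            view.fileGroupsFor(partition);
        }
        return view.lookups();
    }

    // Assumed alternative: group partitions and make one call per group.
    static int lookupsBatched(Map<String, List<String>> table, List<String> partitions, int batchSize) {
        MetadataView view = new MetadataView(table);
        for (int start = 0; start < partitions.size(); start += batchSize) {
            int end = Math.min(start + batchSize, partitions.size());
            view.fileGroupsForAll(partitions.subList(start, end));
        }
        return view.lookups();
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new HashMap<>();
        for (int i = 0; i < 40_000; i++) {
            table.put("partition=" + i, Collections.singletonList("file-group-" + i));
        }
        List<String> partitions = new ArrayList<>(table.keySet());
        System.out.println("per-partition lookups: " + lookupsPerPartition(table, partitions));
        System.out.println("batched lookups: " + lookupsBatched(table, partitions, 1_000));
    }
}
```

The sketch only counts calls; it says nothing about real timings, which depend on how expensive each metadata table read is.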
[jira] [Created] (HUDI-4764) AwsglueSync turn already exist error into warning
nicolas paris created HUDI-4764:
-----------------------------------

             Summary: AwsglueSync turn already exist error into warning
                 Key: HUDI-4764
                 URL: https://issues.apache.org/jira/browse/HUDI-4764
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: nicolas paris

Under some conditions (OCC?) the AWSGlueCatalogSyncClient fails with an "already exists" exception for a partition. In any case, a partition that already exists should not make the sync fail; it should only raise a warning.
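The behaviour requested in HUDI-4764 can be sketched as follows: catch the "already exists" error per partition, log a warning, and carry on. This is a sketch under stated assumptions, not the actual sync client: `PartitionCatalog` and `PartitionAlreadyExistsException` are hypothetical stand-ins and do not correspond to AWS Glue SDK types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

class TolerantPartitionSync {
    private static final Logger LOG = Logger.getLogger(TolerantPartitionSync.class.getName());

    // Hypothetical exception standing in for the catalog's "already exists" error.
    static class PartitionAlreadyExistsException extends RuntimeException {
        PartitionAlreadyExistsException(String partition) {
            super("Partition already exists: " + partition);
        }
    }

    // Hypothetical minimal catalog interface; NOT the AWS Glue SDK.
    interface PartitionCatalog {
        void createPartition(String table, String partition);
    }

    // Creates each partition; an existing partition is logged and skipped
    // instead of failing the whole sync. Returns the number skipped.
    static int addPartitions(PartitionCatalog catalog, String table, List<String> partitions) {
        int skipped = 0;
        for (String partition : partitions) {
            try {
                catalog.createPartition(table, partition);
            } catch (PartitionAlreadyExistsException e) {
                LOG.warning("Partition " + partition + " already exists in " + table + ", skipping");
                skipped++;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // In-memory catalog that already contains one partition, as if
        // another writer had created it first.
        Set<String> existing = new HashSet<>(Arrays.asList("dt=2022-09-01"));
        PartitionCatalog catalog = (table, partition) -> {
            if (!existing.add(partition)) {
                throw new PartitionAlreadyExistsException(partition);
            }
        };
        int skipped = addPartitions(catalog, "db.tbl", Arrays.asList("dt=2022-09-01", "dt=2022-09-02"));
        System.out.println("skipped: " + skipped);
    }
}
```

Only the "already exists" case is swallowed here; any other exception still propagates and fails the sync.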
[jira] [Created] (HUDI-4763) Allow hoodie read client to choose index
nicolas paris created HUDI-4763:
-----------------------------------

             Summary: Allow hoodie read client to choose index
                 Key: HUDI-4763
                 URL: https://issues.apache.org/jira/browse/HUDI-4763
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: nicolas paris

Currently the HoodieReadClient has a hardcoded bloom index. We should allow choosing the index, e.g. GLOBAL_BLOOM.
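As an illustration of the request in HUDI-4763, the sketch below resolves the index kind from caller-supplied options and falls back to the bloom index when nothing is set. The enum `IndexKind` and the option key `example.read.index.kind` are hypothetical, invented for this example; they are not Hudi configuration names or classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ReadIndexChoiceSketch {

    // Hypothetical subset of index kinds, named after the two the issue mentions.
    enum IndexKind { BLOOM, GLOBAL_BLOOM }

    // Hypothetical option key; NOT a real Hudi configuration name.
    static final String INDEX_KIND_KEY = "example.read.index.kind";

    // Returns the configured index kind, defaulting to BLOOM (the value the
    // issue describes as hardcoded today). Unknown names raise
    // IllegalArgumentException via Enum.valueOf.
    static IndexKind resolve(Map<String, String> options) {
        String raw = options.get(INDEX_KIND_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return IndexKind.BLOOM;
        }
        return IndexKind.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        System.out.println(resolve(Collections.<String, String>emptyMap()));
        options.put(INDEX_KIND_KEY, "global_bloom");
        System.out.println(resolve(options));
    }
}
```

Keeping BLOOM as the default means existing callers see no behaviour change unless they opt in.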
[jira] [Created] (HUDI-4762) Hive sync update schema removes columns
nicolas paris created HUDI-4762:
-----------------------------------

             Summary: Hive sync update schema removes columns
                 Key: HUDI-4762
                 URL: https://issues.apache.org/jira/browse/HUDI-4762
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: nicolas paris

Currently, when a Hudi table is moved from schema1 to schema2 and data is then inserted with the old schema1, schema2 is kept for the whole table. This is not consistent with the Hive metastore, whose schema gets updated back to the old schema1. The metastore schema update should be skipped when the input is only missing columns.
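The rule proposed in HUDI-4762 (skip the metastore update when the incoming batch is only missing columns) can be sketched as a subset check. This is a simplified illustration, not the hive-sync implementation: schemas are modelled as plain column-name to type-name maps, and `shouldUpdateMetastore` is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SchemaSyncGuardSketch {

    // Returns true only when the incoming schema brings something the
    // metastore does not already have: a new column or a changed type.
    // If the incoming columns are a subset of the metastore columns
    // (i.e. the input is merely missing columns), no update is needed.
    static boolean shouldUpdateMetastore(Map<String, String> metastoreSchema,
                                         Map<String, String> incomingSchema) {
        for (Map.Entry<String, String> column : incomingSchema.entrySet()) {
            String knownType = metastoreSchema.get(column.getKey());
            if (knownType == null || !knownType.equalsIgnoreCase(column.getValue())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> schema1 = new LinkedHashMap<>();
        schema1.put("id", "int");
        schema1.put("name", "string");

        Map<String, String> schema2 = new LinkedHashMap<>(schema1);
        schema2.put("price", "double");

        // Metastore already at schema2, batch written with the old schema1.
        System.out.println(shouldUpdateMetastore(schema2, schema1));
        // Metastore still at schema1, batch written with the newer schema2.
        System.out.println(shouldUpdateMetastore(schema1, schema2));
    }
}
```

With this check, writing an old-schema batch into a table that has already evolved leaves the metastore at the newer schema, matching what the table itself keeps.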
[jira] [Created] (HUDI-2427) SQL stmt broken with spark 3.1.x
nicolas paris created HUDI-2427:
-----------------------------------

             Summary: SQL stmt broken with spark 3.1.x
                 Key: HUDI-2427
                 URL: https://issues.apache.org/jira/browse/HUDI-2427
             Project: Apache Hudi
          Issue Type: Bug
          Components: Common Core
            Reporter: nicolas paris

In my experiments, the new SQL statement features of Hudi 0.9 do not work with Spark 3.1.x, only with Spark 3.0.x.

Steps to reproduce:

{{spark-3.1.2-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.1.2 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

spark.sql("""
create table h3 using hudi as select 1 as id, 'a1' as name, 10 as price
""")

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.$anonfun$alignOutputFields$6(InsertIntoHoodieTableCommand.scala:152)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignOutputFields(InsertIntoHoodieTableCommand.scala:148)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:95)
  at org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:84)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
  ... 60 elided}}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2426) spark sql extensions breaks read.table from metastore
nicolas paris created HUDI-2426:
-----------------------------------

             Summary: spark sql extensions breaks read.table from metastore
                 Key: HUDI-2426
                 URL: https://issues.apache.org/jira/browse/HUDI-2426
             Project: Apache Hudi
          Issue Type: Bug
          Components: Hive Integration
            Reporter: nicolas paris

Adding the Hudi Spark SQL support breaks the ability to read a Hudi table from the metastore in Spark:

bash-4.2$ ./spark3.0.2/bin/spark-shell --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

scala> spark.table("default.test_hudi_table").show
java.lang.UnsupportedOperationException: Unsupported parseMultipartIdentifier method
  at org.apache.spark.sql.parser.HoodieCommonSqlParser.parseMultipartIdentifier(HoodieCommonSqlParser.scala:65)
  at org.apache.spark.sql.SparkSession.table(SparkSession.scala:581)
  ... 47 elided

Removing the spark.sql.extensions config makes the Hive table readable again from Spark.

This affects at least Spark 3.0.x and 3.1.x.