[jira] [Updated] (HUDI-4071) Better Spark Datasource default configs

Ethan Guo (Jira) Wed, 15 Feb 2023 23:22:08 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ethan Guo updated HUDI-4071:
----------------------------
    Fix Version/s: 0.13.0

> Better Spark Datasource default configs
> ---------------------------------------
>
>                 Key: HUDI-4071
>                 URL: https://issues.apache.org/jira/browse/HUDI-4071
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.0, 0.13.0
>
>
> The default configs should meet the following goals:
>  # Be optimized for insert/bulk_insert. For example, with NONE as the default 
> sort mode, a write is about as good as a plain Parquet write plus some 
> additional work for the meta columns. An extension of this is to keep a map 
> of minimal optimized configs per operation type. This is partly related to 
> the better-performing configs tracked in HUDI-2151.
>  # Make reasonable assumptions. For example, for the index type, the bloom 
> filter does not rely on any external system, so it is a better default 
> candidate than, say, the HBase index.
>  # Scout all configs declared with noDefaultValue and assign a default where 
> necessary.
>  # Keep the spark-sql and Spark datasource config keys the same as much as 
> possible; otherwise they are operationally difficult for the user. Rename or 
> reuse existing datasource keys that are meant for the same purpose. This is 
> also related to HUDI-4070.
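
The "map of minimal optimized configs per operation type" mentioned in the first item could look roughly like the sketch below. This is only an illustration, not code from the issue or from Hudi: the names PER_OPERATION_DEFAULTS and resolve_write_options are hypothetical, and the values chosen per operation are assumptions rather than Hudi's shipped defaults. The option keys themselves (hoodie.datasource.write.operation, hoodie.bulkinsert.sort.mode, hoodie.index.type) are existing Hudi write options.

```python
# Hypothetical per-operation defaults. The keys are real Hudi write options;
# the values are illustrative assumptions, not Hudi's actual defaults.
PER_OPERATION_DEFAULTS = {
    # NONE sort mode keeps bulk_insert close to a plain Parquet write.
    "bulk_insert": {"hoodie.bulkinsert.sort.mode": "NONE"},
    "insert": {},
    # The bloom index needs no external system, unlike an HBase index.
    "upsert": {"hoodie.index.type": "BLOOM"},
}


def resolve_write_options(operation, user_options=None):
    """Merge per-operation defaults with user options; user options win."""
    if operation not in PER_OPERATION_DEFAULTS:
        raise ValueError(f"unsupported write operation: {operation}")
    options = {"hoodie.datasource.write.operation": operation}
    options.update(PER_OPERATION_DEFAULTS[operation])
    options.update(user_options or {})
    return options


if __name__ == "__main__":
    # Defaults only.
    print(resolve_write_options("bulk_insert"))
    # An explicit user setting overrides the per-operation default.
    print(resolve_write_options(
        "bulk_insert", {"hoodie.bulkinsert.sort.mode": "GLOBAL_SORT"}))
```

With PySpark, the resulting dict could then be handed to the writer, for example df.write.format("hudi").options(**opts), alongside the table name and key fields that any Hudi write needs. Letting user options override the map keeps the defaults minimal without taking control away from the user.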



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
