[ 
https://issues.apache.org/jira/browse/HUDI-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691819#comment-17691819
 ] 

sivabalan narayanan commented on HUDI-5828:
-------------------------------------------

As per our quick start guide, we have 5 configs that are required:

1. shuffle parallelism
2. record key
3. partition path
4. precombine
5. table name

 

1: With 0.13.0, this has already been relaxed and is no longer a mandatory 
field. It wasn't strictly mandatory even before, but with 0.13.0, the shuffle 
parallelism is dynamically derived from the incoming df. 

2: With auto-generation of record keys supported, we should be able to relax 
this constraint. 

3: We are adding support to infer the partition path from the incoming df with 
https://issues.apache.org/jira/browse/HUDI-5796, so that's taken care of. But 
some follow-up is still required: for a non-partitioned dataset, we need to 
detect that the incoming df is non-partitioned and choose NonPartitioned as the 
key gen class. If not, the default key gen class is SimpleKeyGen. This might 
work w/o any additional fixes for a simple partition path. 
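The NonPartitioned-vs-Simple fallback above can be sketched as a small resolution step (a sketch only: the key generator class names mirror Hudi's, but the helper function itself is hypothetical):

```python
# Sketch: choose a key generator class based on whether partition
# columns were inferred from the incoming df. Helper name is
# hypothetical; class names mirror Hudi's key generators.
SIMPLE_KEY_GEN = "org.apache.hudi.keygen.SimpleKeyGenerator"
NON_PARTITIONED_KEY_GEN = "org.apache.hudi.keygen.NonpartitionedKeyGenerator"

def infer_key_generator(partition_fields):
    """partition_fields: list of partition columns inferred from the df."""
    if not partition_fields:
        # Incoming df looks non-partitioned -> NonPartitioned key gen.
        return NON_PARTITIONED_KEY_GEN
    # Default for a simple partition path.
    return SIMPLE_KEY_GEN
```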
4: preCombine: this is already an optional field and users don't need to supply 
it. 

5: table name: This is somewhat tricky. 

We can auto-generate a hudi table name, but when hive sync is enabled, we 
should not generate it automatically. With external metastores, no two tables 
can have the same name and table names should be meaningful, so we can't rely 
on auto-generation there. Otherwise, the table names would end up as 
hudi_12313, hudi_e5e44, hudi_45sadf, etc. So, here is what we can do.

 

User flow1: 

For a user who uses just the spark datasource to write and read: 

a. Auto-generate hoodie.table.name if the user does not supply one. The 
auto-generated table name will get serialized into hoodie.properties. 

 

User flow2: 

User who writes via spark and syncs to hive on every commit. 

The user does not need to supply hoodie.table.name, but is expected to set an 
explicit value for "hoodie.datasource.hive_sync.table". The auto-generated 
table name will still get serialized into hoodie.properties, but for hive sync 
purposes, we will use what the user explicitly set for the corresponding config. 
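The precedence described here can be sketched as a small resolution function (the config keys are Hudi's; the helper itself and the fallback to hoodie.table.name are assumptions for illustration):

```python
def resolve_hive_sync_table(opts):
    """Pick the table name used when syncing to hive.

    opts: dict of write options. An explicitly set
    "hoodie.datasource.hive_sync.table" always wins; otherwise fall
    back to hoodie.table.name (possibly auto-generated). The fallback
    behavior is an assumption for illustration.
    """
    explicit = opts.get("hoodie.datasource.hive_sync.table")
    if explicit:
        return explicit
    return opts["hoodie.table.name"]
```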

 

User flow3:

Similar to flow 2, but the user writes via spark and syncs to hive in a 
standalone manner, not with every write. 

Regular writes will proceed as usual, where we will generate the hudi table 
name automatically on the first write. 

When syncing to the external metastore, the user has to explicitly set a value 
for "hoodie.datasource.hive_sync.table". 

 

Format for the auto-generated hoodie table name: 

hoodie_table_{ts}_{randomInt}

where ts is the current timestamp, and the random integer accommodates any 
concurrent writers. 
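Assuming epoch millis for ts (the exact timestamp format is not pinned down above), the generation could look like this sketch:

```python
import random
import time

def auto_generate_table_name():
    # ts: current timestamp (epoch millis assumed here); the random
    # integer guards against concurrent writers producing the same name.
    ts = int(time.time() * 1000)
    rand = random.randint(0, 99999)
    return f"hoodie_table_{ts}_{rand}"
```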

 

 

Summary:

So, putting all of these together, here is where we will stand:

{code:java}
df.write.format("hudi").option("hoodie.datasource.write.recordkey.autogen", "true").save(path)
{code}

 

Special handling:

We could simplify even further if need be. 

We can detect that the user has not provided any configs (0 user-supplied 
configs), and in such cases choose the default value of 
"hoodie.datasource.write.recordkey.autogen" as true and proceed instead of 
failing. This is somewhat analogous to how we might set the default key gen 
type to Simple or NonPartitioned. 
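A sketch of that zero-config detection (the config key is from this proposal; the helper function is hypothetical):

```python
def apply_zero_config_defaults(user_opts):
    # If the user supplied no write options at all, default record key
    # auto-generation to "true" and proceed instead of failing.
    opts = dict(user_opts)
    if not user_opts:
        opts["hoodie.datasource.write.recordkey.autogen"] = "true"
    return opts
```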

 

 

 

 

> Support df.write.forma("hudi") with out any additional options
> --------------------------------------------------------------
>
>                 Key: HUDI-5828
>                 URL: https://issues.apache.org/jira/browse/HUDI-5828
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: sivabalan narayanan
>            Priority: Major
>
> Wrt simplifying the usage of hudi among more users, we should try to see if 
> we can support writing to hudi w/o any options during write. 
>  
> For eg, we can do the following with parquet writes. 
> {code:java}
> df.write.format("parquet").save(path)
> {code}
>  
> So, for a non-partitioned dataset, we should try if we can support this 
> usability. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
