[ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei resolved HUDI-897.
------------------------
    Resolution: Fixed

> hudi support log append scenario with better write and asynchronous compaction
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-897
>                 URL: https://issues.apache.org/jira/browse/HUDI-897
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Compaction, Performance
>    Affects Versions: 0.9.0
>            Reporter: liwei
>            Assignee: liwei
>            Priority: Major
>             Fix For: 0.9.0
>
>         Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of the data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also targets these three scenarios. [1]
>  
> 2. Hudi's current situation
> At present, hudi supports the scenario where database CDC data is 
> incrementally written to hudi quite well, and bulk-loading files into hudi is 
> also being worked on. However, there is no good native support for log 
> scenarios (high-throughput writes, no updates or deletes, and many small 
> files). Logs can currently be written as inserts without deduplication, but 
> they are still merged on the write path.
>  * In copy-on-write mode, when "hoodie.parquet.small.file.limit" is 100 MB, 
> every small batch spends time merging into existing files, which reduces 
> write throughput.
>  * This scenario is not a good fit for merge-on-read either.
>  * The actual scenario only needs to write parquet files batch by batch on 
> the write path, and then provide compaction afterwards (similar to Delta 
> Lake).
> 3. What we can do
>
>  1. On the write side, write every batch directly to a parquet file based on 
> the snapshot mechanism. Merging stays enabled by default; users can disable 
> the automatic merge for higher write throughput.
>  2. Let hudi support asynchronously merging small parquet files, like 
> Databricks Delta Lake's OPTIMIZE command [2].
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
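>
> As a rough sketch of the write side (option names below are taken from the 
> Hudi write configuration docs, but whether a given release wires them 
> together this way should be verified), the idea is to skip small-file 
> merging on ingest and defer file-size management to an asynchronous service:
>
> {code}
> # hypothetical write config for the log-append path
> hoodie.datasource.write.operation=insert
> # set the small-file limit to 0 so inserts are not merged into
> # existing small files; every batch lands as new parquet files
> hoodie.parquet.small.file.limit=0
> # rewrite/merge the resulting small files later, out of the write path
> hoodie.clustering.async.enabled=true
> {code}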



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
