[ https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liwei resolved HUDI-897.
------------------------
    Resolution: Fixed

> Hudi: support the log-append scenario with better writes and asynchronous compaction
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-897
>                 URL: https://issues.apache.org/jira/browse/HUDI-897
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Compaction, Performance
>   Affects Versions: 0.9.0
>            Reporter: liwei
>            Assignee: liwei
>            Priority: Major
>             Fix For: 0.9.0
>
>         Attachments: image-2020-05-14-19-51-37-938.png,
>                      image-2020-05-14-20-14-59-429.png
>
>
> I. Scenario
> The main business scenarios for a data lake are analysis of databases,
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also targets these three scenarios. [1]
>
> II. Current state of Hudi
> At present, Hudi handles well the scenario where database CDC is
> incrementally written to Hudi, and work is also under way on bulk-loading
> files into Hudi.
> However, there is no good native support for log scenarios (which require
> high-throughput writes, have no updates or deletions, and are mostly
> concerned with small files). Logs can currently be written through inserts
> without deduplication, but those writes still merge on the write side.
> * In copy-on-write mode with "hoodie.parquet.small.file.limit" set to 100MB,
> every small batch spends time merging into existing files, which reduces
> write throughput.
> * This scenario is not suitable for merge-on-read.
> * The actual scenario only needs to write Parquet files in batches at write
> time, and then provide reverse compaction afterwards (similar to Delta Lake).
>
> III. What we can do
>
> 1. On the write side, write every batch straight to a Parquet file, based on
> the snapshot mechanism. Merging stays enabled by default; users can turn off
> the automatic merge for higher write throughput.
> 2. Hudi should support asynchronously merging small Parquet files, like
> Databricks Delta Lake's OPTIMIZE command. [2]
>
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
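
To make the asynchronous small-file merge proposed above more concrete, here is a minimal sketch of how a background job might plan which small Parquet files to rewrite together. It is not Hudi's or Delta Lake's implementation. The function name plan_small_file_merge, the target_file_size parameter, and the 120MB default are assumptions made for illustration; only the 100MB small-file threshold comes from the issue text.

```python
from typing import Dict, List

MB = 1024 * 1024


def plan_small_file_merge(
    file_sizes: Dict[str, int],
    small_file_limit: int = 100 * MB,   # threshold quoted in the issue
    target_file_size: int = 120 * MB,   # hypothetical target size per merged file
) -> List[List[str]]:
    """Group small files into merge groups using first-fit decreasing bin packing.

    Files at or above small_file_limit are left alone. Each returned group
    holds files whose combined size does not exceed target_file_size.
    Groups with a single file are dropped, since rewriting one file alone
    gains nothing.
    """
    # Largest small files first, so big ones claim a group before fillers arrive.
    small = sorted(
        ((size, name) for name, size in file_sizes.items() if size < small_file_limit),
        reverse=True,
    )

    groups: List[List] = []  # each entry is [total_bytes, [file names]]
    for size, name in small:
        for group in groups:
            if group[0] + size <= target_file_size:
                group[0] += size
                group[1].append(name)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size, [name]])

    return [names for _total, names in groups if len(names) > 1]


if __name__ == "__main__":
    # Hypothetical file listing for one partition.
    listing = {
        "batch-001.parquet": 10 * MB,
        "batch-002.parquet": 20 * MB,
        "batch-003.parquet": 90 * MB,
        "base-000.parquet": 110 * MB,
        "batch-004.parquet": 5 * MB,
    }
    for group in plan_small_file_merge(listing):
        print("merge together:", group)
```

Because the plan is computed from a file listing only, a job like this could run separately from the writer, which then appends each batch as a new file without paying the merge cost inline.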