Hi Wei,

Thanks for starting this thread. I am trying to understand your concern,
which seems to be that for inserts we write parquet files instead of logging
to log files? FWIW, Hudi already supports asynchronous compaction, and there
is a record reader flag that can avoid merging for cases where there are only
inserts.
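
To make that concrete, here is a rough Spark shell (Scala) sketch of an
insert-only write plus a merge-skipping read. inputDf, the table path and the
field names are placeholders, and the option keys (especially the merge-type
one) are from memory, so please double-check them against the docs for the
version you are on:

import org.apache.spark.sql.SaveMode

// Insert-only write: no dedup/upsert, incoming records simply land in new files.
inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "log_events")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("/data/hudi/log_events")

// Read side: skip base/log merging, since the table only ever sees inserts.
val readDf = spark.read
  .format("hudi")
  .option("hoodie.datasource.merge.type", "skip_merge")
  .load("/data/hudi/log_events/*/*")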

The main blocker to sending inserts into log files is the ability to do log
indexing (we wanted to support someone who starts out doing inserts and
suddenly wants to upsert the table). If we can sacrifice that initially, it's
very doable.
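
For concreteness, the insert-then-upsert pattern I mean looks roughly like
this (same placeholder table as in the sketch above); the second write is
where we would need an index over log files, if the first write had gone to
logs instead of parquet:

// Day 1: pure log-append workload, written as inserts.
// (record key / partition / precombine options as in the earlier sketch, omitted here)
eventsDf.write
  .format("hudi")
  .option("hoodie.table.name", "log_events")
  .option("hoodie.datasource.write.operation", "insert")
  .mode(SaveMode.Append)
  .save("/data/hudi/log_events")

// Later: the same table starts receiving corrections. Upsert has to index
// into the existing file groups to locate the records being updated.
correctionsDf.write
  .format("hudi")
  .option("hoodie.table.name", "log_events")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/data/hudi/log_events")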

Will wait for others to chime in as well.

On Thu, May 14, 2020 at 9:06 AM wei li <lw309637...@gmail.com> wrote:

> The main business scenarios for a data lake include analysis of database,
> log, and file data.
>
> At present, Hudi supports the scenario of database CDC being incrementally
> written to Hudi quite well, and bulk loading files into Hudi is also being
> worked on.
>
> However, there is no good native support for the log scenario (which
> requires high-throughput writes, has no updates or deletes, and tends to
> produce many small files). Today logs can be written through inserts
> without deduplication, but small files will still be merged on the write
> side.
>
>    - In copy-on-write mode with "hoodie.parquet.small.file.limit" set to
>    100MB, every batch spends time merging small files, which reduces write
>    throughput.
>    - This scenario is also not a good fit for merge on read.
>    - The actual scenario only needs to write parquet files in batches at
>    write time and then provide asynchronous compaction afterwards (similar
>    to Delta Lake).
>
>
> I created an RFC with more details
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
>
>
> Best Regards,
> Wei Li.
>
>
>
