Thank you for your patience...

This was a thought-provoking RFC. I think we can solve an even more
generalized problem here: data clustering (which we support today only in a
limited form, for bulk_insert).

Please read my comment here..

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction?focusedCommentId=154995482#comment-154995482


A few notable changes I am suggesting:

   - First of all, let's give this action a better name (IMHO): `
   *clustering*` (since it clusters file groups together based on some
   criteria; we will get to those later). We will continue referring to what
   we do today as `*compaction*`.
   - Let's implement this as a "write mode", rather than a new append API.
   I would like to keep things simple: insert, delete, update, like it is
   now. As you will see below, what I am suggesting is a generalization of
   what was proposed in the RFC. If we are going to collapse file groups, we
   might as well do things like sorting (we already support this for
   bulk_insert alone) to speed up queries. Also, a user may want to do this
   clustering without needing to write small files or ingest quickly at all.
   - We should assume that we will cluster N input file groups into M
   output file groups, not just 1 output file group. Say we target a file
   size of 256MB; it might turn out that all your accumulated small groups
   add up to about 450MB, requiring two output file groups instead of one
   (this introduces a few limitations as we will see). A minimal bin-packing
   sketch of this idea is below.
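
To make the N-to-M grouping concrete, here is a minimal, hypothetical sketch
(plain Java, not Hudi code; names like FileGroupInfo and planClustering are
placeholders I made up) of greedily bin-packing small file groups into output
groups near a target size:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only -- not actual Hudi classes or APIs.
class ClusteringPlanSketch {

  // Target size for each clustered output file group (e.g. 256 MB).
  static final long TARGET_BYTES = 256L * 1024 * 1024;

  // Minimal stand-in for an input small file group: its id and size in bytes.
  static class FileGroupInfo {
    final String fileGroupId;
    final long sizeBytes;
    FileGroupInfo(String fileGroupId, long sizeBytes) {
      this.fileGroupId = fileGroupId;
      this.sizeBytes = sizeBytes;
    }
  }

  // Greedily packs N small file groups into M bins; each bin would become
  // one clustered output file group close to the target size.
  static List<List<FileGroupInfo>> planClustering(List<FileGroupInfo> smallGroups) {
    List<List<FileGroupInfo>> bins = new ArrayList<>();
    List<FileGroupInfo> current = new ArrayList<>();
    long currentBytes = 0;
    for (FileGroupInfo fg : smallGroups) {
      if (!current.isEmpty() && currentBytes + fg.sizeBytes > TARGET_BYTES) {
        bins.add(current);              // close out this output file group
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(fg);
      currentBytes += fg.sizeBytes;
    }
    if (!current.isEmpty()) {
      bins.add(current);
    }
    // e.g. ~450MB of accumulated small groups yields two output file groups here
    return bins;
  }
}

With a 256MB target, roughly 450MB of accumulated small groups lands in two
bins, which is why we should plan for M outputs rather than assuming one.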




On Tue, May 19, 2020 at 6:21 PM leesf <leesf0...@gmail.com> wrote:

> +1 from me, also I updated the RFC-19, please take another look when you
> get a chance.
>
> Vinoth Chandar <vin...@apache.org> wrote on Wed, May 20, 2020 at 1:43 AM:
>
> > Bear with me for 1-2 days.. Will circle around on this.. This is a dear
> > topic to me as well :)
> >
> > On Tue, May 19, 2020 at 9:21 AM Shiyan Xu <xu.shiyan.raym...@gmail.com>
> > wrote:
> >
> > > Hi Wei,
> > >
> > > +1 on the proposal; append-only is a commonly seen use case.
> > >
> > > IIUC, the main concern is that Hudi by default generates small files
> > > internally in COW tables, and setting `hoodie.parquet.small.file.limit`
> > > can reduce the number of small files but slows down the pipeline (by
> > > doing compaction).
> > >
> > > Regarding the option you mentioned: when writing to parquet directly,
> > > have you considered setting params for bulk write? It should be possible
> > > to make bulk write bounded by time and size so that you always have a
> > > reasonable size for the output.
> > >
> > > I agree with Vinoth's point:
> > > > The main blocker for us to send inserts into logs, is having the
> > > > ability to do log indexing (we wanted to support someone who may want
> > > > to do inserts and suddenly wants to upsert the table)
> > >
> > > Logs are most of the time append-only. Due to GDPR or other compliance,
> > > we may have to scrub some fields later.
> > > Looks like we may phase the support: phase 1 is to write parquet as log
> > > files; phase 2 is to support upsert on demand. This seems to be a
> > > different table type (neither COW nor MOR; sounds like merge-on-demand?)
> > >
> > >
> > >
> > > On Sun, May 17, 2020 at 10:10 AM wei li <lw309637...@gmail.com> wrote:
> > >
> > > > Thanks, Vinoth Chandar.
> > > > Just like https://issues.apache.org/jira/projects/HUDI/issues/HUDI-112,
> > > > we need a mechanism to solve two issues.
> > > > 1. On the write side: do not compact, for faster writes (merge on read
> > > > can solve this problem today).
> > > > 2. Compaction and read: we also need a mechanism to collapse older,
> > > > smaller files into larger ones while keeping the query cost low (with
> > > > merge on read, if we do not compact, the realtime read will be slow).
> > > >
> > > > We have one option:
> > > > 1. On the write side: just write parquet, no compaction.
> > > > 2. Compaction and read: because the small files are parquet, the
> > > > realtime read can be fast, and the user can also run asynchronous
> > > > compaction to collapse older, smaller parquet files into larger
> > > > parquet files.
> > > >
> > > > Best Regards,
> > > > Wei Li.
> > > >
> > > > On 2020/05/14 16:54:24, Vinoth Chandar <vin...@apache.org> wrote:
> > > > > Hi Wei,
> > > > >
> > > > > Thanks for starting this thread. I am trying to understand your
> > > > > concern - which seems to be that for inserts, we write parquet files
> > > > > instead of logging? FWIW Hudi already supports asynchronous
> > > > > compaction... and a record reader flag that can avoid merging for
> > > > > cases where there are only inserts..
> > > > >
> > > > > The main blocker for us to send inserts into logs, is having the
> > > > > ability to do log indexing (we wanted to support someone who may want
> > > > > to do inserts and suddenly wants to upsert the table).. If we can
> > > > > sacrifice on that initially, it's very doable.
> > > > >
> > > > > Will wait for others to chime in as well.
> > > > >
> > > > > On Thu, May 14, 2020 at 9:06 AM wei li <lw309637...@gmail.com> wrote:
> > > > >
> > > > > > The business scenarios of the data lake mainly include analysis
> > > > > > of databases, logs, and files.
> > > > > >
> > > > > > At present, Hudi supports well the scenario where database CDC
> > > > > > is incrementally written to Hudi, and it also handles bulk-loading
> > > > > > files into Hudi.
> > > > > >
> > > > > > However, there is no good native support for log scenarios (which
> > > > > > require high-throughput writes, no updates or deletes, and involve
> > > > > > many small files); today we can write through inserts without
> > > > > > deduplication, but they will still be merged on the write side.
> > > > > >
> > > > > >    - In copy-on-write mode, with "hoodie.parquet.small.file.limit"
> > > > > >    at 100MB, every small batch will spend some time merging, which
> > > > > >    reduces write throughput.
> > > > > >    - This scenario is not a good fit for merge on read.
> > > > > >    - The actual scenario only needs to write parquet in batches when
> > > > > >    writing, and then run compaction afterwards (similar to Delta
> > > > > >    Lake).
> > > > > >
> > > > > >
> > > > > > I created an RFC with more details:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
> > > > > >
> > > > > >
> > > > > > Best Regards,
> > > > > > Wei Li.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
