+1 overall

On Sat, Mar 19, 2022 at 5:02 PM Surya Prasanna <[email protected]>
wrote:

> Hi Sagar,
> Sorry for the delay in response. Thanks for the questions.
>
> 1. Trying to understand the main goal. Is it to balance the tradeoff
> between read and write amplification for metadata table? Or is it purely to
> optimize for reads?
> > On large tables, write amplification is a side effect of frequent
> compactions. So, instead of increasing the frequency of full compaction, we
> are proposing a minor compaction (LogCompaction) that is run frequently to
> merge only the log blocks and write out a new log block. With the blocks
> merged, there are fewer blocks to deal with during reads; that way we
> optimize read performance and avoid the write amplification problem.
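The block merge described above can be sketched as follows. This is a minimal illustration only, not Hudi's actual API; the function name and dict-based block representation are assumptions made for the example.

```python
# Hypothetical sketch (not Hudi code): LogCompaction-style merge of several
# log blocks into one, keeping the latest payload per record key.

def log_compact(blocks):
    """Merge log blocks (oldest first) into a single block.

    Each block maps record key -> payload; later blocks override
    earlier ones for the same key, mimicking commit ordering.
    """
    merged = {}
    for block in blocks:      # replay blocks in commit order
        merged.update(block)  # newer payloads win per key
    return merged

# Three log blocks from three delta commits become one block,
# so a reader opens one file handle instead of three.
blocks = [{"k1": 1, "k2": 2}, {"k2": 20}, {"k3": 3}]
print(log_compact(blocks))  # {'k1': 1, 'k2': 20, 'k3': 3}
```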
>
> 2. Why do we need a separate action? Why can't any of the existing
> compaction strategies (or a new one if needed) help to achieve this?
> > A new compaction strategy could be added, but we thought it might
> complicate the existing logic and rely on some hacks, especially since the
> Compaction action writes to a base file and places a .commit file upon
> completion. In our use case we are not concerned with the base file at
> all; instead we merge log blocks and write the result back to a log file.
> So we thought it is better to use a new action (called LogCompaction) that
> works at the log file level and writes back to the log file. Since log
> files are in general added by a deltacommit, upon completion LogCompaction
> can place a .deltacommit.
>
> 3. Is the proposed LogCompaction a replacement for regular compaction for
> metadata table i.e. if LogCompaction is enabled then compaction cannot be
> done?
> > LogCompaction is not a replacement for regular compaction. LogCompaction
> is performed as a minor compaction to reduce the number of log blocks to
> consider. It does not consider base files while merging the log blocks. To
> merge log files with the base file, the Compaction action is still needed.
> By running the LogCompaction action frequently, the frequency with which
> we do full-scale compaction is reduced.
> Consider a scenario in which, after 'X' LogCompaction actions, the log
> file size for some file groups becomes comparable to the base file size.
> In this scenario a LogCompaction action would take close to the same
> amount of time as a Compaction action, so full-scale Compaction should be
> performed on those file groups instead. In the future we can also
> introduce logic to determine the right action (Compaction or
> LogCompaction) to perform depending on the state of the file group.
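The selection logic hinted at above could look roughly like this. The threshold, function name, and parameters are purely illustrative assumptions, not anything from Hudi.

```python
# Illustrative sketch only (names and the 0.8 ratio are assumptions, not
# Hudi code): schedule full Compaction for a file group once its
# accumulated log size is comparable to the base file size; otherwise a
# cheap minor LogCompaction suffices.

def choose_action(base_file_size, total_log_size, ratio=0.8):
    """Return which action to schedule for a file group."""
    if base_file_size > 0 and total_log_size >= ratio * base_file_size:
        return "COMPACTION"       # merging with the base file pays off now
    return "LOG_COMPACTION"       # minor merge of log blocks only

print(choose_action(1000, 100))   # LOG_COMPACTION
print(choose_action(1000, 900))   # COMPACTION
```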
>
> Thanks,
> Surya
>
>
> On Fri, Mar 18, 2022 at 11:22 PM Surya Prasanna Yalla <[email protected]>
> wrote:
>
> >
> >
> > ---------- Forwarded message ---------
> > From: sagar sumit <[email protected]>
> > Date: Wed, Mar 16, 2022 at 11:17 PM
> > Subject: Re: [DISCUSS] New RFC to create LogCompaction action for MOR
> > tables?
> > To: <[email protected]>
> >
> >
> > Hi Surya,
> >
> > This is a very interesting idea! I'll be looking forward to the RFC.
> >
> > I have a few high-level questions:
> >
> > 1. Trying to understand the main goal. Is it to balance the tradeoff
> > between read and write amplification for metadata table? Or is it
> > purely to optimize for reads?
> > 2. Why do we need a separate action? Why can't any of the existing
> > compaction strategies (or a new one if needed) help to achieve this?
> > 3. Is the proposed LogCompaction a replacement for regular compaction for
> > metadata table i.e. if LogCompaction is enabled then compaction cannot be
> > done?
> >
> > Regards,
> > Sagar
> >
> > On Thu, Mar 17, 2022 at 12:51 AM Surya Prasanna <
> > [email protected]>
> > wrote:
> >
> > > Hi Team,
> > >
> > >
> > > Record level index uses a metadata table, which is a MOR table.
> > >
> > > Each delta commit in the metadata table creates multiple hfile log
> > > blocks, so reading them requires opening multiple file handles, which
> > > might hurt read performance. To improve read performance, compaction
> > > can be run frequently, which merges all the log blocks into the base
> > > file and creates another version of the base file. If this is done
> > > frequently, it would cause write amplification.
> > >
> > > Instead of merging all the log blocks into the base file and doing a
> > > full compaction, a minor compaction can be done, which merges the log
> > > blocks and creates one new log block.
> > >
> > > This can be achieved by adding a new action to Hudi called
> > > LogCompaction, and it requires an RFC. Please let me know what you
> > > think.
> > >
> > >
> > > Thanks,
> > >
> > > Surya
> > >
> >
>
