Hi Sagar,
Sorry for the delay in response. Thanks for the questions.

1. Trying to understand the main goal. Is it to balance the tradeoff
between read and write amplification for metadata table? Or is it purely to
optimize for reads?
> On large tables, write amplification is a side effect of frequent
compactions. So, instead of increasing the frequency of full compaction, we
are proposing that a minor compaction (LogCompaction) be done frequently to
merge only the log blocks and write a single new log block. With fewer
blocks to deal with on the read path, we optimize read performance while
avoiding the write amplification that frequent full compactions would
cause.
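To make the read-path benefit concrete, here is a small illustrative sketch (plain Python, not Hudi code): each log block costs one file handle on the read path, and a minor compaction collapses N blocks into one while keeping the latest value per key, the way a merge-on-read scan would. The function names and the dict-per-block model are assumptions for illustration only.

```python
# Illustrative model (not Hudi code): read cost before/after a minor
# compaction that merges all log blocks into a single new log block.

def read_cost(num_log_blocks, handle_cost=1):
    """Each log block needs its own file handle on the read path."""
    return num_log_blocks * handle_cost

def log_compact(log_blocks):
    """Merge all log blocks into one; later blocks win per key,
    mirroring how a merge-on-read scan resolves duplicates."""
    merged = {}
    for block in log_blocks:
        merged.update(block)
    return [merged]

# Three log blocks produced by three delta commits:
blocks = [{"k1": 1}, {"k2": 2}, {"k1": 3}]
assert read_cost(len(blocks)) == 3            # three handles before
compacted = log_compact(blocks)
assert read_cost(len(compacted)) == 1         # one handle after
assert compacted[0] == {"k1": 3, "k2": 2}     # latest value per key kept
```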

2. Why do we need a separate action? Why can't any of the existing
compaction strategies (or a new one if needed) help to achieve this?
> A new compaction strategy could be added, but we felt it would
complicate the existing logic and require some hacks, especially
since the Compaction action writes to a base file and places a .commit file
upon completion. In our use case we are not concerned with the
base file at all; instead we merge blocks and write the result back to a
log file. So we think it is better to use a new action (called
LogCompaction), which works at the log file level and writes back to the
log file. Since log files are in general added by a deltacommit,
LogCompaction can place a .deltacommit upon completion.
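The timeline distinction described above can be summarized in a tiny sketch (illustrative only; the instant-file naming here is a simplified assumption, not Hudi's exact format): full Compaction completes with a .commit instant because it produces a base file, while LogCompaction completes with a .deltacommit like any other log-file-producing write.

```python
# Illustrative sketch (assumed, simplified instant naming): which
# completed-instant file each action leaves on the timeline.

def completed_instant(action, instant_time="20220316103000"):
    suffix = {
        "compaction": ".commit",         # produced a new base file
        "logcompaction": ".deltacommit", # produced a new log block
        "deltacommit": ".deltacommit",   # ordinary log-file write
    }[action]
    return instant_time + suffix

assert completed_instant("compaction").endswith(".commit")
assert completed_instant("logcompaction").endswith(".deltacommit")
```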

3. Is the proposed LogCompaction a replacement for regular compaction for
metadata table i.e. if LogCompaction is enabled then compaction cannot be
done?
> LogCompaction is not a replacement for regular compaction. LogCompaction
is performed as a minor compaction to reduce the number of log blocks that
readers have to consider; it does not touch base files while merging the
log blocks. To merge log files with the base file, the Compaction action is
still needed. By running the LogCompaction action frequently, the frequency
with which we do full-scale compaction is reduced.
Consider a scenario in which, after some number of LogCompaction actions,
the total log file size for some file groups becomes comparable to the base
file size. At that point a LogCompaction would take close to the same
amount of time as a Compaction, so full-scale Compaction should be
performed on those file groups instead of LogCompaction. In the future we
can also introduce logic to determine the right action (Compaction or
LogCompaction) to perform depending on the state of the file group.
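The decision logic at the end of the answer above could look something like this sketch (a hypothetical rule, not existing Hudi logic; the 0.8 threshold is an assumed tuning knob): fall back to full Compaction once the accumulated log size approaches the base file size, otherwise keep doing the cheaper LogCompaction.

```python
# Hypothetical per-file-group scheduling rule (not in Hudi): choose full
# Compaction once total log size nears the base file size, since a
# LogCompaction would then cost about as much anyway.

def choose_action(base_file_size, log_file_sizes, threshold=0.8):
    total_log = sum(log_file_sizes)
    if base_file_size == 0 or total_log >= threshold * base_file_size:
        return "COMPACTION"       # merge logs into a new base file
    return "LOG_COMPACTION"       # merge log blocks into one new log block

assert choose_action(100, [5, 10]) == "LOG_COMPACTION"   # logs still small
assert choose_action(100, [50, 40]) == "COMPACTION"      # logs ~ base size
```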

Thanks,
Surya


On Fri, Mar 18, 2022 at 11:22 PM Surya Prasanna Yalla <[email protected]>
wrote:

>
>
> ---------- Forwarded message ---------
> From: sagar sumit <[email protected]>
> Date: Wed, Mar 16, 2022 at 11:17 PM
> Subject: Re: [DISCUSS] New RFC to create LogCompaction action for MOR
> tables?
> To: <[email protected]>
>
>
> Hi Surya,
>
> This is a very interesting idea! I'll be looking forward to the RFC.
>
> I have a few high-level questions:
>
> 1. Trying to understand the main goal. Is it to balance the tradeoff
> between read and write amplification for metadata table? Or is it purely to
> optimize for reads?
> 2. Why do we need a separate action? Why can't any of the existing
> compaction strategies (or a new one if needed) help to achieve this?
> 3. Is the proposed LogCompaction a replacement for regular compaction for
> metadata table i.e. if LogCompaction is enabled then compaction cannot be
> done?
>
> Regards,
> Sagar
>
> On Thu, Mar 17, 2022 at 12:51 AM Surya Prasanna <
> [email protected]>
> wrote:
>
> > Hi Team,
> >
> >
> > Record level index uses a metadata table which is a MOR table type.
> >
> > Each delta commit in the metadata table creates multiple HFile log
> > blocks, so reading them requires opening multiple file handles, which
> > can hurt read performance. To improve read performance, compaction can
> > be run frequently, which merges all the log blocks into the base file
> > and creates a new version of the base file. Doing this frequently,
> > however, causes write amplification.
> >
> > Instead of merging all the log blocks into the base file with a full
> > compaction, a minor compaction can be done, which merges the log blocks
> > and creates one new log block.
> >
> > This can be achieved by adding a new action to Hudi called
> > LogCompaction, which requires an RFC. Please let me know what you think.
> >
> >
> > Thanks,
> >
> > Surya
> >
>
