I have a doubt about the design. I guess this is the right place to discuss it.

I want to understand how compaction interplays with this new scheme.
Let's assume all log blocks are in the new format only. Once compaction
completes, the log blocks/files that were not compacted will still carry range
info pertaining to the compacted ones, right? When will this get fixed? Won't
the lookup return true for keys from the compacted log files? I have
attached two diagrams depicting before and after compaction. If you look at
the 2nd pic (after compaction), ideally min and max should have been 6 and 11.

In general, when will the key range pruning happen? And will the bloom
filter also be adjusted?
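
To make the concern concrete, here is a minimal Python sketch of what I think
happens. It is not Hudi code: LogBlock, append_block and may_contain are
hypothetical names, the keys are made up (only the 6 and 11 come from my
diagram), and I am assuming each new-format block carries a min/max range that
also covers all earlier blocks.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class LogBlock:
    """Hypothetical stand-in for a new-format log block; not Hudi's real class."""
    keys: Set[int]   # record keys physically stored in this block
    min_key: int     # assumed: cumulative min over this and all earlier blocks
    max_key: int     # assumed: cumulative max over this and all earlier blocks


def append_block(log: List[LogBlock], keys: Set[int]) -> None:
    """Append a block whose range also covers every earlier block."""
    lo, hi = min(keys), max(keys)
    if log:
        lo, hi = min(lo, log[-1].min_key), max(hi, log[-1].max_key)
    log.append(LogBlock(keys, lo, hi))


def may_contain(log: List[LogBlock], key: int) -> bool:
    """Range check against the latest block only, as a consolidated index would."""
    return bool(log) and log[-1].min_key <= key <= log[-1].max_key


log: List[LogBlock] = []
append_block(log, {1, 2, 3})    # gets compacted
append_block(log, {4, 5})       # gets compacted
append_block(log, {6, 8, 11})   # not part of the compaction

# After compaction only the last block survives, and it is not rewritten.
survivors = log[2:]

print(may_contain(survivors, 2))                        # True
print((survivors[-1].min_key, survivors[-1].max_key))   # (1, 11)
```

Key 2 now lives only in the compacted base file, yet the range check on the
surviving log block still says it may be present, because the stored range is
(1, 11) rather than (6, 11).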


On Wed, Oct 30, 2019 at 10:09 PM Nishith <[email protected]> wrote:

> Thanks for the detailed design write up Vinoth. I concur with the others
> on option 2, default indexing as off and enable it when we have enough
> confidence on stability & performance. Although, I do think practically it
> might be good to have the code in place for users who might revert to an
> older build as part of some build rollback mechanisms that they may have in
> place (for reasons not even related to hudi). The latest data block
> (denoted by the latest version) being a new block as suggested by Balaji
> sounds like one option - not sure how complicated the code will become
> though...
> Will comment on the RFC about some doubts/concerns regarding first
> migrating customers from canIndexLogFiles = false to true and then rolling
> back, to ensure my understanding is correct.
>
> -Nishith
>
> Sent from my iPhone
>
> > On Oct 30, 2019, at 4:00 PM, Balaji Varadarajan
> <[email protected]> wrote:
> >
> > Thanks Vinoth for proposing a clean and extendable design. The overall
> design looks great. Another rollout option is to only use consolidated log
> index for index lookup if latest "valid" log block has been written in new
> format. If that is not the case, we can revert to scanning previous log
> blocks for index lookup.
> > Balaji.V
> >
> > On Tuesday, October 29, 2019, 07:52:00 PM PDT, Bhavani Sudha
> <[email protected]> wrote:
> >
> > I vote for the second option. Also, it can give time to analyze how to
> > deal with backwards compatibility. I'll take a look at the RFC later
> > tonight and get back.
> >
> >
> >> On Sun, Oct 27, 2019 at 10:24 AM Vinoth Chandar <[email protected]>
> wrote:
> >>
> >> One issue I have some open questions on myself:
> >>
> >> Is it ok to assume the log will have old data block versions, followed by
> >> new data block versions? For e.g., if we roll out new code, then revert
> >> back, then there could be an arbitrary mix of new and old data blocks.
> >> Handling this might make design/code fairly complex. Alternatively, we can
> >> keep it simple for now, disable by default, and only advise enabling for
> >> new tables or when the hudi version is stable.
> >>
> >>
> >>> On Sun, Oct 27, 2019 at 12:13 AM Vinoth Chandar <[email protected]>
> wrote:
> >>>
> >>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/HUDI/RFC-6+Add+indexing+support+to+the+log+file
> >>>
> >>>
> >>> Feedback welcome, on this RFC tackling HUDI-86
> >>>
> >>
>


-- 
Regards,
-Sivabalan
