I have a doubt about the design, and I guess this is the right place to discuss it. I want to understand how compaction interplays with this new scheme. Let's assume all log blocks are in the new format only. Once compaction completes, the log blocks/files that were not compacted will still carry range info pertaining to the compacted ones, right? When will this get fixed? Won't the lookup return true for keys from the compacted log files? I have attached two diagrams depicting the state before and after compaction. If you look at the second picture (after compaction), ideally min and max should have been 6 and 11.
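
To make the concern concrete, here is a minimal sketch in Python (not Hudi code). The `LogBlock` class, the `live_key_range` and `may_contain` helpers, and the keys 1, 3 and 5 in the first block are all assumptions made up for illustration; only the expected post-compaction range of 6 to 11 comes from the diagrams.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LogBlock:
    """Hypothetical log block that simply holds its record keys."""
    keys: List[int]
    compacted: bool = False  # True once its records were merged into the base file


def live_key_range(blocks: List[LogBlock]) -> Optional[Tuple[int, int]]:
    """Return (min, max) over keys of blocks that are not yet compacted."""
    live = [k for b in blocks if not b.compacted for k in b.keys]
    return (min(live), max(live)) if live else None


def may_contain(key_range: Optional[Tuple[int, int]], key: int) -> bool:
    """Range pruning check: can the key possibly be in the log?"""
    if key_range is None:
        return False
    lo, hi = key_range
    return lo <= key <= hi


# Assumed contents: the first block is the one that gets compacted.
blocks = [LogBlock([1, 3, 5]), LogBlock([6, 8]), LogBlock([9, 11])]

stale_range = live_key_range(blocks)   # range recorded before compaction
blocks[0].compacted = True             # compaction removes the first block
fresh_range = live_key_range(blocks)   # range if it were recomputed afterwards

print(may_contain(stale_range, 3))  # stale range still admits key 3
print(may_contain(fresh_range, 3))  # recomputed range prunes key 3
```

If the range info is not rewritten after compaction, the lookup keeps consulting `stale_range` and reports a possible match for key 3, which now lives only in the base file.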
In general, when will the key range pruning happen? And will the bloom filter also be adjusted?

On Wed, Oct 30, 2019 at 10:09 PM Nishith <[email protected]> wrote:

> Thanks for the detailed design write up Vinoth. I concur with the others
> on option 2, default indexing as off and enable it when we have enough
> confidence on stability & performance. Although, I do think practically it
> might be good to have the code in place for users who might revert to an
> older build as part of some build rollback mechanisms that they may have in
> place (for reasons not even related to hudi). The latest data block
> (denoted by the latest version) being a new block as suggested by Balaji
> sounds like one option - not sure how complicated the code will become
> though...
> Will comment on the RFC about some doubts/concerns regarding first
> migrating customers from canIndexLogFiles = false to true and then rolling
> back, to ensure my understanding is correct.
>
> -Nishith
>
> Sent from my iPhone
>
> > On Oct 30, 2019, at 4:00 PM, Balaji Varadarajan <[email protected]> wrote:
> >
> > Thanks Vinoth for proposing a clean and extendable design. The overall
> > design looks great. Another rollout option is to only use the consolidated
> > log index for index lookup if the latest "valid" log block has been written
> > in the new format. If that is not the case, we can revert to scanning
> > previous log blocks for index lookup.
> >
> > Balaji.V
> >
> > On Tuesday, October 29, 2019, 07:52:00 PM PDT, Bhavani Sudha <[email protected]> wrote:
> >
> > I vote for the second option. Also it can give time to analyze how to
> > deal with backwards compatibility. I'll take a look at the RFC later
> > tonight and get back.
> >
> >> On Sun, Oct 27, 2019 at 10:24 AM Vinoth Chandar <[email protected]> wrote:
> >>
> >> One issue I have some open questions on myself:
> >>
> >> Is it ok to assume the log will have old data block versions, followed by
> >> new data block versions? For example, if we roll out new code and then
> >> revert, there could be an arbitrary mix of new and old data blocks.
> >> Handling this might make the design/code fairly complex. Alternatively, we
> >> can keep it simple for now, disable it by default, and only advise enabling
> >> it for new tables or when the hudi version is stable.
> >>
> >>> On Sun, Oct 27, 2019 at 12:13 AM Vinoth Chandar <[email protected]> wrote:
> >>>
> >>> https://cwiki.apache.org/confluence/display/HUDI/RFC-6+Add+indexing+support+to+the+log+file
> >>>
> >>> Feedback welcome, on this RFC tackling HUDI-86
> >>>

--
Regards,
-Sivabalan
