Yes, although I think we can also store the bloom filter in the log block footer. This way, if we want to log avro, we can still do it, and indexing would work. I plan to explain this clearly in an RFC. And yes, we will definitely add other metadata into the headers :)
Our log format is pretty flexible. (good job Nishith! :))

On Thu, Oct 24, 2019 at 5:16 AM Jaimin Shah <shahjaimin0...@gmail.com> wrote:

> Hi Vinoth,
> I looked at the example. I think this will enable faster realtime and
> incremental view. Along with that, could having a bloom filter footer, the
> same way we have in parquet files, help to index log files also?
> We are storing parquet length, which helps us to read the file. Will
> having a json header with file type, starting offset and length help? I
> think it will enable us to store multiple files of the same format or
> different formats (parquet + HBase) etc?
>
> Thanks,
> Jaimin
>
> On Thu, 24 Oct 2019 at 02:37, Kabeer Ahmed <kab...@linuxmail.org> wrote:
>
> > Vinoth,
> >
> > Thanks for clarification. :-).
> > I looked at the email from a periphery without getting into details. I
> > will review it thoroughly in a few days and catch up.
> > Thanks,
> >
> > On Oct 23 2019, at 3:06 pm, Vinoth Chandar <vin...@apache.org> wrote:
> > > yes. to the append log, that is used for compaction. My guess was
> > > Kabeer's concern was around actually sending user data into debug logs
> > > (slf4j/log4j), which we don't.
> > >
> > > On the second part. yes, we want the option to write parquet data
> > > inline instead of avro. Once we harden this, any other format e.g. ORC
> > > would also be easy to do. That's my thinking. WDYT?
> > >
> > > On Wed, Oct 23, 2019 at 6:28 AM Jaimin Shah <shahjaimin0...@gmail.com>
> > > wrote:
> > >
> > > > Hi Vinoth,
> > > > Aren't we writing user data to append log currently? The way I
> > > > understand is that currently data is written in avro which you want
> > > > to move to inline parquet. Please correct me if I am missing
> > > > something.
> > > >
> > > > Thanks,
> > > > Jaimin
> > > >
> > > > On Wednesday, 23 October 2019, Vinoth Chandar <vin...@apache.org>
> > > > wrote:
> > > > > Sure. Take your time! Just to clarify, here log refers to the Hudi
> > > > > append log, not user's log4j or such logs. yes that would be very
> > > > > strange to do. :)
> > > > >
> > > > > On Wed, Oct 23, 2019 at 3:06 AM Kabeer Ahmed <kab...@linuxmail.org>
> > > > > wrote:
> > > > >
> > > > > > Hi Vinoth,
> > > > > > Have a crazy week and the next 2 to 3 weeks are going to be very
> > > > > > busy. I haven't had a chance to look into this.
> > > > > > My thoughts are around security. The ideas of building external
> > > > > > indexes come with loads of advantages and throwing user data into
> > > > > > the logs etc makes me anxious. Let me do a deep dive and come back
> > > > > > to you.
> > > > > > Thanks
> > > > > > Kabeer.
> > > > > >
> > > > > > On Oct 21 2019, at 3:07 pm, Vinoth Chandar <vin...@apache.org>
> > > > > > wrote:
> > > > > > > Any thoughts? :) anyone?
> > > > > > >
> > > > > > > On Wed, Oct 9, 2019 at 11:06 AM Vinoth Chandar
> > > > > > > <vin...@apache.org> wrote:
> > > > > > > > Hi all,
> > > > > > > > Wanted to share some prototyping I was doing for HUDI-46. The
> > > > > > > > idea here is to see if we can embed a parquet file "inline"
> > > > > > > > into an outer file (our log), so that if the user chooses to
> > > > > > > > they can also get parquet data in the logs to speed up
> > > > > > > > real-time view queries. We would be using the standard
> > > > > > > > ParquetWriter and ParquetReader on top of a custom FileSystem
> > > > > > > > implementation.
> > > > > > > >
> > > > > > > > https://github.com/vinothchandar/incubator-hudi/commit/c60f4578f794d0f0d0e194b3e509cc0c5f132576
> > > > > > > >
> > > > > > > > Wrote a small PoC with TODOs and gaps annotated. Wanted to see
> > > > > > > > if you all can poke more holes here and see if we can
> > > > > > > > generalize to embedding any file, e.g. HFile.
> > > > > > > >
> > > > > > > > I believe we can generalize it and thus build things like
> > > > > > > > external indexing very easily on the existing log format.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Vinoth
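
For readers following the thread, the idea of a per-block JSON header carrying the file type and length, with the starting offset used to read an embedded file back out of the outer log, can be sketched roughly as below. This is only an illustration in Python under stated assumptions, not Hudi's actual log format and not the PoC's FileSystem implementation: the MAGIC marker, the block layout, and the names append_block, scan_blocks and read_inline are all hypothetical. The sketch records the starting offset while scanning rather than storing it in the header, which avoids the header having to know its own size.

```python
import io
import json
import struct

MAGIC = b"#LOG"  # hypothetical block marker, not Hudi's real magic bytes


def append_block(log, fmt, payload):
    """Append one block: MAGIC | header length (4 bytes) | JSON header | payload."""
    header = json.dumps({"format": fmt, "length": len(payload)}).encode("utf-8")
    log.seek(0, io.SEEK_END)
    log.write(MAGIC)
    log.write(struct.pack(">I", len(header)))
    log.write(header)
    log.write(payload)


def scan_blocks(log):
    """Walk the log and return one entry per block with format, offset, length."""
    entries = []
    log.seek(0)
    while True:
        magic = log.read(len(MAGIC))
        if not magic:
            break  # clean end of log
        if magic != MAGIC:
            raise ValueError("unexpected bytes where a block marker should be")
        (header_len,) = struct.unpack(">I", log.read(4))
        header = json.loads(log.read(header_len))
        entries.append({
            "format": header["format"],
            "offset": log.tell(),  # where the embedded file starts
            "length": header["length"],
        })
        log.seek(header["length"], io.SEEK_CUR)  # skip payload without reading it
    return entries


def read_inline(log, entry):
    """Return the embedded file's bytes, as an inline reader would see them."""
    log.seek(entry["offset"])
    return log.read(entry["length"])


# Usage: two embedded "files" of different formats in one outer log.
log = io.BytesIO()
append_block(log, "parquet", b"PAR1-fake-parquet-bytes-PAR1")
append_block(log, "hfile", b"fake-hfile-bytes")
entries = scan_blocks(log)
first = read_inline(log, entries[0])
```

Because each entry is just (format, offset, length), a reader for any format can be pointed at its slice of the outer file, which is the generalization to HFile or ORC that the thread is asking about.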