Hi Vinoth,

I looked at the example. I think this will enable a faster real-time and incremental view. Along with that, would a bloom filter footer, like the one we have in parquet files, help to index the log files as well? We are storing the parquet length, which helps us read the file. Would a JSON header with file type, starting offset, and length help? I think it would let us store multiple files of the same format or of different formats (parquet + HBase, etc.)?
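A minimal sketch of the JSON header idea, only to make the question concrete. It is not the Hudi log format or the PoC's layout: the 4-byte length prefix, the field names ("type", "offset", "length"), and the function names write_container and read_entry are all assumptions made up for illustration, and the payloads are treated as opaque bytes.

```python
import io
import json
import struct


def write_container(out, parts):
    """Write several embedded files behind one JSON header.

    parts is a list of (file_type, payload_bytes) tuples.
    Hypothetical layout: 4-byte big-endian header length, the JSON header,
    then the payloads back to back. Offsets are relative to the end of
    the header.
    """
    entries = []
    offset = 0
    for file_type, payload in parts:
        entries.append({"type": file_type, "offset": offset, "length": len(payload)})
        offset += len(payload)
    header = json.dumps({"entries": entries}).encode("utf-8")
    out.write(struct.pack(">I", len(header)))
    out.write(header)
    for _, payload in parts:
        out.write(payload)


def read_entry(inp, index):
    """Return (file_type, payload_bytes) for one entry, reading only its byte range."""
    inp.seek(0)
    (header_len,) = struct.unpack(">I", inp.read(4))
    header = json.loads(inp.read(header_len).decode("utf-8"))
    entry = header["entries"][index]
    # Skip the length prefix and the header, then jump to the entry's offset.
    inp.seek(4 + header_len + entry["offset"])
    return entry["type"], inp.read(entry["length"])
```

With something like this, a reader could pick out just the parquet part or just the HFile part by index without scanning the other payloads, which is the property I am asking about.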
Thanks,
Jaimin

On Thu, 24 Oct 2019 at 02:37, Kabeer Ahmed <[email protected]> wrote:

> Vinoth,
>
> Thanks for clarification. :-).
> I looked at the email from a periphery without getting into details. I
> will review it thoroughly in few days and catch up.
> Thanks,
> On Oct 23 2019, at 3:06 pm, Vinoth Chandar <[email protected]> wrote:
> > yes. to the append log, that is used for compaction. My guess was Kabeer's
> > concern was around actually sending user data into debug logs (slf4j/log4j)
> > which we dont.
> >
> > On the second part. yes, we want to option to write parquet data inline
> > instead of avro. Once we harden this, any other format e.g Orc would also
> > be easy to do. Thats my thinking. WDYT?
> >
> > On Wed, Oct 23, 2019 at 6:28 AM Jaimin Shah <[email protected]> wrote:
> >
> > > Hi Vinoth,
> > > Aren’t we writing user data to append log currently? The way I
> > > understand is that currently data is written in avro which you want to move
> > > to inline parquet. Please correct me if I am missing something.
> > >
> > > Thanks,
> > > Jaimin
> > >
> > > On Wednesday, 23 October 2019, Vinoth Chandar <[email protected]> wrote:
> > > > Sure. Take your time! Just to clarify, here log refers to the Hudi append
> > > > log, not user's log4j or such logs. yes that would be very strange to do.
> > > > :)
> > > >
> > > > On Wed, Oct 23, 2019 at 3:06 AM Kabeer Ahmed <[email protected]> wrote:
> > > >
> > > > > Hi Vinoth,
> > > > > Have crazy week and the next 2 to 3 weeks are going to be very busy. I
> > > > > havent had a chance to look into this.
> > > > > My thoughts are around security. The ideas of building external indexes
> > > > > come with loads of advantages and throwing user data into the logs etc
> > > > > makes me anxious. Let me do a deep dive and come back to you.
> > > > > Thanks
> > > > > Kabeer.
> > > > >
> > > > > On Oct 21 2019, at 3:07 pm, Vinoth Chandar <[email protected]> wrote:
> > > > > > Any thoughts? :) anyone?
> > > > > >
> > > > > > On Wed, Oct 9, 2019 at 11:06 AM Vinoth Chandar <[email protected]> wrote:
> > > > > > > Hi all,
> > > > > > > Wanted to share some prototyping I was doing for HUDI-46. The idea here is
> > > > > > > to see if we can embed a parquet file "inline" into an outer file (our
> > > > > > > log), so that if the user chooses to they can also get parquet data in the
> > > > > > > logs to speed up real-time view queries. We would be using the standard
> > > > > > > ParquetWriter and ParquetReader on top of a custom FileSystem
> > > > > > > implementation.
> > > > > > >
> > > > > > > https://github.com/vinothchandar/incubator-hudi/commit/c60f4578f794d0f0d0e194b3e509cc0c5f132576
> > > > > > > Wrote a small PoC with TODOs and gaps annotated. Wanted to see if you all
> > > > > > > can poke more holes here and see if can generalize to embedding any file
> > > > > > > for e.g HFile..
> > > > > > >
> > > > > > > I believe we can generalize it and thus build things like external
> > > > > > > indexing very easily on the existing log format.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
