Re: [DISCUSS] Hudi is the data lake platform

nishith agarwal Wed, 14 Apr 2021 00:43:57 -0700

+1

I also believe Hudi is a Data Platform technology providing many different
functionalities to build modern data lakes, Hudi's table format being just
one of them. I've been using this perspective in some of the conference
talks already ;)
With this rebranding (and hopefully some code/package restructuring down the
road), it will be easier for us to communicate Hudi's value and its
associated features, and to generate interest among future contributors.


Thanks,
Nishith


On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar <[email protected]> wrote:

> Thanks everyone for the feedback, so far!
>
> On the incremental aspects, that's actually Hudi's core design
> differentiation. While I believe ETL today is still largely batch
> oriented, the way forward for everyone's benefit is indeed incremental
> processing. We have already taken a giant step here, e.g. in making raw
> data ingestion fully incremental using deltastreamer. We should keep
> working to crack incremental ETL at large. 100% with your line of
> thinking!
>
> It's been in my head for four full years now! :)
>
> https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
>
> I have started drafting a blog/PR along these lines already. I will
> finalize it and share it here, as we wait a couple more days for more
> feedback!
>
> Thanks
> Vinoth
>
> On Tue, Apr 13, 2021 at 7:01 PM Danny Chan <[email protected]> wrote:
>
> > +1 for the vision. Personally, I find the incremental ETL part
> > promising: with an engine like Apache Flink, we can do intermediate
> > aggregation in streaming style.
> >
> > Best,
> > Danny Chan
> >
> > On Wed, Apr 14, 2021 at 9:52 AM leesf <[email protected]> wrote:
> >
> > > +1. Cool and promising.
> > >
> > > On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit <[email protected]> wrote:
> > >
> > > > Agree with the rebranding Vinoth. Hudi is not just a "table format"
> > > > and we need to do justice to all the cool auxiliary
> > > > features/services we have built.
> > > >
> > > > Also, timeline metadata service in particular would be a really big
> > > > win if we move towards something like that.
> > > >
> > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" <[email protected]> wrote:
> > > >
> > > >     Definitely we are doing much more than only ingesting and
> > > >     managing data over DFS.
> > > >
> > > >     +1 from my side as well. :)
> > > >
> > > >     On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <[email protected]> wrote:
> > > >
> > > >     > I love this rebranding. Totally agree. +1
> > > >     >
> > > >     > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <[email protected]> wrote:
> > > >     >
> > > >     > > +1 The vision looks fantastic.
> > > >     > >
> > > >     > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <[email protected]> wrote:
> > > >     > >
> > > >     > > > Awesome summary of Hudi! +1 as well.
> > > >     > > >
> > > >     > > > Gary Li
> > > >     > > > On 2021/04/13 14:13:24, Rubens Rodrigues <[email protected]> wrote:
> > > >     > > > > Excellent, I agree
> > > >     > > > >
> > > >     > > > > On Tue, Apr 13, 2021 at 07:23, vino yang <[email protected]> wrote:
> > > >     > > > >
> > > >     > > > > > +1 Excited by this new vision!
> > > >     > > > > >
> > > >     > > > > > Best,
> > > >     > > > > > Vino
> > > >     > > > > >
> > > >     > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang <[email protected]> wrote:
> > > >     > > > > >
> > > >     > > > > > > +1 The new brand is straightforward, a better
> > > >     > > > > > > description of Hudi.
> > > >     > > > > > >
> > > >     > > > > > > Best,
> > > >     > > > > > > Dianjin Wang
> > > >     > > > > > >
> > > >     > > > > > >
> > > >     > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <[email protected]> wrote:
> > > >     > > > > > >
> > > >     > > > > > > > +1. Cannot agree more. I think this makes total
> > > >     > > > > > > > sense and will provide for a much better
> > > >     > > > > > > > representation of the project.
> > > >     > > > > > > >
> > > >     > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <[email protected]> wrote:
> > > >     > > > > > > >
> > > >     > > > > > > > > Hello all,
> > > >     > > > > > > > >
> > > >     > > > > > > > > Reading one more article today, positioning Hudi
> > > >     > > > > > > > > as just a table format, made me wonder if we have
> > > >     > > > > > > > > done enough justice in explaining what we have
> > > >     > > > > > > > > built together here.
> > > >     > > > > > > > > I tend to think of Hudi as the data lake platform,
> > > >     > > > > > > > > which has the following components, of which one
> > > >     > > > > > > > > is a table format and one is a transactional
> > > >     > > > > > > > > storage layer.
> > > >     > > > > > > > > But the whole stack we have is definitely worth
> > > >     > > > > > > > > more than the sum of all the parts IMO (speaking
> > > >     > > > > > > > > from my own experience from the past 10+ years of
> > > >     > > > > > > > > open source software dev).
> > > >     > > > > > > > >
> > > >     > > > > > > > > Here's what we have built so far.
> > > >     > > > > > > > >
> > > >     > > > > > > > > a) *table format*: something that stores the table
> > > >     > > > > > > > > schema, plus a metadata table that stores file
> > > >     > > > > > > > > listings today and is being extended to store
> > > >     > > > > > > > > column ranges and more in the future (RFC-27)
> > > >     > > > > > > > > b) *aux metadata*: bloom filters and external
> > > >     > > > > > > > > record level indexes today, bitmaps/interval trees
> > > >     > > > > > > > > and other advanced on-disk data structures
> > > >     > > > > > > > > tomorrow
> > > >     > > > > > > > > c) *concurrency control*: we have always supported
> > > >     > > > > > > > > MVCC-based, log-based concurrency (serialize
> > > >     > > > > > > > > writes into a time ordered log), and we now also
> > > >     > > > > > > > > have OCC for batch merge workloads with 0.8.0. We
> > > >     > > > > > > > > will have multi-table and fully non-blocking
> > > >     > > > > > > > > writers soon (see future work section of RFC-22)
> > > >     > > > > > > > > d) *updates/deletes*: this is the bread-and-butter
> > > >     > > > > > > > > use-case for Hudi, but we also support
> > > >     > > > > > > > > primary/unique key constraints and we could add
> > > >     > > > > > > > > foreign keys as an extension, once our
> > > >     > > > > > > > > transactions can span tables.
> > > >     > > > > > > > > e) *table services*: a Hudi pipeline today is
> > > >     > > > > > > > > self-managing: it sizes files, cleans, compacts,
> > > >     > > > > > > > > clusters data and bootstraps existing data, with
> > > >     > > > > > > > > all these actions working off each other without
> > > >     > > > > > > > > blocking one another (for the most part).
> > > >     > > > > > > > > f) *data services*: we also have higher level
> > > >     > > > > > > > > functionality with deltastreamer sources (scalable
> > > >     > > > > > > > > DFS listing source, Kafka, Pulsar is coming,
> > > >     > > > > > > > > ...and more), incremental ETL support,
> > > >     > > > > > > > > de-duplication and commit callbacks; pre-commit
> > > >     > > > > > > > > validations are coming and error tables have been
> > > >     > > > > > > > > proposed. I could also envision us building
> > > >     > > > > > > > > towards streaming egress and data monitoring.
> > > >     > > > > > > > >
> > > >     > > > > > > > > I also think we should build the following
> > > >     > > > > > > > > (subject to separate DISCUSS/RFCs)
> > > >     > > > > > > > >
> > > >     > > > > > > > > g) *caching service*: Hudi specific caching
> > > >     > > > > > > > > service that can hold mutable data and serve
> > > >     > > > > > > > > oft-queried data across engines.
> > > >     > > > > > > > > h) *timeline metaserver*: We already run a
> > > >     > > > > > > > > metaserver in Spark writer/drivers, backed by
> > > >     > > > > > > > > RocksDB & even Hudi's metadata table. Let's turn
> > > >     > > > > > > > > it into a scalable, sharded metastore that all
> > > >     > > > > > > > > engines can use to obtain any metadata.
> > > >     > > > > > > > >
> > > >     > > > > > > > > To this end, I propose we rebrand to "*Data Lake
> > > >     > > > > > > > > Platform*" as opposed to "ingests & manages
> > > >     > > > > > > > > storage of large analytical datasets over DFS
> > > >     > > > > > > > > (hdfs or cloud stores)." and convey the scope of
> > > >     > > > > > > > > our vision, given we have already been building
> > > >     > > > > > > > > towards that. It would also provide new
> > > >     > > > > > > > > contributors a good lens to look at the project
> > > >     > > > > > > > > from.
> > > >     > > > > > > > >
> > > >     > > > > > > > > (This is very similar to, e.g., the evolution of
> > > >     > > > > > > > > Kafka from a pub-sub system to an event streaming
> > > >     > > > > > > > > platform, with the addition of
> > > >     > > > > > > > > MirrorMaker/Connect etc.)
> > > >     > > > > > > > >
> > > >     > > > > > > > > Please share your thoughts!
> > > >     > > > > > > > >
> > > >     > > > > > > > > Thanks
> > > >     > > > > > > > > Vinoth
> > > >     > > > > > > > >
> > > >     > > > > > > >
> > > >     > > > > > >
> > > >     > > > > >
> > > >     > > > >
> > > >     > > >
> > > >     > >
> > > >     >
> > > >
> > > >
> > >
> >
>
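
For readers new to the thread, the "serialize writes into a time ordered
log" and incremental-processing ideas discussed above can be illustrated
with a toy model. This is only a sketch under stated assumptions, not
Hudi's implementation or API: ToyTimeline, Commit, ConflictError, the
integer instants and the file ids such as "f1" are all hypothetical names
made up for illustration. The optimistic check shown here (reject a
writer if a later commit touched the same files) is an assumed
simplification of what OCC for batch workloads might look like.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


class ConflictError(Exception):
    """Raised when a commit made after the writer's read touched the same files."""


@dataclass
class Commit:
    instant: int              # position in the time-ordered log (hypothetical)
    records: Dict[str, dict]  # record key -> value written by this commit
    files: Set[str]           # hypothetical ids of the files this commit touched


@dataclass
class ToyTimeline:
    commits: List[Commit] = field(default_factory=list)

    def latest_instant(self) -> int:
        return self.commits[-1].instant if self.commits else 0

    def commit(self, records: Dict[str, dict], files: Set[str], read_instant: int) -> int:
        """Append a commit; fail if a commit after read_instant overlaps our files."""
        touched = set(files)
        for c in self.commits:
            overlap = c.files & touched
            if c.instant > read_instant and overlap:
                raise ConflictError(
                    f"files {sorted(overlap)} changed at instant {c.instant}"
                )
        instant = self.latest_instant() + 1
        self.commits.append(Commit(instant, dict(records), touched))
        return instant

    def incremental_read(self, since: int) -> Dict[str, dict]:
        """Return only records written after `since`, keeping the latest value per key."""
        changed: Dict[str, dict] = {}
        for c in self.commits:
            if c.instant > since:
                changed.update(c.records)
        return changed

    def snapshot(self) -> Dict[str, dict]:
        """Full view of the table: every key with its latest value."""
        return self.incremental_read(0)


timeline = ToyTimeline()
first = timeline.commit({"a": {"v": 1}, "b": {"v": 1}}, {"f1"}, read_instant=0)
second = timeline.commit({"a": {"v": 2}}, {"f1"}, read_instant=first)

# A downstream job that already processed `first` only sees what changed since.
print(timeline.incremental_read(since=first))
```

A downstream consumer would remember the last instant it processed and
pass it as `since` on the next run, so each run reads only new changes
instead of rescanning the whole table.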
