Re: [DISCUSS] Hudi is the data lake platform

Vinoth Chandar Mon, 19 Apr 2021 16:43:02 -0700

Looks like we have consensus here!  Will share the blog PR here once ready.


Thanks all!

On Fri, Apr 16, 2021 at 8:43 PM Sivabalan <[email protected]> wrote:

> totally +1 on clarifying Hudi's vision.
>
> On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal <[email protected]>
> wrote:
>
> > +1
> >
> > I also believe Hudi is a Data Platform technology providing many different
> > functionalities to build modern data lakes, Hudi's table format being just
> > one of them. I've been using this perspective in some of the conference
> > talks already ;)
> > With this rebranding (and hopefully some code/package structuring down the
> > road..), it's easier for us to communicate the value add of Hudi and its
> > associated features and generate interest for future contributors.
> >
> > Thanks,
> > Nishith
> >
> >
> > On Tue, Apr 13, 2021 at 7:52 PM Vinoth Chandar <[email protected]>
> wrote:
> >
> > > Thanks everyone for the feedback, so far!
> > >
> > > > On the incremental aspects, that's actually Hudi's core design
> > > > differentiation. While I believe ETL today is still largely batch
> > > > oriented, the way forward for everyone's benefit is indeed incremental
> > > > processing. We have already taken a giant step here, e.g., in making raw
> > > > data ingestion fully incremental using deltastreamer. We should keep
> > > > working to crack incremental ETL at large.
> > > > 100% with your line of thinking!
> > >
> > > It's been in my head for four full years now! :)
> > >
> > >
> >
> https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
> > >
> > > > I have started drafting a blog/PR along these lines already. I will
> > > > make it more final and share here, as we wait a couple more days for
> > > > more feedback!
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Apr 13, 2021 at 7:01 PM Danny Chan <[email protected]>
> wrote:
> > >
> > > > > +1 for the vision. Personally, I find the incremental ETL part
> > > > > promising: with an engine like Apache Flink we can do intermediate
> > > > > aggregation in streaming style.
> > > >
> > > > Best,
> > > > Danny Chan
> > > >
> > > > > On Wed, Apr 14, 2021 at 9:52 AM leesf <[email protected]> wrote:
> > > >
> > > > > +1. Cool and promising.
> > > > >
> > > > > > On Wed, Apr 14, 2021 at 2:57 AM Mehrotra, Udit <[email protected]> wrote:
> > > > >
> > > > > > > Agree with the rebranding, Vinoth. Hudi is not just a "table format"
> > > > > > > and we need to do justice to all the cool auxiliary
> > > > > > > features/services we have built.
> > > > > > >
> > > > > > > Also, a timeline metadata service in particular would be a really
> > > > > > > big win if we move towards something like that.
> > > > > >
> > > > > > > On 4/13/21, 11:01 AM, "Pratyaksh Sharma" <[email protected]> wrote:
> > > > > >
> > > > > > >     Definitely we are doing much more than only ingesting and
> > > > > > >     managing data over DFS.
> > > > > >
> > > > > >     +1 from my side as well. :)
> > > > > >
> > > > > >     On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <
> > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > >     > I love this rebranding. Totally agree. +1
> > > > > >     >
> > > > > >     > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> > > > > > [email protected]>
> > > > > >     > wrote:
> > > > > >     >
> > > > > >     > > +1 The vision looks fantastic.
> > > > > >     > >
> > > > > >     > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <
> [email protected]
> > >
> > > > > wrote:
> > > > > >     > >
> > > > > >     > > > Awesome summary of Hudi! +1 as well.
> > > > > >     > > >
> > > > > >     > > > Gary Li
> > > > > >     > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> > > > > > [email protected]>
> > > > > >     > > > wrote:
> > > > > >     > > > > Excellent, I agree
> > > > > >     > > > >
> > > > > > >     > > > > On Tue, Apr 13, 2021 at 07:23, vino yang <[email protected]> wrote:
> > > > > >     > > > >
> > > > > >     > > > > > +1 Excited by this new vision!
> > > > > >     > > > > >
> > > > > >     > > > > > Best,
> > > > > >     > > > > > Vino
> > > > > >     > > > > >
> > > > > > >     > > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang <[email protected]> wrote:
> > > > > >     > > > > >
> > > > > > >     > > > > > > +1 The new brand is straightforward, a better description of Hudi.
> > > > > >     > > > > > >
> > > > > >     > > > > > > Best,
> > > > > >     > > > > > > Dianjin Wang
> > > > > >     > > > > > >
> > > > > >     > > > > > >
> > > > > >     > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
> > > > > >     > > > [email protected]>
> > > > > >     > > > > > > wrote:
> > > > > >     > > > > > >
> > > > > > >     > > > > > > > +1. Cannot agree more. I think this makes total sense and will
> > > > > > >     > > > > > > > provide for a much better representation of the project.
> > > > > >     > > > > > > >
> > > > > >     > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth
> Chandar <
> > > > > >     > > [email protected]
> > > > > >     > > > >
> > > > > >     > > > > > > wrote:
> > > > > >     > > > > > > >
> > > > > >     > > > > > > > > Hello all,
> > > > > >     > > > > > > > >
> > > > > > >     > > > > > > > > Reading one more article today positioning Hudi as just a
> > > > > > >     > > > > > > > > table format made me wonder if we have done enough justice in
> > > > > > >     > > > > > > > > explaining what we have built together here.
> > > > > > >     > > > > > > > > I tend to think of Hudi as the data lake platform, which has
> > > > > > >     > > > > > > > > the following components, of which one is a table format and
> > > > > > >     > > > > > > > > one is a transactional storage layer.
> > > > > > >     > > > > > > > > But the whole stack we have is definitely worth more than the
> > > > > > >     > > > > > > > > sum of all the parts IMO (speaking from my own experience from
> > > > > > >     > > > > > > > > the past 10+ years of open source software dev).
> > > > > >     > > > > > > > >
> > > > > >     > > > > > > > > Here's what we have built so far.
> > > > > >     > > > > > > > >
> > > > > > >     > > > > > > > > a) *table format*: something that stores table schema, a
> > > > > > >     > > > > > > > > metadata table that stores file listings today, and is being
> > > > > > >     > > > > > > > > extended to store column ranges and more in the future (RFC-27)
> > > > > > >     > > > > > > > > b) *aux metadata*: bloom filters and external record-level
> > > > > > >     > > > > > > > > indexes today; bitmaps/interval trees and other advanced
> > > > > > >     > > > > > > > > on-disk data structures tomorrow
> > > > > > >     > > > > > > > > c) *concurrency control*: we have always supported MVCC-based,
> > > > > > >     > > > > > > > > log-based concurrency (serialize writes into a time-ordered
> > > > > > >     > > > > > > > > log), and we now also have OCC for batch merge workloads with
> > > > > > >     > > > > > > > > 0.8.0. We will have multi-table and fully non-blocking writers
> > > > > > >     > > > > > > > > soon (see the future work section of RFC-22)
> > > > > > >     > > > > > > > > d) *updates/deletes*: this is the bread-and-butter use case
> > > > > > >     > > > > > > > > for Hudi, but we also support primary/unique key constraints,
> > > > > > >     > > > > > > > > and we could add foreign keys as an extension once our
> > > > > > >     > > > > > > > > transactions can span tables.
> > > > > > >     > > > > > > > > e) *table services*: a Hudi pipeline today is self-managing -
> > > > > > >     > > > > > > > > it sizes files, cleans, compacts, clusters data, and bootstraps
> > > > > > >     > > > > > > > > existing data - with all these actions working off each other
> > > > > > >     > > > > > > > > without blocking one another (for the most part).
> > > > > > >     > > > > > > > > f) *data services*: we also have higher-level functionality
> > > > > > >     > > > > > > > > with deltastreamer sources (scalable DFS listing source, Kafka,
> > > > > > >     > > > > > > > > Pulsar is coming, ...and more), incremental ETL support,
> > > > > > >     > > > > > > > > de-duplication, and commit callbacks; pre-commit validations
> > > > > > >     > > > > > > > > are coming, and error tables have been proposed. I could also
> > > > > > >     > > > > > > > > envision us building towards streaming egress and data
> > > > > > >     > > > > > > > > monitoring.
> > > > > >     > > > > > > > >
> > > > > > >     > > > > > > > > I also think we should build the following (subject to
> > > > > > >     > > > > > > > > separate DISCUSS/RFCs)
> > > > > >     > > > > > > > >
> > > > > > >     > > > > > > > > g) *caching service*: a Hudi-specific caching service that can
> > > > > > >     > > > > > > > > hold mutable data and serve oft-queried data across engines.
> > > > > > >     > > > > > > > > h) *timeline metaserver*: We already run a metaserver in Spark
> > > > > > >     > > > > > > > > writers/drivers, backed by RocksDB & even Hudi's metadata
> > > > > > >     > > > > > > > > table. Let's turn it into a scalable, sharded metastore that
> > > > > > >     > > > > > > > > all engines can use to obtain any metadata.
> > > > > >     > > > > > > > >
> > > > > > >     > > > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> > > > > > >     > > > > > > > > opposed to "ingests & manages storage of large analytical
> > > > > > >     > > > > > > > > datasets over DFS (hdfs or cloud stores)." and convey the scope
> > > > > > >     > > > > > > > > of our vision, given we have already been building towards
> > > > > > >     > > > > > > > > that. It would also provide new contributors a good lens to
> > > > > > >     > > > > > > > > look at the project from.
> > > > > >     > > > > > > > >
> > > > > > >     > > > > > > > > (This is very similar to, e.g., the evolution of Kafka from a
> > > > > > >     > > > > > > > > pub-sub system to an event streaming platform, with the
> > > > > > >     > > > > > > > > addition of MirrorMaker/Connect etc.)
> > > > > >     > > > > > > > >
> > > > > >     > > > > > > > > Please share your thoughts!
> > > > > >     > > > > > > > >
> > > > > >     > > > > > > > > Thanks
> > > > > >     > > > > > > > > Vinoth
> > > > > >     > > > > > > > >
> > > > > >     > > > > > > >
> > > > > >     > > > > > >
> > > > > >     > > > > >
> > > > > >     > > > >
> > > > > >     > > >
> > > > > >     > >
> > > > > >     >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>
