Re: [DISCUSS] Hudi is the data lake platform

Pratyaksh Sharma Tue, 13 Apr 2021 11:01:23 -0700

Definitely we are doing much more than only ingesting and managing data
over DFS.


+1 from my side as well. :)

On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <[email protected]> wrote:

> I love this rebranding. Totally agree. +1
>
> On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <[email protected]>
> wrote:
>
> > +1 The vision looks fantastic.
> >
> > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <[email protected]> wrote:
> >
> > > Awesome summary of Hudi! +1 as well.
> > >
> > > Gary Li
> > > On 2021/04/13 14:13:24, Rubens Rodrigues <[email protected]>
> > > wrote:
> > > > Excellent, I agree
> > > >
> > > > On Tue, Apr 13, 2021 at 07:23, vino yang <[email protected]> wrote:
> > > >
> > > > > +1 Excited by this new vision!
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang <[email protected]> wrote:
> > > > >
> > > > > > > +1 The new brand is straightforward, a better description of
> > > > > > > Hudi.
> > > > > >
> > > > > > Best,
> > > > > > Dianjin Wang
> > > > > >
> > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <[email protected]>
> > > > > > > wrote:
> > > > > >
> > > > > > > > +1. Cannot agree more. I think this makes total sense and will
> > > > > > > > provide for a much better representation of the project.
> > > > > > >
> > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <[email protected]>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > > Reading one more article today positioning Hudi as just a table
> > > > > > > > > format made me wonder whether we have done enough justice in
> > > > > > > > > explaining what we have built together here.
> > > > > > > > > I tend to think of Hudi as the data lake platform, which has the
> > > > > > > > > following components, of which one is a table format and one is a
> > > > > > > > > transactional storage layer.
> > > > > > > > > But the whole stack we have is definitely worth more than the sum
> > > > > > > > > of all the parts IMO (speaking from my own experience from the
> > > > > > > > > past 10+ years of open source software dev).
> > > > > > > >
> > > > > > > > Here's what we have built so far.
> > > > > > > >
> > > > > > > > > a) *table format* : something that stores the table schema, plus
> > > > > > > > > a metadata table that stores file listings today and is being
> > > > > > > > > extended to store column ranges and more in the future (RFC-27)
> > > > > > > > > b) *aux metadata* : bloom filters and external record-level
> > > > > > > > > indexes today; bitmaps/interval trees and other advanced on-disk
> > > > > > > > > data structures tomorrow
> > > > > > > > > c) *concurrency control* : we have always supported MVCC-based,
> > > > > > > > > log-based concurrency (serializing writes into a time-ordered
> > > > > > > > > log), and we now also have OCC for batch merge workloads with
> > > > > > > > > 0.8.0. We will have multi-table and fully non-blocking writers
> > > > > > > > > soon (see the future work section of RFC-22)
> > > > > > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> > > > > > > > > Hudi, but we also support primary/unique key constraints, and we
> > > > > > > > > could add foreign keys as an extension once our transactions can
> > > > > > > > > span tables.
> > > > > > > > > e) *table services*: a Hudi pipeline today is self-managing - it
> > > > > > > > > sizes files, cleans, compacts, clusters data, bootstraps existing
> > > > > > > > > data - with all these actions working off each other without
> > > > > > > > > blocking one another (for the most part).
> > > > > > > > > f) *data services*: we also have higher-level functionality with
> > > > > > > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar
> > > > > > > > > is coming, ...and more), incremental ETL support, de-duplication,
> > > > > > > > > and commit callbacks; pre-commit validations are coming, and
> > > > > > > > > error tables have been proposed. I could also envision us
> > > > > > > > > building towards streaming egress and data monitoring.
> > > > > > > >
> > > > > > > > > I also think we should build the following (subject to separate
> > > > > > > > > DISCUSS/RFCs)
> > > > > > > >
> > > > > > > > > g) *caching service*: a Hudi-specific caching service that can
> > > > > > > > > hold mutable data and serve oft-queried data across engines.
> > > > > > > > > h) *timeline metaserver*: We already run a metaserver in Spark
> > > > > > > > > writer/drivers, backed by RocksDB & even Hudi's metadata table.
> > > > > > > > > Let's turn it into a scalable, sharded metastore that all engines
> > > > > > > > > can use to obtain any metadata.
> > > > > > > >
> > > > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*", as
> > > > > > > > > opposed to "ingests & manages storage of large analytical
> > > > > > > > > datasets over DFS (hdfs or cloud stores)", and convey the scope
> > > > > > > > > of our vision, given we have already been building towards that.
> > > > > > > > > It would also provide new contributors a good lens to look at the
> > > > > > > > > project from.
> > > > > > > >
> > > > > > > > > (This is very similar to, e.g., the evolution of Kafka from a
> > > > > > > > > pub-sub system to an event streaming platform, with the addition
> > > > > > > > > of MirrorMaker/Connect etc.)
> > > > > > > >
> > > > > > > > Please share your thoughts!
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Vinoth
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
