Re: [DISCUSS] Hudi is the data lake platform

leesf Tue, 13 Apr 2021 18:52:58 -0700

+1. Cool and promising.

On Wed, Apr 14, 2021 at 2:57 AM, Mehrotra, Udit <[email protected]> wrote:


> Agree with the rebranding Vinoth. Hudi is not just a "table format" and we
> need to do justice to all the cool auxiliary features/services we have
> built.
>
> Also, timeline metadata service in particular would be a really big win if
> we move towards something like that.
>
> On 4/13/21, 11:01 AM, "Pratyaksh Sharma" <[email protected]> wrote:
>
>     Definitely we are doing much more than only ingesting and managing data
>     over DFS.
>
>     +1 from my side as well. :)
>
>     On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <[email protected]>
> wrote:
>
>     > I love this rebranding. Totally agree. +1
>     >
>     > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <
> [email protected]>
>     > wrote:
>     >
>     > > +1 The vision looks fantastic.
>     > >
>     > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <[email protected]> wrote:
>     > >
>     > > > Awesome summary of Hudi! +1 as well.
>     > > >
>     > > > Gary Li
>     > > > On 2021/04/13 14:13:24, Rubens Rodrigues <
> [email protected]>
>     > > > wrote:
>     > > > > Excellent, I agree
>     > > > >
>     > > > > On Tue, Apr 13, 2021 at 07:23, vino yang <[email protected]> wrote:
>     > > > >
>     > > > > > +1 Excited by this new vision!
>     > > > > >
>     > > > > > Best,
>     > > > > > Vino
>     > > > > >
>     > > > > > On Tue, Apr 13, 2021 at 3:53 PM, Dianjin Wang <[email protected]> wrote:
>     > > > > >
>     > > > > > > +1  The new brand is straightforward, a better description
> of
>     > Hudi.
>     > > > > > >
>     > > > > > > Best,
>     > > > > > > Dianjin Wang
>     > > > > > >
>     > > > > > >
>     > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <
>     > > > [email protected]>
>     > > > > > > wrote:
>     > > > > > >
>     > > > > > > > +1 . Cannot agree more. I think this makes total sense
> and will
>     > > > provide
>     > > > > > > for
>     > > > > > > > a much better representation of the project.
>     > > > > > > >
>     > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <
>     > > [email protected]
>     > > > >
>     > > > > > > wrote:
>     > > > > > > >
>     > > > > > > > > Hello all,
>     > > > > > > > >
>     > > > > > > > > Reading one more article today, positioning Hudi, as
> just a
>     > > table
>     > > > > > > format,
>     > > > > > > > > made me wonder, if we have done enough justice in
> explaining
>     > > > what we
>     > > > > > > have
>     > > > > > > > > built together here.
>     > > > > > > > > I tend to think of Hudi as the data lake platform,
> which has
>     > > the
>     > > > > > > > following
>     > > > > > > > > components, of which - one is a table format, one is a
>     > > > transactional
>     > > > > > > > > storage layer.
>     > > > > > > > > But the whole stack we have is definitely worth more
> than the
>     > > > sum of
>     > > > > > > all
>     > > > > > > > > the parts IMO (speaking from my own experience from
> the past
>     > > 10+
>     > > > > > years
>     > > > > > > of
>     > > > > > > > > open source software dev).
>     > > > > > > > >
>     > > > > > > > > Here's what we have built so far.
>     > > > > > > > >
>     > > > > > > > > a) *table format* : something that stores table
> schema, a
>     > > > metadata
>     > > > > > > table
>     > > > > > > > > that stores file listing today, and being extended to
> store
>     > > > column
>     > > > > > > ranges
>     > > > > > > > > and more in the future (RFC-27)
>     > > > > > > > > b) *aux metadata* : bloom filters, external record
> level
>     > > indexes
>     > > > > > today,
>     > > > > > > > > bitmaps/interval trees and other advanced on-disk data
>     > > structures
>     > > > > > > > tomorrow
>     > > > > > > > > c) *concurrency control* : we have always supported MVCC-based,
>     > > > > > > > > log-based concurrency (serialize writes into a time ordered
> log), and
>     > we
>     > > > now
>     > > > > > also
>     > > > > > > > > have OCC for batch merge workloads with 0.8.0. We will
> have
>     > > > > > multi-table
>     > > > > > > > and
>     > > > > > > > > fully non-blocking writers soon (see future work
> section of
>     > > > RFC-22)
>     > > > > > > > > d) *updates/deletes* : this is the bread-and-butter
> use-case
>     > > for
>     > > > > > Hudi,
>     > > > > > > > but
>     > > > > > > > > we support primary/unique key constraints and we could
> add
>     > > > foreign
>     > > > > > keys
>     > > > > > > > as
>     > > > > > > > > an extension, once our transactions can span tables.
>     > > > > > > > > e) *table services*: a hudi pipeline today is
> self-managing -
>     > > > sizes
>     > > > > > > > files,
>     > > > > > > > > cleans, compacts, clusters data, bootstraps existing
> data -
>     > all
>     > > > these
>     > > > > > > > > actions working off each other without blocking one
> another.
>     > > (for
>     > > > > > most
>     > > > > > > > > parts).
>     > > > > > > > > f) *data services*: we also have higher level
> functionality
>     > > with
>     > > > > > > > > deltastreamer sources (scalable DFS listing source,
> Kafka,
>     > > > Pulsar is
>     > > > > > > > > coming, ...and more), incremental ETL support,
>     > de-duplication,
>     > > > commit
>     > > > > > > > > callbacks, pre-commit validations are coming, error
> tables
>     > have
>     > > > been
>     > > > > > > > > proposed. I could also envision us building towards
> streaming
>     > > > egress,
>     > > > > > > > data
>     > > > > > > > > monitoring.
>     > > > > > > > >
>     > > > > > > > > I also think we should build the following (subject to
>     > separate
>     > > > > > > > > DISCUSS/RFCs)
>     > > > > > > > >
>     > > > > > > > > g) *caching service*: Hudi specific caching service
> that can
>     > > hold
>     > > > > > > mutable
>     > > > > > > > > data and serve oft-queried data across engines.
>     > > > > > > > > h) *timeline metaserver:* We already run a metaserver
> in
>     > spark
>     > > > > > > > > writer/drivers, backed by rocksDB & even Hudi's
> metadata
>     > table.
>     > > > Let's
>     > > > > > > > turn
>     > > > > > > > > it into a scalable, sharded metastore, that all
> engines can
>     > use
>     > > > to
>     > > > > > > obtain
>     > > > > > > > > any metadata.
>     > > > > > > > >
>     > > > > > > > > To this end, I propose we rebrand to "*Data Lake
> Platform*"
>     > as
>     > > > > > opposed
>     > > > > > > to
>     > > > > > > > > "ingests & manages storage of large analytical
> datasets over
>     > > DFS
>     > > > > > (hdfs
>     > > > > > > or
>     > > > > > > > > cloud stores)." and convey the scope of our vision,
>     > > > > > > > > given we have already been building towards that. It
> would
>     > also
>     > > > > > provide
>     > > > > > > > new
>     > > > > > > > > contributors a good lens to look at the project from.
>     > > > > > > > >
>     > > > > > > > > (This is very similar to, e.g., the evolution of
> Kafka
>     > from a
>     > > > > > pub-sub
>     > > > > > > > > system, to an event streaming platform - with addition
> of
>     > > > > > > > > MirrorMaker/Connect etc. )
>     > > > > > > > >
>     > > > > > > > > Please share your thoughts!
>     > > > > > > > >
>     > > > > > > > > Thanks
>     > > > > > > > > Vinoth
>     > > > > > > > >
>     > > > > > > >
>     > > > > > >
>     > > > > >
>     > > > >
>     > > >
>     > >
>     >
>
>
