Re: [DISCUSS] Hudi is the data lake platform

Mehrotra, Udit Tue, 13 Apr 2021 11:57:43 -0700

Agree with the rebranding, Vinoth. Hudi is not just a "table format" and we need 
to do justice to all the cool auxiliary features/services we have built.


Also, the timeline metadata service in particular would be a really big win if we 
move towards something like that.

On 4/13/21, 11:01 AM, "Pratyaksh Sharma" <[email protected]> wrote:

    Definitely we are doing much more than only ingesting and managing data
    over DFS.

    +1 from my side as well. :)

    On Tue, Apr 13, 2021 at 10:02 PM Susu Dong <[email protected]> wrote:

    > I love this rebranding. Totally agree. +1
    >
    > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <[email protected]>
    > wrote:
    >
    > > +1 The vision looks fantastic.
    > >
    > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li <[email protected]> wrote:
    > >
    > > > Awesome summary of Hudi! +1 as well.
    > > >
    > > > Gary Li
    > > > On 2021/04/13 14:13:24, Rubens Rodrigues <[email protected]>
    > > > wrote:
    > > > > Excellent, I agree
    > > > >
    > > > > On Tue, Apr 13, 2021 at 07:23, vino yang <[email protected]>
    > > > > wrote:
    > > > >
    > > > > > +1 Excited by this new vision!
    > > > > >
    > > > > > Best,
    > > > > > Vino
    > > > > >
    > > > > > On Tue, Apr 13, 2021 at 3:53 PM, Dianjin Wang
    > > > > > <[email protected]> wrote:
    > > > > >
    > > > > > > +1 The new brand is straightforward, a better description
    > > > > > > of Hudi.
    > > > > > >
    > > > > > > Best,
    > > > > > > Dianjin Wang
    > > > > > >
    > > > > > >
    > > > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha
    > > > > > > <[email protected]> wrote:
    > > > > > >
    > > > > > > > +1. Cannot agree more. I think this makes total sense and
    > > > > > > > will provide for a much better representation of the
    > > > > > > > project.
    > > > > > > >
    > > > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar
    > > > > > > > <[email protected]> wrote:
    > > > > > > >
    > > > > > > > > Hello all,
    > > > > > > > >
    > > > > > > > > Reading one more article today that positions Hudi as just
    > > > > > > > > a table format made me wonder whether we have done enough
    > > > > > > > > justice in explaining what we have built together here.
    > > > > > > > > I tend to think of Hudi as the data lake platform, which
    > > > > > > > > has the following components, of which one is a table
    > > > > > > > > format and one is a transactional storage layer.
    > > > > > > > > But the whole stack we have is definitely worth more than
    > > > > > > > > the sum of all the parts IMO (speaking from my own
    > > > > > > > > experience from the past 10+ years of open source software
    > > > > > > > > dev).
    > > > > > > > >
    > > > > > > > > Here's what we have built so far.
    > > > > > > > >
    > > > > > > > > a) *table format*: something that stores table schema, a
    > > > > > > > > metadata table that stores file listings today and is being
    > > > > > > > > extended to store column ranges and more in the future
    > > > > > > > > (RFC-27)
    > > > > > > > > b) *aux metadata*: bloom filters, external record-level
    > > > > > > > > indexes today, bitmaps/interval trees and other advanced
    > > > > > > > > on-disk data structures tomorrow
    > > > > > > > > c) *concurrency control*: we have always supported
    > > > > > > > > MVCC-based, log-based concurrency (serialize writes into a
    > > > > > > > > time-ordered log), and we now also have OCC for batch merge
    > > > > > > > > workloads with 0.8.0. We will have multi-table and fully
    > > > > > > > > non-blocking writers soon (see the future work section of
    > > > > > > > > RFC-22)
    > > > > > > > > d) *updates/deletes*: this is the bread-and-butter use-case
    > > > > > > > > for Hudi, but we support primary/unique key constraints and
    > > > > > > > > we could add foreign keys as an extension, once our
    > > > > > > > > transactions can span tables.
    > > > > > > > > e) *table services*: a Hudi pipeline today is
    > > > > > > > > self-managing: it sizes files, cleans, compacts, clusters
    > > > > > > > > data, and bootstraps existing data, with all these actions
    > > > > > > > > working off each other without blocking one another (for
    > > > > > > > > the most part).
    > > > > > > > > f) *data services*: we also have higher-level functionality
    > > > > > > > > with deltastreamer sources (scalable DFS listing source,
    > > > > > > > > Kafka, Pulsar is coming, ...and more), incremental ETL
    > > > > > > > > support, de-duplication, and commit callbacks; pre-commit
    > > > > > > > > validations are coming, and error tables have been
    > > > > > > > > proposed. I could also envision us building towards
    > > > > > > > > streaming egress and data monitoring.
    > > > > > > > >
    > > > > > > > > I also think we should build the following (subject to
    > > > > > > > > separate DISCUSS/RFCs)
    > > > > > > > >
    > > > > > > > > g) *caching service*: Hudi-specific caching service that
    > > > > > > > > can hold mutable data and serve oft-queried data across
    > > > > > > > > engines.
    > > > > > > > > h) *timeline metaserver*: We already run a metaserver in
    > > > > > > > > Spark writers/drivers, backed by RocksDB & even Hudi's
    > > > > > > > > metadata table. Let's turn it into a scalable, sharded
    > > > > > > > > metastore that all engines can use to obtain any metadata.
    > > > > > > > >
    > > > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*"
    > > > > > > > > as opposed to "ingests & manages storage of large
    > > > > > > > > analytical datasets over DFS (hdfs or cloud stores)", and
    > > > > > > > > convey the scope of our vision, given we have already been
    > > > > > > > > building towards that. It would also give new contributors
    > > > > > > > > a good lens through which to look at the project.
    > > > > > > > >
    > > > > > > > > (This is very similar to, e.g., the evolution of Kafka from
    > > > > > > > > a pub-sub system to an event streaming platform, with the
    > > > > > > > > addition of MirrorMaker/Connect etc.)
    > > > > > > > >
    > > > > > > > > Please share your thoughts!
    > > > > > > > >
    > > > > > > > > Thanks
    > > > > > > > > Vinoth
    > > > > > > > >
    > > > > > > >
    > > > > > >
    > > > > >
    > > > >
    > > >
    > >
    >
