Re: [DISCUSS] Hudi is the data lake platform

vino yang Tue, 13 Apr 2021 03:24:09 -0700

+1 Excited by this new vision!

Best,
Vino


On Tue, Apr 13, 2021 at 3:53 PM, Dianjin Wang <djw...@streamnative.io.invalid> wrote:

> +1  The new brand is straightforward, a better description of Hudi.
>
> Best,
> Dianjin Wang
>
>
> On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <bhavanisud...@gmail.com>
> wrote:
>
> > +1. Cannot agree more. I think this makes total sense and will provide for
> > a much better representation of the project.
> >
> > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <vin...@apache.org> wrote:
> >
> > > Hello all,
> > >
> > > > Reading one more article today that positions Hudi as just a table
> > > > format made me wonder whether we have done enough justice in
> > > > explaining what we have built together here.
> > > > I tend to think of Hudi as the data lake platform, which has the
> > > > following components, of which one is a table format and one is a
> > > > transactional storage layer.
> > > > But the whole stack we have is definitely worth more than the sum of
> > > > all the parts IMO (speaking from my own experience from the past 10+
> > > > years of open source software dev).
> > >
> > > Here's what we have built so far.
> > >
> > > > a) *table format*: something that stores the table schema, plus a
> > > > metadata table that stores file listings today and is being extended
> > > > to store column ranges and more in the future (RFC-27)
> > > > b) *aux metadata*: bloom filters and external record-level indexes
> > > > today; bitmaps/interval trees and other advanced on-disk data
> > > > structures tomorrow
> > > > c) *concurrency control*: we have always supported MVCC-based,
> > > > log-based concurrency control (serializing writes into a time-ordered
> > > > log), and we now also have OCC for batch merge workloads with 0.8.0.
> > > > We will have multi-table and fully non-blocking writers soon (see the
> > > > future work section of RFC-22)
> > > > d) *updates/deletes*: this is the bread-and-butter use case for Hudi,
> > > > but we also support primary/unique key constraints, and we could add
> > > > foreign keys as an extension once our transactions can span tables.
> > > > e) *table services*: a Hudi pipeline today is self-managing: it sizes
> > > > files, cleans, compacts, clusters data, and bootstraps existing data,
> > > > with all these actions working off each other without blocking one
> > > > another (for the most part).
> > > > f) *data services*: we also have higher-level functionality with
> > > > deltastreamer sources (scalable DFS listing source, Kafka; Pulsar is
> > > > coming, ...and more), incremental ETL support, de-duplication, and
> > > > commit callbacks. Pre-commit validations are coming, and error tables
> > > > have been proposed. I could also envision us building towards
> > > > streaming egress and data monitoring.
> > >
> > > I also think we should build the following (subject to separate
> > > DISCUSS/RFCs)
> > >
> > > > g) *caching service*: a Hudi-specific caching service that can hold
> > > > mutable data and serve oft-queried data across engines.
> > > > h) *timeline metaserver*: we already run a metaserver in Spark
> > > > writer/drivers, backed by RocksDB and even Hudi's metadata table.
> > > > Let's turn it into a scalable, sharded metastore that all engines can
> > > > use to obtain any metadata.
> > >
> > > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed
> > > > to "ingests & manages storage of large analytical datasets over DFS
> > > > (hdfs or cloud stores)." and convey the scope of our vision, given we
> > > > have already been building towards that. It would also provide new
> > > > contributors a good lens to look at the project from.
> > >
> > > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > > > system to an event streaming platform, with the addition of
> > > > MirrorMaker/Connect etc.)
> > >
> > > Please share your thoughts!
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>
