Expanding to users@ as well.

Hi all,
Since this discussion, I started to pen down a coherent strategy and
convey these ideas via a blog post. I have also done my own research and
talked to (ex-)colleagues I respect to get their take and refine it.
Here's a blog that hopefully explains this vision:
https://github.com/apache/hudi/pull/3322

Looking forward to your feedback on the PR. We are hoping to land this
early next week, if everyone is aligned.
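To make a few of the building blocks discussed below concrete, here is a
rough sketch (not production config - the table, field names, paths and
ZooKeeper settings are made up for illustration) of a single Spark writer
with the metadata table, OCC and inline compaction turned on, plus an
incremental pull on the read side:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-sketch").getOrCreate()

// Illustrative paths and field names
val basePath = "s3://my-bucket/hudi/trips"
val upserts = spark.read.json("s3://my-bucket/raw/trips/")

upserts.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  // d) record keys give primary/unique key semantics on the lake
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  // a) metadata table serves file listings without DFS list calls
  .option("hoodie.metadata.enable", "true")
  // c) OCC across concurrent writers, new in 0.8.0 (ZK values illustrative)
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
  .option("hoodie.write.lock.zookeeper.url", "zk1")
  .option("hoodie.write.lock.zookeeper.port", "2181")
  .option("hoodie.write.lock.zookeeper.lock_key", "trips")
  .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
  // e) table services ride the same timeline; compaction here runs inline
  .option("hoodie.compact.inline", "true")
  .mode(SaveMode.Append)
  .save(basePath)

// f) incremental ETL: read only records committed after a given instant
val incr = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210401000000")
  .load(basePath)

The point being: a) through f) below are not separate systems to operate.
The writer above gets its listings from the metadata table, takes a
table-level lock only around commit, and schedules its own compaction,
all off the same timeline.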
Thanks
Vinoth

On Wed, Apr 21, 2021 at 9:01 PM wei li <lw309637...@gmail.com> wrote:

> +1, cannot agree more.
> *aux metadata* and the metadata table can give Hudi large performance
> optimizations on the query side, and can keep being developed further.
> A cache service may be a necessary component in cloud-native
> environments.
>
> On 2021/04/13 05:29:55, Vinoth Chandar <vin...@apache.org> wrote:
> > Hello all,
> >
> > Reading one more article today positioning Hudi as just a table
> > format made me wonder whether we have done enough justice to
> > explaining what we have built together here.
> > I tend to think of Hudi as a data lake platform with the following
> > components, of which one is a table format and one is a transactional
> > storage layer.
> > But the whole stack we have is definitely worth more than the sum of
> > its parts, IMO (speaking from my own experience over the past 10+
> > years of open source software development).
> >
> > Here's what we have built so far.
> >
> > a) *table format*: something that stores the table schema, plus a
> > metadata table that stores file listings today and is being extended
> > to store column ranges and more in the future (RFC-27)
> > b) *aux metadata*: bloom filters and external record-level indexes
> > today; bitmaps/interval trees and other advanced on-disk data
> > structures tomorrow
> > c) *concurrency control*: we have always supported MVCC-based, log-
> > based concurrency (serializing writes into a time-ordered log), and
> > with 0.8.0 we also have OCC for batch merge workloads. We will have
> > multi-table and fully non-blocking writers soon (see the future work
> > section of RFC-22)
> > d) *updates/deletes*: this is the bread-and-butter use case for Hudi;
> > we support primary/unique key constraints, and we could add foreign
> > keys as an extension once our transactions can span tables.
> > e) *table services*: a Hudi pipeline today is self-managing - it
> > sizes files, cleans, compacts, clusters data, and bootstraps existing
> > data - with all these actions working off each other without blocking
> > one another (for the most part).
> > f) *data services*: we also have higher-level functionality, with
> > DeltaStreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > coming, ...and more), incremental ETL support, de-duplication, and
> > commit callbacks; pre-commit validations are coming, and error tables
> > have been proposed. I could also envision us building towards
> > streaming egress and data monitoring.
> >
> > I also think we should build the following (subject to separate
> > DISCUSS threads/RFCs):
> >
> > g) *caching service*: a Hudi-specific caching service that can hold
> > mutable data and serve oft-queried data across engines.
> > h) *timeline metaserver*: we already run a metaserver in Spark
> > writers/drivers, backed by RocksDB and even Hudi's metadata table.
> > Let's turn it into a scalable, sharded metastore that all engines can
> > use to obtain any metadata.
> >
> > To this end, I propose we rebrand to a "*Data Lake Platform*", as
> > opposed to "ingests & manages storage of large analytical datasets
> > over DFS (hdfs or cloud stores)", to convey the scope of our vision,
> > given we have already been building towards that. It would also give
> > new contributors a good lens through which to look at the project.
> >
> > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > system to an event streaming platform - with the addition of
> > MirrorMaker/Connect etc.)
> >
> > Please share your thoughts!
> >
> > Thanks
> > Vinoth