Re: [DISCUSS] Hudi is the data lake platform

Vinoth Chandar Wed, 04 Aug 2021 10:36:24 -0700

Folks,

I have been digesting some feedback on what we show on the home page itself.


While the blog explains the vision, it might be good to bubble up the sub-areas
that are more relevant to our users today: transactions, updates, deletes.

So, I have raised a PR moving stuff around.

Now we lead with
- "Hudi brings transactions, record-level updates/deletes and change
streams to data lakes"

then explain the platform, in the next level of detail.

https://github.com/apache/hudi/pull/3406

On Mon, Aug 2, 2021 at 9:39 AM Vinoth Chandar <vin...@apache.org> wrote:

> Thanks! Will work on it this week.
> Also redoing some images based on feedback.
>
> On Fri, Jul 30, 2021 at 2:06 AM vino yang <yanghua1...@gmail.com> wrote:
>
>> +1
>>
>> Pratyaksh Sharma <pratyaks...@gmail.com> wrote on Fri, Jul 30, 2021 at 1:47 AM:
>>
>> > I guess we should rebrand Hudi in the README.md file as well?
>> > https://github.com/apache/hudi#readme
>> >
>> > This page still mentions the following -
>> >
>> > "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
>> > Incrementals. Hudi manages the storage of large analytical datasets on
>> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
>> >
>> > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar <vin...@apache.org> wrote:
>> >
>> >> Thanks Vino! Got a bunch of emoticons on the PR as well.
>> >>
>> >> Will land this Monday, giving it more time over the weekend as well.
>> >>
>> >>
>> >> On Wed, Jul 21, 2021 at 7:36 PM vino yang <yanghua1...@gmail.com> wrote:
>> >>
>> >> > Thanks vc
>> >> >
>> >> > Very good blog, in-depth and forward-looking. I learned a lot!
>> >> >
>> >> > Best,
>> >> > Vino
>> >> >
>> >> > Vinoth Chandar <vin...@apache.org> wrote on Thu, Jul 22, 2021 at 3:58 AM:
>> >> >
>> >> > > Expanding to users@ as well.
>> >> > >
>> >> > > Hi all,
>> >> > >
>> >> > > Since this discussion, I started to pen down a coherent strategy and
>> >> > > convey these ideas via a blog post.
>> >> > > I have also done my own research, talked to (ex)colleagues I respect
>> >> > > to get their take and refine it.
>> >> > >
>> >> > > Here's a blog that hopefully explains this vision.
>> >> > >
>> >> > > https://github.com/apache/hudi/pull/3322
>> >> > >
>> >> > > Look forward to your feedback on the PR. We are hoping to land this
>> >> > > early next week, if everyone is aligned.
>> >> > >
>> >> > > Thanks
>> >> > > Vinoth
>> >> > >
>> >> > > On Wed, Apr 21, 2021 at 9:01 PM wei li <lw309637...@gmail.com> wrote:
>> >> > >
>> >> > > > +1, cannot agree more.
>> >> > > > *aux metadata* and the metadata table can give Hudi large
>> >> > > > performance gains on the query side, and we can keep developing
>> >> > > > them.
>> >> > > > A cache service may be a necessary component in cloud-native
>> >> > > > environments.
>> >> > > >
>> >> > > > On 2021/04/13 05:29:55, Vinoth Chandar <vin...@apache.org> wrote:
>> >> > > > > Hello all,
>> >> > > > >
>> >> > > > > Reading one more article today that positions Hudi as just a
>> >> > > > > table format made me wonder whether we have done enough justice
>> >> > > > > in explaining what we have built together here.
>> >> > > > > I tend to think of Hudi as the data lake platform, which has the
>> >> > > > > following components, of which one is a table format and one is
>> >> > > > > a transactional storage layer.
>> >> > > > > But the whole stack we have is definitely worth more than the
>> >> > > > > sum of all the parts IMO (speaking from my own experience from
>> >> > > > > the past 10+ years of open source software dev).
>> >> > > > >
>> >> > > > > Here's what we have built so far.
>> >> > > > >
>> >> > > > > a) *table format*: something that stores the table schema, plus
>> >> > > > > a metadata table that stores file listings today and is being
>> >> > > > > extended to store column ranges and more in the future (RFC-27)
>> >> > > > > b) *aux metadata*: bloom filters and external record-level
>> >> > > > > indexes today; bitmaps/interval trees and other advanced on-disk
>> >> > > > > data structures tomorrow
>> >> > > > > c) *concurrency control*: we have always supported MVCC-based,
>> >> > > > > log-based concurrency (serializing writes into a time-ordered
>> >> > > > > log), and we now also have OCC for batch merge workloads with
>> >> > > > > 0.8.0. We will have multi-table and fully non-blocking writers
>> >> > > > > soon (see the future work section of RFC-22)
>> >> > > > > d) *updates/deletes*: this is the bread-and-butter use case for
>> >> > > > > Hudi, but we also support primary/unique key constraints, and we
>> >> > > > > could add foreign keys as an extension once our transactions can
>> >> > > > > span tables.
>> >> > > > > e) *table services*: a Hudi pipeline today is self-managing: it
>> >> > > > > sizes files, cleans, compacts, clusters data, and bootstraps
>> >> > > > > existing data, with all these actions working off each other
>> >> > > > > without blocking one another (for the most part).
>> >> > > > > f) *data services*: we also have higher-level functionality with
>> >> > > > > deltastreamer sources (scalable DFS listing source, Kafka,
>> >> > > > > Pulsar is coming, ...and more), incremental ETL support,
>> >> > > > > de-duplication, and commit callbacks; pre-commit validations are
>> >> > > > > coming, and error tables have been proposed. I could also
>> >> > > > > envision us building towards streaming egress and data
>> >> > > > > monitoring.
>> >> > > > >
>> >> > > > > I also think we should build the following (subject to separate
>> >> > > > > DISCUSS/RFCs)
>> >> > > > >
>> >> > > > > g) *caching service*: a Hudi-specific caching service that can
>> >> > > > > hold mutable data and serve oft-queried data across engines.
>> >> > > > > h) *timeline metaserver*: we already run a metaserver in Spark
>> >> > > > > writers/drivers, backed by RocksDB and even Hudi's metadata
>> >> > > > > table. Let's turn it into a scalable, sharded metastore that all
>> >> > > > > engines can use to obtain any metadata.
>> >> > > > >
>> >> > > > > To this end, I propose we rebrand to "*Data Lake Platform*", as
>> >> > > > > opposed to "ingests & manages storage of large analytical
>> >> > > > > datasets over DFS (hdfs or cloud stores)", and convey the scope
>> >> > > > > of our vision, given we have already been building towards that.
>> >> > > > > It would also provide new contributors a good lens to look at
>> >> > > > > the project from.
>> >> > > > >
>> >> > > > > (This is very similar to, e.g., the evolution of Kafka from a
>> >> > > > > pub-sub system to an event streaming platform, with the addition
>> >> > > > > of MirrorMaker/Connect etc.)
>> >> > > > >
>> >> > > > > Please share your thoughts!
>> >> > > > >
>> >> > > > > Thanks
>> >> > > > > Vinoth
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>>
>
