Re: [DISCUSS] Hudi is the data lake platform

vino yang Fri, 30 Jul 2021 02:06:42 -0700

+1

On Fri, Jul 30, 2021 at 1:47 AM Pratyaksh Sharma <pratyaks...@gmail.com> wrote:


> I guess we should rebrand Hudi in the README.md file as well?
> https://github.com/apache/hudi#readme
>
> This page still mentions the following -
>
> "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> Incrementals. Hudi manages the storage of large analytical datasets on
> DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
>
> On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar <vin...@apache.org> wrote:
>
>> Thanks Vino! Got a bunch of emoticons on the PR as well.
>>
>> Will land this Monday, giving it more time over the weekend as well.
>>
>>
>> On Wed, Jul 21, 2021 at 7:36 PM vino yang <yanghua1...@gmail.com> wrote:
>>
>> > Thanks vc
>> >
>> > Very good blog, in-depth and forward-looking. Learned!
>> >
>> > Best,
>> > Vino
>> >
>> > On Thu, Jul 22, 2021 at 3:58 AM Vinoth Chandar <vin...@apache.org> wrote:
>> >
>> > > Expanding to users@ as well.
>> > >
>> > > Hi all,
>> > >
>> > > Since this discussion, I started to pen down a coherent strategy and
>> > > convey these ideas via a blog post.
>> > > I have also done my own research, talked to (ex)colleagues I respect
>> > > to get their take and refine it.
>> > >
>> > > Here's a blog that hopefully explains this vision.
>> > >
>> > > https://github.com/apache/hudi/pull/3322
>> > >
>> > > Look forward to your feedback on the PR. We are hoping to land this
>> > > early next week, if everyone is aligned.
>> > >
>> > > Thanks
>> > > Vinoth
>> > >
>> > > On Wed, Apr 21, 2021 at 9:01 PM wei li <lw309637...@gmail.com> wrote:
>> > >
>> > > > +1, cannot agree more.
>> > > > *aux metadata* and the metadata table can give Hudi large performance
>> > > > optimizations on the query side.
>> > > > We can keep developing these.
>> > > > A cache service may be a necessary component in a cloud-native
>> > > > environment.
>> > > >
>> > > > On 2021/04/13 05:29:55, Vinoth Chandar <vin...@apache.org> wrote:
>> > > > > Hello all,
>> > > > >
>> > > > > Reading one more article today, positioning Hudi as just a table
>> > > > > format, made me wonder if we have done enough justice in explaining
>> > > > > what we have built together here.
>> > > > > I tend to think of Hudi as the data lake platform, which has the
>> > > > > following components, of which one is a table format and one is a
>> > > > > transactional storage layer.
>> > > > > But the whole stack we have is definitely worth more than the sum of
>> > > > > all the parts IMO (speaking from my own experience from the past 10+
>> > > > > years of open source software dev).
>> > > > >
>> > > > > Here's what we have built so far.
>> > > > >
>> > > > > a) *table format*: something that stores the table schema, plus a
>> > > > > metadata table that stores file listings today and is being extended
>> > > > > to store column ranges and more in the future (RFC-27)
>> > > > > b) *aux metadata*: bloom filters, external record level indexes
>> > > > > today, bitmaps/interval trees and other advanced on-disk data
>> > > > > structures tomorrow
>> > > > > c) *concurrency control*: we have always supported MVCC-based,
>> > > > > log-based concurrency (serialize writes into a time-ordered log), and
>> > > > > we now also have OCC for batch merge workloads with 0.8.0. We will
>> > > > > have multi-table and fully non-blocking writers soon (see the future
>> > > > > work section of RFC-22)
>> > > > > d) *updates/deletes*: this is the bread-and-butter use-case for
>> > > > > Hudi, but we also support primary/unique key constraints, and we
>> > > > > could add foreign keys as an extension once our transactions can
>> > > > > span tables.
>> > > > > e) *table services*: a Hudi pipeline today is self-managing - it
>> > > > > sizes files, cleans, compacts, clusters data, bootstraps existing
>> > > > > data - with all these actions working off each other without blocking
>> > > > > one another (for the most part).
>> > > > > f) *data services*: we also have higher level functionality with
>> > > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
>> > > > > coming, ...and more), incremental ETL support, de-duplication, and
>> > > > > commit callbacks; pre-commit validations are coming, and error tables
>> > > > > have been proposed. I could also envision us building towards
>> > > > > streaming egress and data monitoring.
>> > > > >
>> > > > > I also think we should build the following (subject to separate
>> > > > > DISCUSS/RFCs)
>> > > > >
>> > > > > g) *caching service*: a Hudi-specific caching service that can hold
>> > > > > mutable data and serve oft-queried data across engines.
>> > > > > h) *timeline metaserver*: We already run a metaserver in Spark
>> > > > > writers/drivers, backed by RocksDB & even Hudi's metadata table.
>> > > > > Let's turn it into a scalable, sharded metastore that all engines
>> > > > > can use to obtain any metadata.
>> > > > >
>> > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
>> > > > > opposed to "ingests & manages storage of large analytical datasets
>> > > > > over DFS (hdfs or cloud stores)." and convey the scope of our vision,
>> > > > > given we have already been building towards that. It would also
>> > > > > provide new contributors a good lens to look at the project from.
>> > > > >
>> > > > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
>> > > > > system to an event streaming platform, with the addition of
>> > > > > MirrorMaker/Connect etc.)
>> > > > >
>> > > > > Please share your thoughts!
>> > > > >
>> > > > > Thanks
>> > > > > Vinoth
>> > > > >
>> > > >
>> > >
>> >
>>
>
