Thanks! Will work on it this week. Also redoing some images based on feedback.
On Fri, Jul 30, 2021 at 2:06 AM vino yang <[email protected]> wrote:

> +1
>
> Pratyaksh Sharma <[email protected]> wrote on Fri, Jul 30, 2021 at 1:47 AM:
>
> > Guess we should rebrand Hudi in the README.md file as well -
> > https://github.com/apache/hudi#readme?
> >
> > This page still mentions the following -
> >
> > "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> > Incrementals. Hudi manages the storage of large analytical datasets on
> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
> >
> > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar <[email protected]> wrote:
> >
> >> Thanks Vino! Got a bunch of emoticons on the PR as well.
> >>
> >> Will land this Monday, giving it more time over the weekend as well.
> >>
> >> On Wed, Jul 21, 2021 at 7:36 PM vino yang <[email protected]> wrote:
> >>
> >> > Thanks vc
> >> >
> >> > Very good blog, in-depth and forward-looking. Learned!
> >> >
> >> > Best,
> >> > Vino
> >> >
> >> > Vinoth Chandar <[email protected]> wrote on Thu, Jul 22, 2021 at 3:58 AM:
> >> >
> >> > > Expanding to users@ as well.
> >> > >
> >> > > Hi all,
> >> > >
> >> > > Since this discussion, I started to pen down a coherent strategy
> >> > > and convey these ideas via a blog post.
> >> > > I have also done my own research, talked to (ex)colleagues I
> >> > > respect to get their take and refine it.
> >> > >
> >> > > Here's a blog that hopefully explains this vision.
> >> > >
> >> > > https://github.com/apache/hudi/pull/3322
> >> > >
> >> > > Look forward to your feedback on the PR. We are hoping to land
> >> > > this early next week, if everyone is aligned.
> >> > >
> >> > > Thanks
> >> > > Vinoth
> >> > >
> >> > > On Wed, Apr 21, 2021 at 9:01 PM wei li <[email protected]> wrote:
> >> > >
> >> > > > +1, cannot agree more.
> >> > > > *aux metadata* and the metatable can give Hudi a large
> >> > > > performance optimization on the query end.
> >> > > > We can keep developing these.
> >> > > > A cache service may be a necessary component in a cloud native
> >> > > > environment.
> >> > > >
> >> > > > On 2021/04/13 05:29:55, Vinoth Chandar <[email protected]> wrote:
> >> > > > > Hello all,
> >> > > > >
> >> > > > > Reading one more article today, positioning Hudi as just a
> >> > > > > table format, made me wonder if we have done enough justice
> >> > > > > in explaining what we have built together here.
> >> > > > > I tend to think of Hudi as the data lake platform, which has
> >> > > > > the following components, of which one is a table format and
> >> > > > > one is a transactional storage layer.
> >> > > > > But the whole stack we have is definitely worth more than the
> >> > > > > sum of all the parts IMO (speaking from my own experience from
> >> > > > > the past 10+ years of open source software dev).
> >> > > > >
> >> > > > > Here's what we have built so far.
> >> > > > >
> >> > > > > a) *table format*: something that stores table schema, a
> >> > > > > metadata table that stores file listings today, and is being
> >> > > > > extended to store column ranges and more in the future (RFC-27)
> >> > > > > b) *aux metadata*: bloom filters, external record level
> >> > > > > indexes today, bitmaps/interval trees and other advanced
> >> > > > > on-disk data structures tomorrow
> >> > > > > c) *concurrency control*: we always supported MVCC based, log
> >> > > > > based concurrency (serialize writes into a time ordered log),
> >> > > > > and we now also have OCC for batch merge workloads with 0.8.0.
> >> > > > > We will have multi-table and fully non-blocking writers soon
> >> > > > > (see future work section of RFC-22)
> >> > > > > d) *updates/deletes*: this is the bread-and-butter use-case
> >> > > > > for Hudi, but we support primary/unique key constraints and we
> >> > > > > could add foreign keys as an extension, once our transactions
> >> > > > > can span tables.
> >> > > > > e) *table services*: a Hudi pipeline today is self-managing -
> >> > > > > sizes files, cleans, compacts, clusters data, bootstraps
> >> > > > > existing data - all these actions working off each other
> >> > > > > without blocking one another (for the most part).
> >> > > > > f) *data services*: we also have higher level functionality
> >> > > > > with deltastreamer sources (scalable DFS listing source, Kafka,
> >> > > > > Pulsar is coming, ...and more), incremental ETL support,
> >> > > > > de-duplication, commit callbacks; pre-commit validations are
> >> > > > > coming, error tables have been proposed. I could also envision
> >> > > > > us building towards streaming egress, data monitoring.
> >> > > > >
> >> > > > > I also think we should build the following (subject to
> >> > > > > separate DISCUSS/RFCs)
> >> > > > >
> >> > > > > g) *caching service*: Hudi specific caching service that can
> >> > > > > hold mutable data and serve oft-queried data across engines.
> >> > > > > h) *timeline metaserver*: We already run a metaserver in spark
> >> > > > > writer/drivers, backed by RocksDB & even Hudi's metadata table.
> >> > > > > Let's turn it into a scalable, sharded metastore, that all
> >> > > > > engines can use to obtain any metadata.
> >> > > > >
> >> > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> >> > > > > opposed to "ingests & manages storage of large analytical
> >> > > > > datasets over DFS (hdfs or cloud stores)." and convey the scope
> >> > > > > of our vision, given we have already been building towards
> >> > > > > that. It would also provide new contributors a good lens to
> >> > > > > look at the project from.
> >> > > > >
> >> > > > > (This is very similar to, e.g., the evolution of Kafka from a
> >> > > > > pub-sub system to an event streaming platform - with the
> >> > > > > addition of MirrorMaker/Connect etc.)
> >> > > > >
> >> > > > > Please share your thoughts!
> >> > > > >
> >> > > > > Thanks
> >> > > > > Vinoth
