I love this rebranding. Totally agree. +1

On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu <xu.shiyan.raym...@gmail.com> wrote:

> +1 The vision looks fantastic.
>
> On Tue, Apr 13, 2021 at 7:45 AM Gary Li <gar...@apache.org> wrote:
>
> > Awesome summary of Hudi! +1 as well.
> >
> > Gary Li
> >
> > On 2021/04/13 14:13:24, Rubens Rodrigues <rubenssoto2...@gmail.com> wrote:
> >
> > > Excellent, I agree
> > >
> > > On Tue, Apr 13, 2021 at 07:23, vino yang <yanghua1...@gmail.com> wrote:
> > >
> > > > +1 Excited by this new vision!
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang <djw...@streamnative.io.invalid> wrote:
> > > >
> > > > > +1 The new brand is straightforward, a better description of Hudi.
> > > > >
> > > > > Best,
> > > > > Dianjin Wang
> > > > >
> > > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha <bhavanisud...@gmail.com> wrote:
> > > > >
> > > > > > +1. Cannot agree more. I think this makes total sense and will
> > > > > > provide for a much better representation of the project.
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar <vin...@apache.org> wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > Reading one more article today that positions Hudi as just a table
> > > > > > > format made me wonder whether we have done enough justice in
> > > > > > > explaining what we have built together here.
> > > > > > > I tend to think of Hudi as the data lake platform, which has the
> > > > > > > following components, of which one is a table format and one is a
> > > > > > > transactional storage layer.
> > > > > > > But the whole stack we have is definitely worth more than the sum of
> > > > > > > all the parts IMO (speaking from my own experience from the past 10+
> > > > > > > years of open source software dev).
> > > > > > >
> > > > > > > Here's what we have built so far.
> > > > > > >
> > > > > > > a) *table format*: something that stores the table schema, plus a
> > > > > > > metadata table that stores file listings today and is being extended
> > > > > > > to store column ranges and more in the future (RFC-27)
> > > > > > > b) *aux metadata*: bloom filters and external record-level indexes
> > > > > > > today; bitmaps/interval trees and other advanced on-disk data
> > > > > > > structures tomorrow
> > > > > > > c) *concurrency control*: we have always supported MVCC-based,
> > > > > > > log-based concurrency (serialize writes into a time-ordered log), and
> > > > > > > we now also have OCC for batch merge workloads with 0.8.0. We will
> > > > > > > have multi-table and fully non-blocking writers soon (see the future
> > > > > > > work section of RFC-22)
> > > > > > > d) *updates/deletes*: this is the bread-and-butter use-case for Hudi,
> > > > > > > but we also support primary/unique key constraints, and we could add
> > > > > > > foreign keys as an extension once our transactions can span tables.
> > > > > > > e) *table services*: a Hudi pipeline today is self-managing - it sizes
> > > > > > > files, cleans, compacts, clusters data, bootstraps existing data -
> > > > > > > with all these actions working off each other without blocking one
> > > > > > > another (for the most part).
> > > > > > > f) *data services*: we also have higher-level functionality with
> > > > > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > > > > coming, ...and more), incremental ETL support, de-duplication, and
> > > > > > > commit callbacks; pre-commit validations are coming, and error tables
> > > > > > > have been proposed. I could also envision us building towards
> > > > > > > streaming egress and data monitoring.
> > > > > > >
> > > > > > > I also think we should build the following (subject to separate
> > > > > > > DISCUSS/RFCs)
> > > > > > >
> > > > > > > g) *caching service*: a Hudi-specific caching service that can hold
> > > > > > > mutable data and serve oft-queried data across engines.
> > > > > > > h) *timeline metaserver*: we already run a metaserver in Spark
> > > > > > > writers/drivers, backed by RocksDB & even Hudi's metadata table. Let's
> > > > > > > turn it into a scalable, sharded metastore that all engines can use
> > > > > > > to obtain any metadata.
> > > > > > >
> > > > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed
> > > > > > > to "ingests & manages storage of large analytical datasets over DFS
> > > > > > > (hdfs or cloud stores)" and convey the scope of our vision, given we
> > > > > > > have already been building towards that. It would also provide new
> > > > > > > contributors a good lens to look at the project from.
> > > > > > >
> > > > > > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > > > > > > system to an event streaming platform, with the addition of
> > > > > > > MirrorMaker/Connect etc.)
> > > > > > >
> > > > > > > Please share your thoughts!
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth