Thanks vc,

Very good blog, in-depth and forward-looking. Learned!

Best,
Vino

Vinoth Chandar <vin...@apache.org> wrote on Thu, Jul 22, 2021 at 3:58 AM:

> Expanding to users@ as well.
>
> Hi all,
>
> Since this discussion, I started to pen down a coherent strategy and convey
> these ideas via a blog post.
> I have also done my own research, talked to (ex)colleagues I respect to get
> their take and refine it.
>
> Here's a blog that hopefully explains this vision.
>
> https://github.com/apache/hudi/pull/3322
>
> Look forward to your feedback on the PR. We are hoping to land this early
> next week, if everyone is aligned.
>
> Thanks
> Vinoth
>
> On Wed, Apr 21, 2021 at 9:01 PM wei li <lw309637...@gmail.com> wrote:
>
> > +1, cannot agree more.
> > *aux metadata* and the metatable can give Hudi large performance
> > optimizations on the query end.
> > Can continue to develop.
> > A cache service may be the necessary component in a cloud native
> > environment.
> >
> > On 2021/04/13 05:29:55, Vinoth Chandar <vin...@apache.org> wrote:
> > > Hello all,
> > >
> > > Reading one more article today, positioning Hudi, as just a table
> > > format, made me wonder, if we have done enough justice in explaining
> > > what we have built together here.
> > > I tend to think of Hudi as the data lake platform, which has the
> > > following components, of which - one is a table format, one is a
> > > transactional storage layer.
> > > But the whole stack we have is definitely worth more than the sum of
> > > all the parts IMO (speaking from my own experience from the past 10+
> > > years of open source software dev).
> > >
> > > Here's what we have built so far.
> > >
> > > a) *table format* : something that stores table schema, a metadata
> > > table that stores file listing today, and being extended to store
> > > column ranges and more in the future (RFC-27)
> > > b) *aux metadata* : bloom filters, external record level indexes today,
> > > bitmaps/interval trees and other advanced on-disk data structures
> > > tomorrow
> > > c) *concurrency control* : we always supported MVCC based log based
> > > concurrency (serialize writes into a time ordered log), and we now also
> > > have OCC for batch merge workloads with 0.8.0. We will have multi-table
> > > and fully non-blocking writers soon (see future work section of RFC-22)
> > > d) *updates/deletes* : this is the bread-and-butter use-case for Hudi,
> > > but we support primary/unique key constraints and we could add foreign
> > > keys as an extension, once our transactions can span tables.
> > > e) *table services*: a hudi pipeline today is self-managing - sizes
> > > files, cleans, compacts, clusters data, bootstraps existing data - all
> > > these actions working off each other without blocking one another.
> > > (for most parts).
> > > f) *data services*: we also have higher level functionality with
> > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > callbacks, pre-commit validations are coming, error tables have been
> > > proposed. I could also envision us building towards streaming egress,
> > > data monitoring.
> > >
> > > I also think we should build the following (subject to separate
> > > DISCUSS/RFCs)
> > >
> > > g) *caching service*: Hudi specific caching service that can hold
> > > mutable data and serve oft-queried data across engines.
> > > h) *timeline metaserver*: We already run a metaserver in spark
> > > writer/drivers, backed by rocksDB & even Hudi's metadata table.
> > > Let's turn it into a scalable, sharded metastore, that all engines can
> > > use to obtain any metadata.
> > >
> > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed
> > > to "ingests & manages storage of large analytical datasets over DFS
> > > (hdfs or cloud stores)." and convey the scope of our vision,
> > > given we have already been building towards that. It would also provide
> > > new contributors a good lens to look at the project from.
> > >
> > > (This is very similar to for e.g, the evolution of Kafka from a pub-sub
> > > system, to an event streaming platform - with addition of
> > > MirrorMaker/Connect etc. )
> > >
> > > Please share your thoughts!
> > >
> > > Thanks
> > > Vinoth
> > >