Hello all,

Reading yet another article today that positions Hudi as just a table format made me wonder whether we have done enough justice in explaining what we have built together here. I tend to think of Hudi as a data lake platform with the following components, of which one is a table format and one is a transactional storage layer. But the whole stack we have is definitely worth more than the sum of its parts, IMO (speaking from my own experience over the past 10+ years of open source software dev).

Here's what we have built so far.

a) *table format*: something that stores the table schema, plus a metadata table that stores file listings today and is being extended to store column ranges and more in the future (RFC-27).

b) *aux metadata*: bloom filters and external record-level indexes today; bitmaps/interval trees and other advanced on-disk data structures tomorrow.

c) *concurrency control*: we have always supported MVCC-based, log-based concurrency (serialize writes into a time-ordered log), and with 0.8.0 we now also have OCC for batch merge workloads. We will have multi-table and fully non-blocking writers soon (see the future work section of RFC-22).

d) *updates/deletes*: this is the bread-and-butter use case for Hudi, but we also support primary/unique key constraints, and we could add foreign keys as an extension once our transactions can span tables.

e) *table services*: a Hudi pipeline today is self-managing - it sizes files, cleans, compacts, clusters data, and bootstraps existing data - with all these actions working off each other without blocking one another (for the most part).

f) *data services*: we also have higher-level functionality with deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is coming, ...and more), incremental ETL support, de-duplication, and commit callbacks; pre-commit validations are coming, and error tables have been proposed. I could also envision us building towards streaming egress and data monitoring.

I also think we should build the following (subject to separate DISCUSS/RFCs):

g) *caching service*: a Hudi-specific caching service that can hold mutable data and serve oft-queried data across engines.

h) *timeline metaserver*: we already run a metaserver in Spark writers/drivers, backed by RocksDB and even Hudi's metadata table. Let's turn it into a scalable, sharded metastore that all engines can use to obtain any metadata.

To this end, I propose we rebrand to "*Data Lake Platform*", as opposed to "ingests & manages storage of large analytical datasets over DFS (hdfs or cloud stores).", and convey the scope of our vision, given that we have already been building towards it. It would also give new contributors a good lens through which to look at the project. (This is very similar to, e.g., the evolution of Kafka from a pub-sub system to an event streaming platform, with the addition of MirrorMaker/Connect etc.)

Please share your thoughts!

Thanks
Vinoth