Hello all,

Reading one more article today positioning Hudi as just a table format
made me wonder if we have done enough justice to explaining what we
have built together here.
I tend to think of Hudi as a data lake platform made up of the
following components, of which one is a table format and another is a
transactional storage layer.
But the whole stack we have is definitely worth more than the sum of
its parts, IMO (speaking from my own experience over the past 10+
years of open source software development).

Here's what we have built so far.

a) *table format*: something that stores the table schema, plus a
metadata table that stores file listings today and is being extended
to store column ranges and more in the future (RFC-27)
b) *aux metadata*: bloom filters and external record-level indexes
today; bitmaps/interval trees and other advanced on-disk data
structures tomorrow
c) *concurrency control*: we have always supported MVCC-based log
concurrency (writes are serialized into a time-ordered log), and with
0.8.0 we also have OCC for batch merge workloads. We will have
multi-table transactions and fully non-blocking writers soon (see the
future work section of RFC-22)
d) *updates/deletes*: this is the bread-and-butter use case for Hudi;
we support primary/unique key constraints today, and we could add
foreign keys as an extension once our transactions can span tables (a
sketch of an upsert, with the OCC and table service knobs from (c)
and (e), follows this list)
e) *table services*: a Hudi pipeline today is self-managing: it sizes
files, cleans, compacts, clusters data, and bootstraps existing data,
with all these actions working off each other without blocking one
another (for the most part)
f) *data services*: we also have higher-level functionality, with
deltastreamer sources (a scalable DFS listing source, Kafka, Pulsar is
coming, ...and more), incremental ETL support, de-duplication, and
commit callbacks; pre-commit validations are coming, and error tables
have been proposed. I could also envision us building towards
streaming egress and data monitoring.
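
To make (c), (d) and (e) a bit more concrete, here is a rough sketch
of a multi-writer-safe upsert through the Spark datasource. The option
keys are from the 0.8.0 docs (please double check them there); the
table name, columns, paths and ZK quorum are made up, and a real setup
needs the full lock provider config:

import org.apache.spark.sql.SaveMode

// hypothetical batch of incoming changes
val updates = spark.read.json("/tmp/trip_updates")

updates.write.format("hudi").
  // (d) primary key semantics: the record key identifies rows, and the
  // precombine field picks the latest version when de-duplicating updates
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "date").
  option("hoodie.datasource.write.operation", "upsert").
  // (c) OCC (new in 0.8.0) so concurrent batch writers don't clobber
  // each other; needs a lock provider plus its remaining configs
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  // port, lock key and base path are separate keys, omitted here
  option("hoodie.write.lock.zookeeper.url", "zk1,zk2,zk3").
  // (e) table services ride along with the write, e.g. inline compaction
  // for merge-on-read tables
  option("hoodie.compact.inline", "true").
  option("hoodie.table.name", "trips").
  mode(SaveMode.Append).
  save("/data/trips")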

I also think we should build the following (each subject to its own
DISCUSS/RFC):

g) *caching service*: a Hudi-specific caching service that can hold
mutable data and serve oft-queried data across engines.
h) *timeline metaserver*: we already run a metaserver in the Spark
writers/drivers, backed by RocksDB and even Hudi's metadata table.
Let's turn it into a scalable, sharded metastore that all engines can
use to obtain any metadata.

To this end, I propose we rebrand to a "*Data Lake Platform*", as
opposed to "ingests & manages storage of large analytical datasets
over DFS (hdfs or cloud stores)", to convey the scope of our vision,
given we have already been building towards it. It would also give new
contributors a good lens through which to look at the project.

(This is very similar to, e.g., the evolution of Kafka from a pub-sub
system to an event streaming platform, with the addition of
MirrorMaker/Connect etc.)

Please share your thoughts!

Thanks
Vinoth
