Hello all,

I wanted to raise an important topic with the community: whether we
should rename some of the terminology in our code/docs to be more
user-friendly and understandable.

Let me also provide some context for each, since I am probably guilty of
introducing most of them in the first place :).

*1. Rename "views" to "query" :* Instead of saying incremental view or
read-optimized view, talk about them as "incremental query" and
"read-optimized query". The term "view" is very technical, and what I was
trying to convey was that we ingest/store the data once and expose views on
top. But new users (at least half a dozen of them, to me) tend to confuse
this with views/materialized views found in databases. We almost always
talk about views in terms of the expected behavior of a query on the view.
I am proposing to just call these different query types, since it's a more
universally accepted terminology and IMO clearer.

*2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
Read-Optimized view only for MOR storage :* This one is probably the
trickiest. Hudi was always designed with MOR in mind, even as we were
working on COW storage, and consequently we named the pure parquet backed
view Read-Optimized, hoping to name the parquet + avro based view
Write-Optimized. However, we opted to name it Realtime to emphasize the
data freshness aspect. In retrospect, the views should not have been named
after their performance characteristics but rather after the classes of
queries done on them and the guarantees for those (point #1 above).
Moreover, once we have parquet embedded into the log format, the tradeoffs
may not be the same anyway.

So, combining this with the renaming proposed in #1, we would end up with
the following:

Copy-On-Write :
[Old] Read-Optimized View => [New] Snapshot Query
[Old] Incremental View => [New] Incremental Query

Merge-On-Read:
[Old] Realtime View => [New] Snapshot Query
[Old] Incremental View => [New] Incremental Query
[Old] Read-Optimized View => [New] Read-Optimized Query (since it is always
read optimized compared to the Snapshot query, at the cost of staler data)

Both changes #1 & #2 could be simpler changes, touching just code
references, docs and configs. We can support both strings for some time and
deprecate the old ones eventually, since queries are stateless.
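
To make the "support both strings" idea concrete, here is a minimal sketch
in Python. It is purely illustrative: the names NEW_QUERY_TYPES,
LEGACY_VIEW_TO_QUERY and resolve_query_type are hypothetical and are not
existing Hudi code or config keys. It only encodes the mapping listed above
and warns when an old view name is used.

```python
import warnings

# Hypothetical names for illustration; none of this is existing Hudi code.
# Query types that would be valid under the proposed naming, per storage type.
NEW_QUERY_TYPES = {
    "COPY_ON_WRITE": {"SNAPSHOT", "INCREMENTAL"},
    "MERGE_ON_READ": {"SNAPSHOT", "INCREMENTAL", "READ_OPTIMIZED"},
}

# Old view names that are renamed under the proposal.
LEGACY_VIEW_TO_QUERY = {
    ("COPY_ON_WRITE", "READ_OPTIMIZED"): "SNAPSHOT",
    ("MERGE_ON_READ", "REALTIME"): "SNAPSHOT",
}


def resolve_query_type(storage_type: str, name: str) -> str:
    """Map an old view name or a new query type name to the new query type."""
    if storage_type not in NEW_QUERY_TYPES:
        raise ValueError(f"unknown storage type: {storage_type!r}")
    # Accept "read-optimized", "Read_Optimized", etc.
    key = name.strip().upper().replace("-", "_")
    if key in NEW_QUERY_TYPES[storage_type]:
        return key
    legacy = LEGACY_VIEW_TO_QUERY.get((storage_type, key))
    if legacy is None:
        raise ValueError(f"unknown view/query type {name!r} for {storage_type}")
    warnings.warn(
        f"{name!r} on {storage_type} is deprecated; use {legacy!r} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return legacy
```

Note that "READ_OPTIMIZED" is the one string whose meaning depends on the
storage type: on COW it is an old name for the Snapshot query, while on MOR
it stays as is.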

*3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated since the
design was very similar to https://en.wikipedia.org/wiki/Copy-on-write
filesystems & snapshotting, and we once hoped to push some of this logic
into the storage itself, all in vain. But the name stuck, even though once
we had MERGE_ON_READ the focus was often on merge costs etc., which the
name COPY_ON_WRITE does not convey directly. I don't feel very strongly
about this, and there is also a cost to changing it, since it's persisted
inside hoodie.properties and we will support both strings internally in
code for backwards compatibility anyway.
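
As a rough sketch of what reading the persisted value could look like if
both strings are accepted (Python, illustrative only): the helper names and
the alias table are hypothetical, and the property key used as the default
below is an assumption on my part, not something settled in this thread.

```python
# Hypothetical sketch; the alias table and helper names are illustrative only.
TABLE_TYPE_ALIASES = {
    "COPY_ON_WRITE": "MERGE_ON_WRITE",   # old persisted value
    "MERGE_ON_WRITE": "MERGE_ON_WRITE",  # proposed new value
    "MERGE_ON_READ": "MERGE_ON_READ",    # unchanged
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def read_table_type(properties_text, key="hoodie.table.type"):
    """Return the table type under the new naming, accepting either string."""
    raw = parse_properties(properties_text)[key]
    try:
        return TABLE_TYPE_ALIASES[raw.upper()]
    except KeyError:
        raise ValueError(f"unknown table type: {raw!r}") from None
```

Existing tables would keep working without rewriting hoodie.properties,
since the old value is only translated when it is read.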

Naming something is very hard (yes, try :)). I believe these changes will
make the project simpler to understand for everyone out there. We also have
tons of new people here, so I am also happy to let go if it's already clear
:)

Please use the bullet number when you share your feedback so we know what
the discussion is about.

Thanks
Vinoth
