Hello all,

I wanted to raise an important topic with the community: whether we should rename some of the terminology in our code/docs to be more user-friendly and understandable.
Let me also provide some context for each, since I am probably guilty of introducing most of them in the first place :).

*1. Rename "views" to "query":*
Instead of saying incremental view or read-optimized view, talk about them as "incremental query" and "read-optimized query". The term "view" is very technical, and what I was trying to convey was that we ingest/store the data once and expose views on top. But new users (at least half a dozen of them, to me) tend to confuse this with the views/materialized views found in databases. We almost always talk about views in terms of the expected behavior of a query on the view. I am proposing to just call these different query types, since it's a more universally accepted terminology and IMO clearer.

*2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have Read-Optimized view only for MOR storage:*
This one is probably the trickiest. Hudi was always designed with MOR in mind, even as we were working on COW storage, and consequently we named the pure parquet-backed view Read-Optimized, hoping to name the parquet + avro based view Write-Optimized. However, we opted to name it Realtime to emphasize the data freshness aspect. In retrospect, the views should not have been named after their performance characteristics, but rather after the classes of queries done on them and the guarantees for those (see #1 above). Moreover, once we have parquet embedded into the log format, the tradeoffs may not be the same anyway. So, combining with the renaming proposed in #1, we would end up with the following:
Copy-On-Write:
[Old] Read-Optimized View => [New] Snapshot Query
[Old] Incremental View => [New] Incremental Query

Merge-On-Read:
[Old] Realtime View => [New] Snapshot Query
[Old] Incremental View => [New] Incremental Query
[Old] Read-Optimized View => [New] Read-Optimized Query (since it is always read-optimized compared to the Snapshot query, at the cost of staler data)

Both changes #1 & #2 could be simple changes to just code references, docs and configs. We can support both strings for some time and deprecate eventually, since queries are stateless.

*3. Rename COPY_ON_WRITE to MERGE_ON_WRITE:*
The name originated since the design was very similar to https://en.wikipedia.org/wiki/Copy-on-write filesystems & snapshotting, and we once hoped to push some of this logic into the storage itself, all in vain. But the name stuck, even though once we had MERGE_ON_READ the focus was often on merge costs etc., which the name COPY_ON_WRITE does not convey directly. I don't feel very strongly about this, and there is also a cost to changing it, since it's persisted inside hoodie.properties and we will support both strings internally in code for backwards compatibility anyway.

Naming something is very hard (yes, try :)). I believe these changes will make the project simpler to understand for everyone out there. We also have tons of new people here, so I am also happy to let go if it's already clear :)

Please use the bullet number when you share your feedback, so we know what the discussion is about.

Thanks
Vinoth
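
P.S. To make the "support both strings for some time" idea in #1 and #2 a bit more concrete, here is a minimal sketch of how old view names and new query names could resolve to the same query type. This is not actual Hudi code: the class, enum and method names (QueryTypeNames, StorageType, QueryType, resolve) and the exact strings accepted are assumptions made up for illustration. Note that the legacy "read-optimized" name resolves differently depending on the storage type, following the mapping above.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only; none of these names exist in Hudi.
final class QueryTypeNames {

    enum StorageType { COPY_ON_WRITE, MERGE_ON_READ }

    enum QueryType { SNAPSHOT, INCREMENTAL, READ_OPTIMIZED }

    private QueryTypeNames() {}

    // Accepts both the old view names and the proposed query names.
    // The string values are assumed spellings, not confirmed config values.
    static QueryType resolve(StorageType storage, String name) {
        String key = name.trim().toLowerCase(Locale.ROOT).replace('-', '_');
        switch (key) {
            case "snapshot":
                return QueryType.SNAPSHOT;
            case "incremental":
                return QueryType.INCREMENTAL;
            case "realtime":
                // [Old] Realtime View => [New] Snapshot Query
                return QueryType.SNAPSHOT;
            case "read_optimized":
                // On COW the old Read-Optimized View becomes the Snapshot Query;
                // on MOR it stays Read-Optimized.
                return storage == StorageType.COPY_ON_WRITE
                        ? QueryType.SNAPSHOT
                        : QueryType.READ_OPTIMIZED;
            default:
                throw new IllegalArgumentException("Unknown query/view type: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(StorageType.COPY_ON_WRITE, "read_optimized"));
        System.out.println(resolve(StorageType.MERGE_ON_READ, "realtime"));
        System.out.println(resolve(StorageType.MERGE_ON_READ, "Read-Optimized"));
    }
}
```

Since queries are stateless, a lookup like this would only need to live for as long as the old names are still accepted; a deprecation warning could be logged in the legacy branches before they are dropped.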
