Re: [DISCUSS] Simplification of terminologies

Bhavani Sudha Mon, 11 Nov 2019 17:05:43 -0800

+1 on all three rename proposals. I think this would make the concepts
super easy to follow for new users.


If changing [3] seems to be a stretch, we should definitely do [1] & [2] at
the least IMO. I will be glad to help out on the renames to whatever extent
possible should the Hudi community incline to pursue this.

Thanks,
Sudha



On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar <[email protected]> wrote:

> Hello all,
>
> I wanted to raise an important topic with the community around whether we
> should rename some of our terminologies in code/docs to be more
> user-friendly and understandable.
>
> Let me also provide some context for each, since I am probably guilty of
> introducing most of them in the first place :).
>
> *1. Rename "views" to "query" : *Instead of saying incremental view or
> read-optimized view, talk about them as "incremental query" and
> "read-optimized query". The term "view" is very technical, and what I was
> trying to convey was that we ingest/store the data once and expose views on
> top. But new users (at least half a dozen of them, to me) tend to confuse
> this with views/materialized views found in databases. We almost always
> talk about views in terms of the expected behavior of a query on the view. I
> am proposing to just call these different query types, since it's a more
> universally accepted terminology and IMO clearer.
>
> *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> Read-Optimized view only for MOR storage :* This one is probably the
> trickiest. Hudi was always designed with MOR in mind, even as we were
> working on COW storage and consequently we named the pure parquet backed
> view as Read-Optimized, hoping to name parquet + avro based view as
> Write-Optimized. However, we opted to name it Realtime to emphasize the
> data freshness aspect. In retrospect, the views should not have been named
> after their performance characteristics but rather after the classes of
> queries done on them and the guarantees for those (point #1 above).
> Moreover, once we have parquet embedded into the log format, the tradeoffs
> may not be the same anyway.
>
> So combining with the renaming proposed in #1, we would end up with the
> following:
>
> Copy-On-Write :
> [Old]  Read-Optimized View =>  [New] Snapshot Query
> [Old]  Incremental View => [New] Incremental Query
>
> Merge-On-Read:
> [Old] Realtime View => [New] Snapshot Query
> [Old] Incremental View => [New] Incremental Query
> [Old] Read-Optimized View => [New] Read-Optimized Query (since it is always
> read optimized compared to the Snapshot query, at the cost of staler data)
>
> Both changes #1 & #2 could be simpler changes to just code references, docs
> and configs. We can support both strings for some time and deprecate the
> old ones eventually, since queries are stateless.
>
> *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated since the
> design was very similar to copy-on-write filesystems & snapshotting
> (https://en.wikipedia.org/wiki/Copy-on-write), and we once hoped to push
> some of this logic into the storage itself, all in vain. But the name stuck,
> even though once we had MERGE_ON_READ the focus was often on merge costs
> etc., which the name COPY_ON_WRITE does not convey directly. I don't feel
> very strongly about this, and there is also a cost to changing it since it
> is persisted inside hoodie.properties and we will support both strings
> internally in code for backwards compatibility anyway.
>
> Naming something is very hard (yes, try :)). I believe these changes will
> make the project simpler to understand for everyone out there. We also have
> tons of new people here, so I am also happy to let go, if it's already clear
> :)
>
> Please use the bullet number when you share your feedback so we know what
> the discussion is about.
>
> Thanks
> Vinoth
>
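
For reference on [1] & [2], here is a rough sketch of how both the old view
names and the new query names could be accepted side by side during a
deprecation window. It is only an illustration of the mapping in the quoted
proposal, not Hudi code: the string values, the enum names and the
resolve_query_type helper are all hypothetical.

```python
from enum import Enum


class TableType(Enum):
    COPY_ON_WRITE = "COPY_ON_WRITE"
    MERGE_ON_READ = "MERGE_ON_READ"


class QueryType(Enum):
    SNAPSHOT = "snapshot"
    INCREMENTAL = "incremental"
    READ_OPTIMIZED = "read_optimized"


# Hypothetical accepted names per table type. Old view names and new query
# names sit in the same table, following the mapping in the proposal.
_ACCEPTED_NAMES = {
    TableType.COPY_ON_WRITE: {
        "snapshot": QueryType.SNAPSHOT,
        "read_optimized": QueryType.SNAPSHOT,  # old Read-Optimized view
        "incremental": QueryType.INCREMENTAL,
    },
    TableType.MERGE_ON_READ: {
        "snapshot": QueryType.SNAPSHOT,
        "realtime": QueryType.SNAPSHOT,  # old Realtime view
        "incremental": QueryType.INCREMENTAL,
        "read_optimized": QueryType.READ_OPTIMIZED,
    },
}


def resolve_query_type(name, table_type):
    """Map an old view name or a new query name to a QueryType."""
    # Normalize so "Read-Optimized" and "read_optimized" are treated alike.
    key = name.strip().lower().replace("-", "_")
    try:
        return _ACCEPTED_NAMES[table_type][key]
    except KeyError:
        raise ValueError(
            f"unknown query type {name!r} for table type {table_type.value}"
        ) from None
```

Under these assumptions, "Read-Optimized" on a Copy-On-Write table resolves
to the snapshot query, while the same string on a Merge-On-Read table keeps
its read-optimized meaning. Since queries are stateless, the old entries
could simply be dropped from the table once the deprecation period ends.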
