Re: [DISCUSS] Simplification of terminologies

Shiyan Xu Mon, 11 Nov 2019 19:35:57 -0800

[1] +1; "query" indeed sounds better
[2] +1 on the term "snapshot"; so basically we follow the convention that
when we say "snapshot", it means "give me the most up-to-date facts (lowest
data latency) even if it takes some query time"
[3] Though I agree with the renaming, I have a different perspective to
raise on the table types:
MOR is a superset of COW; I suppose a user can theoretically configure a
Hudi streamer to write to a MOR table and make it behave equivalently to a
COW table, am I right? I imagine that involves scheduling compaction/clean
right after write operations and hence making RO view and RT view close to
each other. So what will be the advantage of defining COW tables if MOR can
do everything? Would be happy to get more insights on the benefits of
defining COW over MOR.
So based on COW being a subset of MOR, obtainable through a special
configuration of MOR, my thoughts are: can we just keep MOR and deprecate
COW? In cases where users don't need the RT view, can we provide a flag like
"--disable-realtime-view/query" to help achieve the original COW features?
So back to the renaming: if COW can be achieved by changing configs of MOR,
then we could potentially save the hassle of renaming and just deprecate
the type.


On Mon, Nov 11, 2019 at 5:05 PM Bhavani Sudha <[email protected]>
wrote:

> +1 on all three rename proposals. I think this would make the concepts
> super easy to follow for new users.
>
> If changing [3] seems to be a stretch, we should definitely do [1] & [2] at
> the least IMO. I will be glad to help out on the renames to whatever extent
> possible, should the Hudi community be inclined to pursue this.
>
> Thanks,
> Sudha
>
>
>
> On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hello all,
> >
> > I wanted to raise an important topic with the community around whether we
> > should rename some of our terminologies in code/docs to be more
> > user-friendly and understandable.
> >
> > Let me also provide some context for each, since I am probably guilty of
> > introducing most of them in the first place :).
> >
> > *1. Rename "views" to "query" : *Instead of saying incremental view or
> > read-optimized view, talk about them as "incremental query" and
> > "read-optimized query". The term "view" is very technical, and what I was
> > trying to convey was that we ingest/store the data once and expose views
> > on top. But new users (at least half a dozen of them, to me) tend to
> > confuse this with views/materialized views found in databases. Almost
> > always, we talk about views in terms of the expected behavior of a query
> > on the view. I am proposing to just call these different query types,
> > since it's a more universally accepted terminology and IMO clearer.
> >
> > *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> > Read-Optimized view only for MOR storage :* This one is probably the
> > trickiest. Hudi was always designed with MOR in mind, even as we were
> > working on COW storage, and consequently we named the pure parquet backed
> > view as Read-Optimized, hoping to name the parquet + avro based view as
> > Write-Optimized. However, we opted to name it Realtime to emphasize the
> > data freshness aspect. In retrospect, the views should not have been
> > named after their performance characteristics but rather after the
> > classes of queries done on them and the guarantees for those (point #1
> > above). Moreover, once we have parquet embedded into the log format, the
> > tradeoffs may not be the same anyway.
> >
> > So combining with the renaming proposed in #1, we would end up with the
> > following..
> >
> > Copy-On-Write:
> > [Old] Read-Optimized View => [New] Snapshot Query
> > [Old] Incremental View => [New] Incremental Query
> >
> > Merge-On-Read:
> > [Old] Realtime View => [New] Snapshot Query
> > [Old] Incremental View => [New] Incremental Query
> > [Old] Read-Optimized View => [New] Read-Optimized Query (since it is
> > always read-optimized compared to the Snapshot query, at the cost of
> > staler data)
> >
> > Both changes #1 & #2 could be simpler changes to just code references,
> > docs and configs. We can support both strings for some time and
> > deprecate the old ones eventually, since queries are stateless.
> >
> > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated since
> > the design was very similar to
> > https://en.wikipedia.org/wiki/Copy-on-write filesystems & snapshotting,
> > and we once hoped to push some of this logic into the storage itself, all
> > in vain. But the name stuck, even though once we had MERGE_ON_READ the
> > focus was often on merge costs etc., which the name COPY_ON_WRITE does
> > not convey directly. I don't feel very strongly about this, and there is
> > also a cost to changing it, since it's persisted inside
> > hoodie.properties and we will support both strings internally in code
> > for backwards compatibility anyway.
> >
> > Naming something is very hard (yes, try :)). I believe these changes will
> > make the project simpler to understand for everyone out there. We also
> > have tons of new people here, so I am also happy to let go, if it's
> > already clear :)
> >
> > Please use the bullet number when you share your feedback so we know what
> > the discussion is about.
> >
> > Thanks
> > Vinoth
> >
>
