Re: [DISCUSS] Simplification of terminologies

Balaji Varadarajan Mon, 11 Nov 2019 23:56:27 -0800

Agree with all 3 changes. The naming now looks more consistent than
earlier. +1 on them


Depending on whether we are renaming Input formats for (1) and (2) - this
could require some migration steps for

Balaji.V


On Mon, Nov 11, 2019 at 7:38 PM vino yang <[email protected]> wrote:

> Hi Vinoth,
>
> Thanks for bringing these proposals.
>
> +1 on all three. Especially, big +1 on the third renaming proposal.
>
> When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The word
> "copy" easily misleads users and makes them compare it with the
> `CopyOnWriteArrayList` data structure provided by the JDK and think of
> file systems.
>
> Best,
> Vino
>
>
> On Tue, Nov 12, 2019 at 9:05 AM Bhavani Sudha <[email protected]> wrote:
>
> > +1 on all three rename proposals. I think this would make the concepts
> > super easy to follow for new users.
> >
> > If changing [3] seems to be a stretch, we should definitely do [1] & [2] at
> > the least IMO. I will be glad to help out on the renames to whatever extent
> > possible should the Hudi community be inclined to pursue this.
> >
> > Thanks,
> > Sudha
> >
> >
> >
> > On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar <[email protected]> wrote:
> >
> > > Hello all,
> > >
> > > I wanted to raise an important topic with the community around whether we
> > > should rename some of our terminologies in code/docs to be more
> > > user-friendly and understandable.
> > >
> > > Let me also provide some context for each, since I am probably guilty of
> > > introducing most of them in the first place :).
> > >
> > > *1. Rename "views" to "query" : *Instead of saying incremental view or
> > > read-optimized view, talk about them as "incremental query" and
> > > "read-optimized query". The term "view" is very technical, and what I was
> > > trying to convey was that we ingest/store the data once and expose views on
> > > top. But new users (at least half a dozen of them, to me) tend to confuse
> > > this with views/materialized views found in databases. Almost always, we
> > > talk about views in terms of the expected behavior of a query on the view.
> > > I am proposing to just call these different query types, since it's a more
> > > universally accepted terminology and IMO clearer.
> > >
> > > *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> > > Read-Optimized view only for MOR storage :* This one is probably the
> > > trickiest. Hudi was always designed with MOR in mind, even as we were
> > > working on COW storage, and consequently we named the pure parquet backed
> > > view Read-Optimized, hoping to name the parquet + avro based view
> > > Write-Optimized. However, we opted to name it Realtime to emphasize the
> > > data freshness aspect. In retrospect, the views should not have been named
> > > after their performance characteristics but rather after the classes of
> > > queries done on them and the guarantees for those (point #1 above).
> > > Moreover, once we have parquet embedded into the log format, the tradeoffs
> > > may not be the same anyway.
> > >
> > > So combining with the renaming proposed in #1, we would end up with the
> > > following..
> > >
> > > Copy-On-Write :
> > > [Old]  Read-Optimized View =>  [New] Snapshot Query
> > > [Old]  Incremental View => [New] Incremental Query
> > >
> > > Merge-On-Read:
> > > [Old] Realtime View => [New] Snapshot Query
> > > [Old] Incremental View => [New] Incremental Query
> > > [Old] Read-Optimized View => [New] Read-Optimized Query (since it is always
> > > read optimized compared to Snapshot query, at the cost of staler data)
> > >
> > > Both changes #1 & #2 could be simpler changes to just code references, docs
> > > and configs. We can support both strings for some time and deprecate
> > > eventually, since queries are stateless.
> > >
> > > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated since the
> > > design was very similar to https://en.wikipedia.org/wiki/Copy-on-write
> > > filesystems & snapshotting, and we once hoped to push some of this logic
> > > into the storage itself, all in vain. But the name stuck, even though once
> > > we had MERGE_ON_READ the focus was often on merge costs etc., which the
> > > name COPY_ON_WRITE does not convey directly. I don't feel very strongly
> > > about this, and there is also a cost to changing it, since it's persisted
> > > inside hoodie.properties and we will support both strings internally in
> > > code for backwards compatibility anyway.
> > >
> > > Naming something is very hard (yes, try :)). I believe these changes will
> > > make the project simpler to understand for everyone out there. We also have
> > > tons of new people here, so I am also happy to let go, if it's already
> > > clear :)
> > >
> > > Please use the bullet number when you share your feedback so we know what
> > > the discussion is about.
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>
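
For readers following the thread, the "support both strings for some time and deprecate eventually" idea from points 1 and 2 could be sketched roughly as below. This is only an illustration under stated assumptions, not code from Hudi: the function name `resolve_query_type`, the alias table, and the lowercase identifiers such as `snapshot_query` are all hypothetical. Only the old-to-new mapping itself is taken from the proposal above.

```python
import warnings

# Hypothetical alias table (not Hudi's actual config values): legacy view
# names mapped to the proposed query names, per table type, following the
# mapping listed in the proposal.
LEGACY_TO_NEW = {
    "COPY_ON_WRITE": {
        "read_optimized_view": "snapshot_query",
        "incremental_view": "incremental_query",
    },
    "MERGE_ON_READ": {
        "realtime_view": "snapshot_query",
        "incremental_view": "incremental_query",
        "read_optimized_view": "read_optimized_query",
    },
}


def resolve_query_type(table_type, name):
    """Accept either an old or a new name and return the new name.

    Old names still work but emit a DeprecationWarning, so they can be
    removed in a later release.
    """
    if table_type not in LEGACY_TO_NEW:
        raise ValueError(f"unknown table type: {table_type!r}")
    aliases = LEGACY_TO_NEW[table_type]
    # Normalize "Read-Optimized View" and "read_optimized_view" to one key.
    key = name.strip().lower().replace("-", "_").replace(" ", "_")
    if key in set(aliases.values()):
        return key  # already a new-style name valid for this table type
    if key in aliases:
        warnings.warn(
            f"{name!r} is deprecated; use {aliases[key]!r} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return aliases[key]
    raise ValueError(f"{name!r} is not a valid query type for {table_type}")
```

In this sketch, the same legacy string resolves differently depending on the table type, which reflects the proposal that a Read-Optimized query exists only for Merge-On-Read. A similar alias lookup could in principle cover the table type string in point 3, but since that value is persisted in hoodie.properties, how it would be migrated is left open here.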
