Re: [DISCUSS] Simplification of terminologies

nishith agarwal Tue, 12 Nov 2019 08:43:59 -0800

+1 on the first two, don't feel strongly about (3).

Thanks,
Nishith


On Tue, Nov 12, 2019 at 5:03 AM leesf <[email protected]> wrote:

> [1] +1. `views` indeed confused me a lot.
> [2] +1. `snapshot` is more reasonable.
> [3] I don't feel strongly about renaming it. The current name
> `COPY_ON_WRITE` is reasonable considering the cost of renaming and the
> behavior: a new version of the parquet file is created and appears to be
> copied from the old version of the parquet file.
>
> Best,
> Leesf
>
> Balaji Varadarajan <[email protected]> wrote on Tue, Nov 12, 2019 at 3:55 PM:
>
> > Agree with all 3 changes. The naming now looks more consistent than
> > earlier. +1 on them
> >
> > Depending on whether we are renaming Input formats for (1) and (2) - this
> > could require some migration steps for
> >
> > Balaji.V
> >
> >
> > On Mon, Nov 11, 2019 at 7:38 PM vino yang <[email protected]> wrote:
> >
> > > Hi Vinoth,
> > >
> > > Thanks for bringing these proposals.
> > >
> > > +1 on all three. Especially, big +1 on the third renaming proposal.
> > >
> > > When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The
> > > "copy" term easily misleads users and makes them compare it with the
> > > `CopyOnWriteArrayList` data structure provided by the JDK and think of
> > > file systems.
> > >
> > > Best,
> > > Vino
> > >
> > >
> > > Bhavani Sudha <[email protected]> wrote on Tue, Nov 12, 2019 at 9:05 AM:
> > >
> > > > +1 on all three rename proposals. I think this would make the
> > > > concepts super easy to follow for new users.
> > > >
> > > > If changing [3] seems to be a stretch, we should definitely do [1] &
> > > > [2] at the least IMO. I will be glad to help out on the renames to
> > > > whatever extent possible should the Hudi community be inclined to
> > > > pursue this.
> > > >
> > > > Thanks,
> > > > Sudha
> > > >
> > > >
> > > >
> > > > On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > > I wanted to raise an important topic with the community around
> > > > > > whether we should rename some of our terminologies in code/docs
> > > > > > to be more user-friendly and understandable.
> > > > >
> > > > > > Let me also provide some context for each, since I am probably
> > > > > > guilty of introducing most of them in the first place :).
> > > > >
> > > > > > *1. Rename "views" to "query" : *Instead of saying incremental
> > > > > > view or read-optimized view, talk about them as "incremental
> > > > > > query" and "read-optimized query". The term "view" is very
> > > > > > technical, and what I was trying to convey was that we
> > > > > > ingest/store the data once and expose views on top. But new users
> > > > > > (at least half a dozen of them to me) tend to confuse this with
> > > > > > views/materialized views found in databases. We almost always
> > > > > > talk about views in terms of expected behavior for a query on the
> > > > > > view. I am proposing to just call these different query types,
> > > > > > since it's a more universally accepted terminology and IMO
> > > > > > clearer.
> > > > >
> > > > > > *2. Rename "Read-Optimized/Realtime" views to Snapshot views +
> > > > > > Have Read-Optimized view only for MOR storage :* This one is
> > > > > > probably the trickiest. Hudi was always designed with MOR in
> > > > > > mind, even as we were working on COW storage, and consequently we
> > > > > > named the pure parquet backed view Read-Optimized, hoping to name
> > > > > > the parquet + avro based view Write-Optimized. However, we opted
> > > > > > to name it Realtime to emphasize the data freshness aspect. In
> > > > > > retrospect, the views should not have been named after their
> > > > > > performance characteristics but rather after the classes of
> > > > > > queries done on them and the guarantees for those (see point #1
> > > > > > above). Moreover, once we have parquet embedded into the log
> > > > > > format, the tradeoffs may not be the same anyway.
> > > > >
> > > > > > So combining with the renaming proposed in #1, we would end up
> > > > > > with the following:
> > > > >
> > > > > > Copy-On-Write:
> > > > > > [Old] Read-Optimized View => [New] Snapshot Query
> > > > > > [Old] Incremental View => [New] Incremental Query
> > > > > >
> > > > > > Merge-On-Read:
> > > > > > [Old] Realtime View => [New] Snapshot Query
> > > > > > [Old] Incremental View => [New] Incremental Query
> > > > > > [Old] Read-Optimized View => [New] Read-Optimized Query (since it
> > > > > > is always read-optimized compared to the Snapshot query, at the
> > > > > > cost of staler data)
> > > > >
> > > > > > Both changes #1 & #2 could be simpler changes to just code
> > > > > > references, docs and configs. We can support both strings for
> > > > > > some time and deprecate eventually, since queries are stateless.
> > > > >
> > > > > > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated
> > > > > > since the design was very similar to
> > > > > > https://en.wikipedia.org/wiki/Copy-on-write filesystems &
> > > > > > snapshotting, and we once hoped to push some of this logic into
> > > > > > the storage itself, all in vain. But the name stuck, even though
> > > > > > once we had MERGE_ON_READ the focus was often on merge costs
> > > > > > etc., which the name COPY_ON_WRITE does not convey directly. I
> > > > > > don't feel very strongly about this, and there is also a cost to
> > > > > > changing it since it's persisted inside hoodie.properties and we
> > > > > > will support both strings internally in code for backwards
> > > > > > compatibility anyway.
> > > > >
> > > > > > Naming something is very hard (yes, try :)). I believe these
> > > > > > changes will make the project simpler to understand for everyone
> > > > > > out there. We also have tons of new people here, so I am also
> > > > > > happy to let go if it's already clear :)
> > > > >
> > > > > > Please use the bullet number when you share your feedback so we
> > > > > > know what the discussion is about.
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > >
> >
>
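
For readers following the thread: the idea in #1 and #2 of accepting both the old and the new strings for a while and deprecating the old ones later could be sketched roughly as below. This is only an illustrative Python sketch under stated assumptions, not Hudi code. The lowercase identifiers (such as `read_optimized_view`) and the helper `resolve_query_type` are hypothetical; only the old/new name pairs are taken from the proposal.

```python
import warnings

# Old view name -> new query name, per table type, as listed in the proposal.
# The lowercase string identifiers are hypothetical; the thread only gives
# the human-readable names.
LEGACY_VIEW_TO_QUERY = {
    "COPY_ON_WRITE": {
        "read_optimized_view": "snapshot_query",
        "incremental_view": "incremental_query",
    },
    "MERGE_ON_READ": {
        "realtime_view": "snapshot_query",
        "incremental_view": "incremental_query",
        "read_optimized_view": "read_optimized_query",
    },
}


def resolve_query_type(table_type, name):
    """Return the new query name, accepting old view names with a warning."""
    mapping = LEGACY_VIEW_TO_QUERY[table_type]
    if name in mapping.values():
        # Already a new-style name that is valid for this table type.
        return name
    if name in mapping:
        new_name = mapping[name]
        warnings.warn(
            f"'{name}' is deprecated; use '{new_name}'",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    raise ValueError(f"unknown query type '{name}' for {table_type}")
```

Note that the same old name, "Read-Optimized View", resolves differently depending on the table type, which is why the lookup is keyed on the table type first. A read-optimized query on a copy-on-write table is rejected in this sketch, matching the proposal to keep it for MOR only.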
