Re: [DISCUSS] Simplification of terminologies

leesf Tue, 12 Nov 2019 05:04:40 -0800

[1] +1. `views` indeed confused me a lot.
[2] +1. `snapshot` is more reasonable.
[3] I don't feel strongly about renaming it. The current name `COPY_ON_WRITE`
is reasonable considering the cost of renaming and the behavior: a new
version of the parquet file is created and appears to be copied from the old
version of the parquet file.


Best,
Leesf

Balaji Varadarajan <vbal...@apache.org> wrote on Tue, Nov 12, 2019 at 3:55 PM:

> Agree with all 3 changes. The naming now looks more consistent than
> earlier. +1 on them
>
> Depending on whether we are renaming Input formats for (1) and (2) - this
> could require some migration steps for
>
> Balaji.V
>
>
> On Mon, Nov 11, 2019 at 7:38 PM vino yang <yanghua1...@gmail.com> wrote:
>
> > Hi Vinoth,
> >
> > Thanks for bringing these proposals.
> >
> > +1 on all three. Especially, big +1 on the third renaming proposal.
> >
> > When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The
> > "copy" term easily misleads users and makes them compare it with the
> > `CopyOnWriteArrayList` data structure provided by the JDK and think of
> > file systems.
> >
> > Best,
> > Vino
> >
> >
> > Bhavani Sudha <bhavanisud...@gmail.com> wrote on Tue, Nov 12, 2019 at 9:05 AM:
> >
> > > +1 on all three rename proposals. I think this would make the concepts
> > > super easy to follow for new users.
> > >
> > > If changing [3] seems to be a stretch, we should definitely do [1] & [2]
> > > at the least IMO. I will be glad to help out on the renames to whatever
> > > extent possible should the Hudi community be inclined to pursue this.
> > >
> > > Thanks,
> > > Sudha
> > >
> > >
> > >
> > > On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar <vin...@apache.org> wrote:
> > >
> > > > Hello all,
> > > >
> > > > I wanted to raise an important topic with the community: whether we
> > > > should rename some of our terminology in code/docs to be more
> > > > user-friendly and understandable.
> > > >
> > > > Let me also provide some context for each, since I am probably guilty
> > > > of introducing most of them in the first place :).
> > > >
> > > > *1. Rename "views" to "query" : *Instead of saying incremental view or
> > > > read-optimized view, talk about them as "incremental query" and
> > > > "read-optimized query". The term "view" is very technical, and what I
> > > > was trying to convey was that we ingest/store the data once and expose
> > > > views on top. But new users (at least half a dozen of them, to me)
> > > > tend to confuse this with views/materialized views found in databases.
> > > > Almost always, we talk about views in terms of the expected behavior
> > > > of a query on the view. I am proposing to just call these different
> > > > query types, since it's a more universally accepted terminology and
> > > > IMO clearer.
> > > >
> > > > *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> > > > Read-Optimized view only for MOR storage :* This one is probably the
> > > > trickiest. Hudi was always designed with MOR in mind, even as we were
> > > > working on COW storage, and consequently we named the pure parquet
> > > > backed view Read-Optimized, hoping to name the parquet + avro based
> > > > view Write-Optimized. However, we opted to name it Realtime to
> > > > emphasize the data freshness aspect. In retrospect, the views should
> > > > not have been named after their performance characteristics but
> > > > rather after the classes of queries done on them and the guarantees
> > > > for those (point #1 above). Moreover, once we have parquet embedded
> > > > into the log format, the tradeoffs may not be the same anyway.
> > > >
> > > > So combining with the renaming proposed in #1, we would end up with
> > > > the following:
> > > >
> > > > Copy-On-Write :
> > > > [Old]  Read-Optimized View =>  [New] Snapshot Query
> > > > [Old]  Incremental View => [New] Incremental Query
> > > >
> > > > Merge-On-Read:
> > > > [Old] Realtime View => [New] Snapshot Query
> > > > [Old] Incremental View => [New] Incremental Query
> > > > [Old] Read-Optimized View => [New] Read-Optimized Query (since it is
> > > > always read-optimized compared to the Snapshot query, at the cost of
> > > > staler data)
> > > >
> > > > Both changes #1 & #2 could be simpler changes to just code references,
> > > > docs and configs. We can support both strings for some time and
> > > > deprecate the old ones eventually, since queries are stateless.
> > > >
> > > > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated
> > > > since the design was very similar to
> > > > https://en.wikipedia.org/wiki/Copy-on-write filesystems &
> > > > snapshotting, and we once hoped to push some of this logic into the
> > > > storage itself, all in vain. But the name stuck, even though once we
> > > > had MERGE_ON_READ the focus was often on merge costs etc, which the
> > > > name COPY_ON_WRITE does not convey directly. I don't feel very
> > > > strongly about this, and there is also a cost to changing it since
> > > > it's persisted inside hoodie.properties, and we will support both
> > > > strings internally in code for backwards compatibility anyway.
> > > >
> > > > Naming something is very hard (yes, try :)). I believe these changes
> > > > will make the project simpler to understand for everyone out there.
> > > > We also have tons of new people here, so I am also happy to let go if
> > > > it's already clear :)
> > > >
> > > > Please use the bullet number when you share your feedback so we know
> > > > what the discussion is about.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>
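
For readers following the thread, the old-to-new mapping in proposals #1 and #2, together with the idea of accepting both old and new strings for a while, might look roughly like the sketch below. This is only an illustration under assumptions: the class `TerminologyAliases`, its enums, and both `parse...` methods are hypothetical names invented here and are not Hudi's actual API. Treating `MERGE_ON_WRITE` as an alias of `COPY_ON_WRITE` is one possible way to read "support both strings internally" in proposal #3.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Hudi. */
final class TerminologyAliases {

    /** New query-type names from proposals #1 and #2. */
    enum QueryType { SNAPSHOT, INCREMENTAL, READ_OPTIMIZED }

    /** Table types; the proposed MERGE_ON_WRITE is handled as an alias. */
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    private TerminologyAliases() {}

    /** Accepts both the existing and the proposed table-type string. */
    static TableType parseTableType(String raw) {
        switch (normalize(raw)) {
            case "COPY_ON_WRITE":
            case "MERGE_ON_WRITE": // proposed name from #3, assumed to be an alias
                return TableType.COPY_ON_WRITE;
            case "MERGE_ON_READ":
                return TableType.MERGE_ON_READ;
            default:
                throw new IllegalArgumentException("Unknown table type: " + raw);
        }
    }

    /** Maps old view names and new query names onto the new query types. */
    static QueryType parseQueryType(TableType tableType, String raw) {
        switch (normalize(raw)) {
            case "SNAPSHOT":
            case "REALTIME": // old MOR "Realtime View" => Snapshot Query
                return QueryType.SNAPSHOT;
            case "INCREMENTAL":
                return QueryType.INCREMENTAL;
            case "READ_OPTIMIZED":
                // Old COW "Read-Optimized View" => Snapshot Query;
                // on MOR the read-optimized query keeps its name.
                return tableType == TableType.COPY_ON_WRITE
                        ? QueryType.SNAPSHOT
                        : QueryType.READ_OPTIMIZED;
            default:
                throw new IllegalArgumentException("Unknown query type: " + raw);
        }
    }

    /** Upper-cases, unifies separators, and drops a trailing VIEW/QUERY word. */
    private static String normalize(String raw) {
        return raw.trim()
                .toUpperCase(Locale.ROOT)
                .replace('-', '_')
                .replace(' ', '_')
                .replaceAll("_(VIEW|QUERY)$", "");
    }
}
```

Because queries are stateless, an alias table like this only needs to live in the parsing path; the table-type alias is different in that the stored value in hoodie.properties would still have to be read under either name.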
