Re: [DISCUSS] Simplification of terminologies

Y. Ethan Guo Tue, 12 Nov 2019 14:44:46 -0800

+1 on [1] and [2].

For [3], I have similar doubts as Shiyan.


For the naming, I can understand the original intent of the COW analogy,
which is that another "copy" of the columnar/parquet file is made whenever
records in the file are modified or updated.  From a system design point of
view, it's easy to understand.  I'm okay with renaming it to
"MERGE_ON_WRITE", since that is probably more straightforward for users at
first glance.

In terms of the concept, COW and MOR are listed as storage/table types.
From my understanding, they represent different tradeoffs of the
performance between reading and writing Hudi tables, and within MOR there
are different tradeoffs, e.g., lazy merge on read or periodic compaction
and cleaning pipelined along ingestion. It looks like these can be
controlled through configs, e.g., "disable_merge_on_write",
"compaction_frequency", etc., instead of fixing the storage type, to
control the tradeoff that a user would like to make.  The requirement may
change so a user can switch between COW and MOR by tuning the configs. We
don't have to make such changes now, but I'm wondering if this is something
worth considering in the future releases.
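
To make the idea a bit more concrete, here is a rough sketch in Python.  The
config keys are the hypothetical ones from the paragraph above, not existing
Hudi options, and the selection logic is only my assumption about how such a
switch might behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeoffConfig:
    """Hypothetical knobs named after the examples above; not real Hudi options."""
    disable_merge_on_write: bool = False
    compaction_frequency: int = 0  # compact every N delta commits; 0 = never


def effective_behavior(cfg: TradeoffConfig) -> str:
    """Map the hypothetical configs onto the read/write tradeoff they select."""
    if not cfg.disable_merge_on_write:
        # Updates rewrite the base file at write time (today's COW).
        return "merge_on_write"
    if cfg.compaction_frequency > 0:
        # Updates go to log files; compaction runs periodically along ingestion.
        return "merge_on_read_with_periodic_compaction"
    # Updates go to log files and are merged only when the table is queried.
    return "lazy_merge_on_read"
```

With something like this, moving from COW-like to MOR-like behavior would be a
config change, e.g. TradeoffConfig(disable_merge_on_write=True,
compaction_frequency=4), rather than a different table type.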

- Ethan

On Tue, Nov 12, 2019 at 8:43 AM nishith agarwal <[email protected]> wrote:

> +1 on the first two, don't feel strongly about (3).
>
> Thanks,
> Nishith
>
> On Tue, Nov 12, 2019 at 5:03 AM leesf <[email protected]> wrote:
>
> > [1] +1. `views` indeed confused me a lot.
> > [2] +1. `snapshot` is more reasonable.
> > [3] I don't feel strongly about renaming it. The current name
> > `COPY_ON_WRITE` is reasonable, considering the cost of renaming and the
> > fact that a new version of the parquet file is created and appears to be
> > copied from the old version.
> >
> > Best,
> > Leesf
> >
> > On Tue, Nov 12, 2019 at 3:55 PM Balaji Varadarajan <[email protected]> wrote:
> >
> > > Agree with all 3 changes. The naming now looks more consistent than
> > > earlier. +1 on them
> > >
> > > Depending on whether we are renaming Input formats for (1) and (2) - this
> > > could require some migration steps for
> > >
> > > Balaji.V
> > >
> > >
> > > On Mon, Nov 11, 2019 at 7:38 PM vino yang <[email protected]> wrote:
> > >
> > > > Hi Vinoth,
> > > >
> > > > Thanks for bringing these proposals.
> > > >
> > > > +1 on all three. Especially, big +1 on the third renaming proposal.
> > > >
> > > > When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The
> > > > word "copy" easily misleads users and makes them compare it with the
> > > > `CopyOnWriteArrayList` data structure provided by the JDK, or think of
> > > > copy-on-write file systems.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > >
> > > > On Tue, Nov 12, 2019 at 9:05 AM Bhavani Sudha <[email protected]> wrote:
> > > >
> > > > > +1 on all three rename proposals. I think this would make the
> > > > > concepts super easy to follow for new users.
> > > > >
> > > > > If changing [3] seems to be a stretch, we should definitely do [1] &
> > > > > [2] at the least, IMO. I will be glad to help out on the renames to
> > > > > whatever extent possible, should the Hudi community be inclined to
> > > > > pursue this.
> > > > >
> > > > > Thanks,
> > > > > Sudha
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar <[email protected]> wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > > I wanted to raise an important topic with the community around whether
> > > > > > > we should rename some of our terminologies in code/docs to be more
> > > > > > > user-friendly and understandable.
> > > > > > >
> > > > > > > Let me also provide some context for each, since I am probably guilty
> > > > > > > of introducing most of them in the first place :).
> > > > > >
> > > > > > > *1. Rename "views" to "query":* Instead of saying incremental view or
> > > > > > > read-optimized view, talk about them as "incremental query" and
> > > > > > > "read-optimized query". The term "view" is very technical, and what I
> > > > > > > was trying to convey was that we ingest/store the data once and expose
> > > > > > > views on top. But new users (at least half a dozen of them, to me) tend
> > > > > > > to confuse this with the views/materialized views found in databases.
> > > > > > > We almost always talk about views in terms of the expected behavior of
> > > > > > > a query on the view. I am proposing to just call these different query
> > > > > > > types, since it's a more universally accepted terminology and IMO
> > > > > > > clearer.
> > > > > >
> > > > > > > *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> > > > > > > Read-Optimized view only for MOR storage:* This one is probably the
> > > > > > > trickiest. Hudi was always designed with MOR in mind, even as we were
> > > > > > > working on COW storage, and consequently we named the pure parquet
> > > > > > > backed view Read-Optimized, hoping to name the parquet + avro based
> > > > > > > view Write-Optimized. However, we opted to name it Realtime to
> > > > > > > emphasize the data freshness aspect. In retrospect, the views should
> > > > > > > not have been named after their performance characteristics but rather
> > > > > > > after the classes of queries done on them and the guarantees for those
> > > > > > > (see point #1 above). Moreover, once we have parquet embedded into the
> > > > > > > log format, the tradeoffs may not be the same anyway.
> > > > > >
> > > > > > > So combining with the renaming proposed in #1, we would end up with the
> > > > > > > following:
> > > > > >
> > > > > > Copy-On-Write :
> > > > > > [Old]  Read-Optimized View =>  [New] Snapshot Query
> > > > > > [Old]  Incremental View => [New] Incremental Query
> > > > > >
> > > > > > Merge-On-Read:
> > > > > > [Old] Realtime View => [New] Snapshot Query
> > > > > > [Old] Incremental View => [New] Incremental Query
> > > > > > > [Old] Read-Optimized View => [New] Read-Optimized Query (since it is
> > > > > > > always read optimized compared to the Snapshot query, at the cost of
> > > > > > > staler data)
> > > > > >
> > > > > > > Both changes #1 & #2 could be simpler changes to just code references,
> > > > > > > docs and configs. We can support both strings for some time and
> > > > > > > deprecate the old ones eventually, since queries are stateless.
> > > > > >
> > > > > > > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE:* The name originated since
> > > > > > > the design was very similar to
> > > > > > > https://en.wikipedia.org/wiki/Copy-on-write filesystems & snapshotting,
> > > > > > > and we once hoped to push some of this logic into the storage itself,
> > > > > > > all in vain. But the name stuck, even though once we had MERGE_ON_READ
> > > > > > > the focus was often on merge costs etc., which the name COPY_ON_WRITE
> > > > > > > does not convey directly. I don't feel very strongly about this, and
> > > > > > > there is also a cost to changing it, since it's persisted inside
> > > > > > > hoodie.properties and we will support both strings internally in code
> > > > > > > for backwards compatibility anyway.
> > > > > >
> > > > > > > Naming something is very hard (yes, try :)). I believe these changes
> > > > > > > will make the project simpler to understand for everyone out there. We
> > > > > > > also have tons of new people here, so I am also happy to let go, if
> > > > > > > it's already clear :)
> > > > > >
> > > > > > > Please use the bullet number when you share your feedback so we know
> > > > > > > what the discussion is about.
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > >
> > > >
> > >
> >
>
