Re: [DISCUSS] Simplification of terminologies

Vinoth Chandar Wed, 13 Nov 2019 21:24:58 -0800

Will review the POC in cwiki.  +1

Based on this feedback, I will proceed with the changes. Thanks all!




On Tue, Nov 12, 2019 at 10:47 PM Semantic Beeng <[email protected]>
wrote:

> @vc, I think of it as elaborating the #ubiquitouslanguage in DDD.
> See private email with references to a small POC in wiki and decide how to
> proceed.
>
> On November 12, 2019 at 10:04 PM Vinoth Chandar < [email protected]>
> wrote:
>
>
> Thanks everyone for the feedback. Looks like we are in general agreement.
>
> I am inclined to do just 1 & 2 and leave COPY_ON_WRITE as is, based on the
> great points Ethan and Shiyan raised. Makes sense.
> I will still wait 1-2 days before closing this thread.
>
> @semanticbeeing That's a great idea. Is it more like a technical glossary of
> sorts? Let's maybe start a separate DISCUSS thread on that specific topic,
> so everyone can chime in and give that proposal more attention?
>
>
>
>
>
> On Tue, Nov 12, 2019 at 2:44 PM Y. Ethan Guo < [email protected]>
> wrote:
>
> +1 on [1] and [2].
>
> For [3], I have similar doubts as Shiyan.
>
> For the naming, I can understand the original intent of the analogy for COW,
> which is to make another "copy" of the columnar/parquet file upon
> modification/update of the records in the file. From the system design
> point of view, it's easy to understand. I'm okay with the renaming to
> "MERGE_ON_WRITE" since it's probably straightforward for users at first
> glance.
>
> In terms of the concept, COW and MOR are listed as storage/table types.
> From my understanding, they represent different tradeoffs between read and
> write performance on Hudi tables, and within MOR there are further
> tradeoffs, e.g., lazy merge on read or periodic compaction and cleaning
> pipelined along ingestion. It looks like these could be controlled through
> configs, e.g., "disable_merge_on_write", "compaction_frequency", etc.,
> instead of fixing the storage type, to control the tradeoff that a user
> would like to make. Requirements may change, so a user could then switch
> between COW and MOR by tuning the configs. We don't have to make such
> changes now, but I'm wondering if this is something worth considering in
> future releases.
>
> - Ethan
>
> On Tue, Nov 12, 2019 at 8:43 AM nishith agarwal < [email protected]>
> wrote:
>
> +1 on the first two, don't feel strongly about (3).
>
> Thanks,
> Nishith
>
> On Tue, Nov 12, 2019 at 5:03 AM leesf < [email protected]> wrote:
>
> [1] +1. `views` indeed confused me a lot.
> [2] +1. `snapshot` is more reasonable.
> [3] I don't feel strongly about renaming it. The current name
> `COPY_ON_WRITE` is reasonable, considering the cost of renaming and the
> actual behavior: a new version of the parquet file is created, and it
> appears to be copied from the old version of the file.
>
> Best,
> Leesf
>
> On Tue, Nov 12, 2019 at 3:55 PM Balaji Varadarajan < [email protected]> wrote:
>
> Agree with all 3 changes. The naming now looks more consistent than
> earlier. +1 on them
>
> Depending on whether we are renaming Input formats for (1) and (2), this
> could require some migration steps for
>
> Balaji.V
>
> On Mon, Nov 11, 2019 at 7:38 PM vino yang < [email protected]> wrote:
>
> Hi Vinoth,
>
> Thanks for bringing these proposals.
>
> +1 on all three. Especially, big +1 on the third renaming proposal.
>
> When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The word
> "copy" easily misleads users, and makes them compare it with the
> `CopyOnWriteArrayList` data structure provided by the JDK and think of
> file systems.
>
> Best,
> Vino
>
> On Tue, Nov 12, 2019 at 9:05 AM Bhavani Sudha < [email protected]> wrote:
>
> +1 on all three rename proposals. I think this would make the concepts
> super easy to follow for new users.
>
> If changing [3] seems to be a stretch, we should definitely do [1] & [2] at
> the least, IMO. I will be glad to help out on the renames to whatever extent
> possible, should the Hudi community be inclined to pursue this.
>
> Thanks,
> Sudha
>
> On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar < [email protected]> wrote:
>
> Hello all,
>
> I wanted to raise an important topic with the community: whether we should
> rename some of our terminology in code/docs to be more user-friendly and
> understandable.
>
> Let me also provide some context for each, since I am probably guilty of
> introducing most of them in the first place :).
>
> *1. Rename "views" to "query" :* Instead of saying incremental view or
> read-optimized view, talk about them as "incremental query" and
> "read-optimized query". The term "view" is very technical, and what I was
> trying to convey was that we ingest/store the data once and expose views on
> top. But new users (at least half a dozen of them, to me) tend to confuse
> this with the views/materialized views found in databases. We almost always
> talk about views in terms of the expected behavior of a query on the view.
> I am proposing to just call these different query types, since it's a more
> universally accepted terminology and IMO clearer.
>
> *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> Read-Optimized view only for MOR storage :* This one is probably the
> trickiest. Hudi was always designed with MOR in mind, even as we were
> working on COW storage, and consequently we named the pure parquet backed
> view Read-Optimized, hoping to name the parquet + avro based view
> Write-Optimized. However, we opted to name it Realtime to emphasize the
> data freshness aspect. In retrospect, the views should not have been named
> after their performance characteristics but rather after the classes of
> queries done on them and the guarantees for those (point #1 above).
> Moreover, once we have parquet embedded into the log format, the tradeoffs
> may not be the same anyway.
>
> So combining with the renaming proposed in #1, we would end up with the
> following:
>
> Copy-On-Write :
> [Old] Read-Optimized View => [New] Snapshot Query
> [Old] Incremental View => [New] Incremental Query
>
> Merge-On-Read:
> [Old] Realtime View => [New] Snapshot Query
> [Old] Incremental View => [New] Incremental Query
> [Old] Read-Optimized View => [New] Read-Optimized Query (since it is always
> read optimized compared to Snapshot query, at the cost of staler data)
>
> Both changes #1 & #2 could be simpler changes to just code references, docs
> and configs. We can support both strings for some time and deprecate the
> old ones eventually, since queries are stateless.
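
[Editor's sketch, not part of the original message.] One way to "support both strings for some time" is a small resolver that accepts the new query type names and maps the legacy view names onto them with a deprecation warning. This is only an illustration under stated assumptions: the identifier strings (`read_optimized_view`, `snapshot`, and so on) and the function name `resolve_query_type` are hypothetical and are not Hudi config values; only the old-to-new mapping itself comes from the table above.

```python
import warnings

# New query type names (hypothetical identifier strings).
NEW_QUERY_TYPES = {"snapshot", "incremental", "read_optimized"}

# (table type, legacy view name) -> new query type, following the table in
# the proposal. Note the legacy read-optimized view maps to a different
# query type depending on the table type.
LEGACY_VIEW_TO_QUERY = {
    ("COPY_ON_WRITE", "read_optimized_view"): "snapshot",
    ("COPY_ON_WRITE", "incremental_view"): "incremental",
    ("MERGE_ON_READ", "realtime_view"): "snapshot",
    ("MERGE_ON_READ", "incremental_view"): "incremental",
    ("MERGE_ON_READ", "read_optimized_view"): "read_optimized",
}


def resolve_query_type(table_type, name):
    """Return the new query type for either a new or a legacy name."""
    key = name.strip().lower()
    if key in NEW_QUERY_TYPES:
        # Proposal 2: the read-optimized query exists only for MOR tables.
        if key == "read_optimized" and table_type != "MERGE_ON_READ":
            raise ValueError("read_optimized is only defined for MERGE_ON_READ")
        return key
    try:
        resolved = LEGACY_VIEW_TO_QUERY[(table_type, key)]
    except KeyError:
        raise ValueError(f"unknown query type {name!r} for {table_type}") from None
    warnings.warn(
        f"{name!r} is deprecated; use {resolved!r} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return resolved
```

Because queries are stateless, nothing persisted needs migrating here; the legacy entries can simply be deleted from the mapping once the deprecation period ends.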
>
> *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* The name originated since the
> design was very similar to https://en.wikipedia.org/wiki/Copy-on-write
> filesystems & snapshotting, and we once hoped to push some of this logic
> into the storage itself, all in vain. But the name stuck, even though once
> we had MERGE_ON_READ the focus was often on merge costs etc., which the
> name COPY_ON_WRITE does not convey directly. I don't feel very strongly
> about this, and there is also a cost to changing it, since it's persisted
> inside hoodie.properties, and we will support both strings internally in
> code for backwards compatibility anyway.
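
[Editor's sketch, not part of the original message.] Since the table type is persisted in hoodie.properties, "supporting both strings internally" would mean normalizing the stored value on read. The following is a minimal sketch under assumptions: the property key `hoodie.table.type` is assumed here, the helper names are hypothetical, and the choice of which name is canonical is arbitrary (the thread ultimately kept COPY_ON_WRITE).

```python
CANONICAL_TABLE_TYPES = {"COPY_ON_WRITE", "MERGE_ON_READ"}

# Hypothetical alias table: the proposed name resolves to the existing one,
# so files written with either string are read the same way.
TABLE_TYPE_ALIASES = {"MERGE_ON_WRITE": "COPY_ON_WRITE"}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def read_table_type(props, key="hoodie.table.type"):
    """Return the canonical table type, accepting either spelling."""
    value = props[key].strip().upper()
    value = TABLE_TYPE_ALIASES.get(value, value)
    if value not in CANONICAL_TABLE_TYPES:
        raise ValueError(f"unknown table type: {value!r}")
    return value
```

With this shape, existing tables need no rewrite of their properties file; only the reader has to know both names.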
>
> Naming something is very hard (yes, try :)). I believe these changes will
> make the project simpler to understand for everyone out there. We also have
> tons of new people here, so I am also happy to let go, if it's already
> clear :)
>
> Please use the bullet number when you share your feedback so we know what
> the discussion is about.
>
> Thanks
> Vinoth
>
>
