Re: Re: [DISCUSS] New Flink Writer Proposal

Vinoth Chandar Thu, 07 Jan 2021 00:40:27 -0800

+1 on Gary's comment. I would like for us to integrate as a Sink into Flink
as well, i.e you can write custom code and then sink data into hudi.
In addition to the app, that we have going.


I have not spent enough time on this myself. So I'll spend some time
understanding the tradeoffs here fully, before casting my opinion.

I also think we should design it as thoroughly as possible, taking the time
needed, with things that Hudi enables, that the competitors you listed.
Things like indexing, streaming reads - using Hudi as the state backend.
These are unique things we can leverage/adapt.
Just calling these out, so we are not fully focussed on the table format
layer alone.
Hudi has a rich set of table services already built out (cleaning,
compaction, clustering, ... .. ) . How do we enable all of this and make
them as
prod ready as Spark.





On Thu, Jan 7, 2021 at 12:32 AM Gary Li <[email protected]> wrote:

> Hi all,
>
> IIUC the current flink writer is like an app, just like the delta
> streamer. If we want to build another Flink writer, we can still share the
> same flink client right? Does the flink client also have to use the new
> feature only available on Flink 1.12?
>
> Thanks,
> Gary Li
> ________________________________
> From: Danny Chan <[email protected]>
> Sent: Thursday, January 7, 2021 10:19 AM
> To: [email protected] <[email protected]>
> Subject: Re: Re: [DISCUSS] New Flink Writer Proposal
>
> Thanks vino yang ~
>
> IMO, we should not put much put too much energy for current Flink writer,
> it is not production-ready in the long run. There are so many features need
> to add/support for the Flink write/read(MOR write, COR read, MOR read, the
> new index), we should focus on one version first, make it robust.
>
> I really hope that we can work together to make the writer production-ready
> as soon as possible, it is competitive that we have competitors like Apache
> Iceberg and Delta lake, so from this perspective, there is no benefit to be
> compatible with the current version writer.
>
> My idea is that i propose the new infrastructure first as quickly as
> possible(the basic pipeline, the test framework.), and then we can work
> together for the new version (MOR write, COR read, MOR read, the new
> index), we better not distract from promote the old writer.
>
> What do you think?
>
> vino yang <[email protected]> 于2021年1月6日周三 下午2:14写道：
>
> > Hi Danny,
> >
> > As we discussed in the doc, we should agree on if we should be compatible
> > with the version less than Flink 1.11/1.12.
> >
> > We all know that there are some bottlenecks in the current plan. You
> > proposed some improvements, yes it is great, but it radically uses the
> > newer features provided by Flink. It is a pity that some users of old
> > versions of Flink have no way to benefit from these features.
> >
> > The information I can provide is that some users have already used the
> > current Flink write client or its improved version in a production
> > environment. For example, SF Technology, and the Flink versions they use
> > are 1.8.x and 1.10.x.
> >
> > Therefore, I personally suggest that there are two options:
> >
> > 1) The new design takes into account users of lower versions as much as
> > possible and maintains a client version;
> > 2) The new design is based on the features of the new version and evolves
> > separately from the old version(we also have a plan to optimize the
> current
> > implementation), but the public abstraction can be reused. I think it is
> > not impossible to maintain multiple versions. Flink used to support 4+
> > versions (0.8.2, 0.9, 0.10, 0.11, universal connector) for Kafka
> Connector,
> > but they share the same code base.
> >
> > Any thoughts and opinions are welcome and appreciated.
> >
> > Best,
> > Vino
> >
> > vino yang <[email protected]> 于2021年1月6日周三 下午1:37写道：
> >
> > > Hi Danny,
> > >
> > > You should have cwiki edit permission now.
> > > Any problems let me know.
> > >
> > > Best,
> > > Vino
> > >
> > > Danny Chan <[email protected]> 于2021年1月6日周三 下午12:05写道：
> > >
> > >> Sorry ~
> > >>
> > >> Forget to say that my Confluence ID is danny0405.
> > >>
> > >> It would be nice if any of you can help on this.
> > >>
> > >> Best,
> > >> Danny Chan
> > >>
> > >> Danny Chan <[email protected]> 于2021年1月6日周三 下午12:00写道：
> > >>
> > >> > Hi, can someone give me the CWIKI permission so that i can update
> the
> > >> > design details to that (maybe as a new RFC though ~).
> > >> >
> > >> > wangxianghu <[email protected]> 于2021年1月5日周二 下午2:43写道：
> > >> >
> > >> >> + 1, Thanks Danny!
> > >> >> I believe this new feature OperatorConrdinator in flink-1.11 will
> > help
> > >> >> improve the current implementation
> > >> >>
> > >> >> Best,
> > >> >>
> > >> >> XianghuWang
> > >> >>
> > >> >> At 2021-01-05 14:17:37, "vino yang" <[email protected]> wrote:
> > >> >> >Hi,
> > >> >> >
> > >> >> >Sharing more details, the OperatorConrdinator is the part of the
> new
> > >> Data
> > >> >> >Source API(Beta) involved in the Flink 1.11's release note[1].
> > >> >> >
> > >> >> >Flink 1.11 was released only about half a year ago. The design of
> > >> RFC-13
> > >> >> >began at the end of 2019, and most of the implementation was
> > completed
> > >> >> when
> > >> >> >Flink 1.11 was released.
> > >> >> >
> > >> >> >I believe that the production environment of many large companies
> > has
> > >> not
> > >> >> >been upgraded so quickly (As far as our company is concerned, we
> > still
> > >> >> have
> > >> >> >some jobs running on flink release packages below 1.9).
> > >> >> >
> > >> >> >So, maybe we need to find a mechanism to benefit both new and old
> > >> users.
> > >> >> >
> > >> >> >[1]:
> > >> >> >
> > >> >>
> > >>
> >
> https://apac01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fflink.apache.org%2Fnews%2F2020%2F07%2F06%2Frelease-1.11.0.html%23new-data-source-api-beta&amp;data=04%7C01%7C%7C13a290f5a7384113903908d8b2b2b61c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637455827910723007%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1E2aobHPbYSUouGLmcTE2Lt225%2F9DFWxXTjz5oxtqR0%3D&amp;reserved=0
> > >> >> >
> > >> >> >Best,
> > >> >> >Vino
> > >> >> >
> > >> >> >vino yang <[email protected]> 于2021年1月5日周二 下午12:30写道：
> > >> >> >
> > >> >> >> Hi,
> > >> >> >>
> > >> >> >> +1, thank you Danny for introducing this new feature
> > >> >> >> (OperatorCoordinator)[1] of Flink in the recently latest
> version.
> > >> >> >> This feature is very helpful for improving the implementation
> > >> >> mechanism of
> > >> >> >> Flink write-client.
> > >> >> >>
> > >> >> >> But this feature is only available after Flink 1.11. Before
> that,
> > >> there
> > >> >> >> was no good way to realize the mechanism of task upstream and
> > >> >> downstream
> > >> >> >> coordination through the public API provided by Flink.
> > >> >> >> I just have a concern, whether we need to take into account the
> > >> users
> > >> >> of
> > >> >> >> earlier versions (less than Flink 1.11).
> > >> >> >>
> > >> >> >> [1]:
> https://apac01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FFLINK-15099&amp;data=04%7C01%7C%7C13a290f5a7384113903908d8b2b2b61c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637455827910723007%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=vKn0x0ePL0rmJk1H%2BZgHevVBjNIdXrbEI8srchOl1c4%3D&amp;reserved=0
> > >> >> >>
> > >> >> >> Best,
> > >> >> >> Vino
> > >> >> >>
> > >> >> >> Gary Li <[email protected]> 于2021年1月5日周二 上午10:40写道：
> > >> >> >>
> > >> >> >>> Hi Danny,
> > >> >> >>>
> > >> >> >>> Thanks for the proposal. I'd recommend starting a new RFC.
> RFC-13
> > >> was
> > >> >> >>> done and including some work about the refactoring so we should
> > >> mark
> > >> >> it as
> > >> >> >>> completed. Looking forward to having further discussion on the
> > RFC.
> > >> >> >>>
> > >> >> >>> Best,
> > >> >> >>> Gary Li
> > >> >> >>> ________________________________
> > >> >> >>> From: Danny Chan <[email protected]>
> > >> >> >>> Sent: Tuesday, January 5, 2021 10:22 AM
> > >> >> >>> To: [email protected] <[email protected]>
> > >> >> >>> Subject: Re: [DISCUSS] New Flink Writer Proposal
> > >> >> >>>
> > >> >> >>> Sure, i can update the RFC-13 cwiki if you agree with that.
> > >> >> >>>
> > >> >> >>> Vinoth Chandar <[email protected]> 于2021年1月5日周二 上午2:58写道：
> > >> >> >>>
> > >> >> >>> > Overall +1 on the idea.
> > >> >> >>> >
> > >> >> >>> > Danny, could we move this to the apache cwiki if you don't
> > mind?
> > >> >> >>> > That's what we have been using for other RFC discussions.
> > >> >> >>> >
> > >> >> >>> > On Mon, Jan 4, 2021 at 1:22 AM Danny Chan <
> > [email protected]>
> > >> >> wrote:
> > >> >> >>> >
> > >> >> >>> > > The RFC-13 Flink writer has some bottlenecks that make it
> > hard
> > >> to
> > >> >> >>> adapter
> > >> >> >>> > > to production:
> > >> >> >>> > >
> > >> >> >>> > > - The InstantGeneratorOperator is parallelism 1, which is a
> > >> limit
> > >> >> for
> > >> >> >>> > > high-throughput consumption; because all the split inputs
> > drain
> > >> >> to a
> > >> >> >>> > single
> > >> >> >>> > > thread, the network IO would gains pressure too
> > >> >> >>> > > - The WriteProcessOperator handles inputs by partition,
> that
> > >> >> means,
> > >> >> >>> > within
> > >> >> >>> > > each partition write process, the BUCKETs are written one
> by
> > >> one,
> > >> >> the
> > >> >> >>> > FILE
> > >> >> >>> > > IO is limit to adapter to high-throughput inputs
> > >> >> >>> > > - It buffers the data by checkpoints, which is too hard to
> be
> > >> >> robust
> > >> >> >>> for
> > >> >> >>> > > production, the checkpoint function is blocking and should
> > not
> > >> >> have IO
> > >> >> >>> > > operations.
> > >> >> >>> > > - The FlinkHoodieIndex is only valid for a per-job scope,
> it
> > >> does
> > >> >> not
> > >> >> >>> > work
> > >> >> >>> > > for existing bootstrap data or for different Flink jobs
> > >> >> >>> > >
> > >> >> >>> > > Thus, here I propose a new design for the Flink writer to
> > solve
> > >> >> these
> > >> >> >>> > > problems[1]. Overall, the new design tries to remove the
> > single
> > >> >> >>> > parallelism
> > >> >> >>> > > operators and make the index more powerful and scalable.
> > >> >> >>> > >
> > >> >> >>> > > I plan to solve these bottlenecks incrementally (4 steps),
> > >> there
> > >> >> are
> > >> >> >>> > > already some local POCs for these proposals.
> > >> >> >>> > >
> > >> >> >>> > > I'm looking forward to your feedback. Any suggestions are
> > >> >> appreciated
> > >> >> >>> ~
> > >> >> >>> > >
> > >> >> >>> > > [1]
> > >> >> >>> > >
> > >> >> >>> > >
> > >> >> >>> >
> > >> >> >>>
> > >> >>
> > >>
> >
> https://apac01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I%2Fedit%3Fusp%3Dsharing&amp;data=04%7C01%7C%7C13a290f5a7384113903908d8b2b2b61c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637455827910723007%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8B3RQOYq2Yn1u7ndq18FayMhhdVjCMHPt96PHRn3JqE%3D&amp;reserved=0
> > >> >> >>> > >
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >> >
> > >>
> > >
> >
>

Re: Re: [DISCUSS] New Flink Writer Proposal

Reply via email to