Sure, I can update the RFC-13 cwiki if you agree with that.

Vinoth Chandar <vin...@apache.org> wrote on Tue, Jan 5, 2021 at 2:58 AM:
> Overall +1 on the idea.
>
> Danny, could we move this to the Apache cwiki if you don't mind?
> That's what we have been using for other RFC discussions.
>
> On Mon, Jan 4, 2021 at 1:22 AM Danny Chan <danny0...@apache.org> wrote:
>
> > The RFC-13 Flink writer has some bottlenecks that make it hard to adapt
> > to production:
> >
> > - The InstantGeneratorOperator has parallelism 1, which limits
> >   high-throughput consumption; because all the split inputs drain to a
> >   single thread, network IO comes under pressure too.
> > - The WriteProcessOperator handles inputs by partition, which means that
> >   within each partition write process the BUCKETs are written one by one,
> >   so file IO is too limited to keep up with high-throughput inputs.
> > - It buffers the data by checkpoints, which is hard to make robust for
> >   production; the checkpoint function is blocking and should not perform
> >   IO operations.
> > - The FlinkHoodieIndex is only valid within a per-job scope; it does not
> >   work for existing bootstrap data or across different Flink jobs.
> >
> > Thus, here I propose a new design for the Flink writer to solve these
> > problems [1]. Overall, the new design tries to remove the
> > single-parallelism operators and make the index more powerful and scalable.
> >
> > I plan to solve these bottlenecks incrementally (4 steps); there are
> > already some local POCs for these proposals.
> >
> > I'm looking forward to your feedback. Any suggestions are appreciated ~
> >
> > [1]
> > https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing