The RFC-13 Flink writer has some bottlenecks that make it hard to adapter to production:
- The InstantGeneratorOperator is parallelism 1, which is a limit for high-throughput consumption; because all the split inputs drain to a single thread, the network IO would gains pressure too - The WriteProcessOperator handles inputs by partition, that means, within each partition write process, the BUCKETs are written one by one, the FILE IO is limit to adapter to high-throughput inputs - It buffers the data by checkpoints, which is too hard to be robust for production, the checkpoint function is blocking and should not have IO operations. - The FlinkHoodieIndex is only valid for a per-job scope, it does not work for existing bootstrap data or for different Flink jobs Thus, here I propose a new design for the Flink writer to solve these problems[1]. Overall, the new design tries to remove the single parallelism operators and make the index more powerful and scalable. I plan to solve these bottlenecks incrementally (4 steps), there are already some local POCs for these proposals. I'm looking forward to your feedback. Any suggestions are appreciated ~ [1] https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing