Thanks Etienne for the detailed explanation.

Best regards,
Yuxia
----- Original Message -----
From: "Etienne Chauchot" <[email protected]>
To: "dev" <[email protected]>
Sent: Friday, March 31, 2023 9:08:36 PM
Subject: Re: [blog article] Howto create a batch source with the new Source framework

Hi Yuxia,

Thanks for your feedback. Comments inline.

On 31/03/2023 at 04:21, yuxia wrote:
> Hi, Etienne.
>
> Thanks to Etienne for sharing this article. I really like it and learned a lot
> from it.

=> Glad it was useful, that was precisely the point :)

> I'd like to raise some questions about implementing a batch source. Welcome
> devs to share insights about them.
>
> The first question is how to generate splits:
> As the article mentioned:
> "Whenever possible, it is preferable to generate the splits lazily, meaning
> that each time a reader asks the enumerator for a split, the enumerator
> generates one on demand and assigns it to the reader."
> I think this may not hold for all cases. In some cases, generating a split may
> be time-consuming, so it may be better to generate a batch of splits on
> demand to amortize the expense.
> But that raises another question: how many splits should be generated in a
> batch? Too many may well cause OOM; too few may not make good use of
> batch-generating splits.
> To solve it, maybe we can provide a configuration option that lets the user
> set how many splits should be generated in a batch.
> What's your opinion on this? Have you ever encountered this problem in your
> implementation?

=> I agree, lazy split generation is not the only way. I've mentioned in the article that batch generation is another option in case of high split generation cost, thanks for the suggestion. During the implementation I didn't have this problem, as generating a split was not costly; the only costly processing was the splits preparation. It was run asynchronously and only once, then each split generation was straightforward. That being said, during development, I had OOM risks related to the size and number of splits.
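The lazy pattern discussed above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Cassandra connector code: the `LazySplitEnumerator` class and its methods are hypothetical stand-ins mirroring the shape of Flink's `SplitEnumerator#handleSplitRequest` and `#addSplitsBack` callbacks, with splits reduced to plain strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical stand-in for a Flink-style split enumerator: splits are
// created one at a time when a reader asks for one, so the enumerator never
// holds a full list of splits, only those handed back for reassignment.
class LazySplitEnumerator {
    private final int totalSplits;   // known upfront from the preparation step
    private int nextSplitId = 0;     // next split to generate on demand
    // Splits returned by a failed reader, to be reassigned first.
    private final Deque<String> toReassign = new ArrayDeque<>();

    LazySplitEnumerator(int totalSplits) {
        this.totalSplits = totalSplits;
    }

    // Called when a reader requests a split (mirrors handleSplitRequest).
    Optional<String> handleSplitRequest(int readerId) {
        if (!toReassign.isEmpty()) {
            // Reassign recovered splits before generating new ones.
            return Optional.of(toReassign.poll());
        }
        if (nextSplitId >= totalSplits) {
            // Empty result signals "no more splits" to the requesting reader.
            return Optional.empty();
        }
        // Generated lazily, on demand; nothing else is kept in memory.
        return Optional.of("split-" + nextSplitId++);
    }

    // Splits handed back after a reader failure are stored for reassignment.
    void addSplitsBack(String split) {
        toReassign.add(split);
    }
}
```

The point of the sketch is the memory profile: only `toReassign` ever stores split objects, which matches the observation above that lazy generation removed the OOM risk tied to the number of splits.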
For the number of splits, lazy generation solved it, as no list of splits was stored in the enumerator apart from the splits to reassign. For the size of a split, I used a user-provided max split memory size, similar to what you suggest here. In the batch generation case, we could allow the user to set a max memory size for the batch: configuring the number of splits per batch looks more dangerous to me if we don't know the size of a split, but if we are talking about storing the split objects and not their content, then that is ok. IMO, a memory size is clearer for the user, as it is linked to the memory of a task manager.

> The second question is how to assign splits:
> What's your split assignment strategy?

=> the naïve one: a reader asks for a split, the enumerator receives the request, generates a split and assigns it to the requesting reader.

> In Flink, we provide `LocalityAwareSplitAssigner` to make use of locality to
> assign splits to readers.

=> well, it is only of interest when the data-backend cluster nodes can be co-located with Flink task managers, right? That would rarely be the case, as the clusters seem to be separated most of the time to use the maximum available CPU (at least for CPU-bound workloads), no?

> But it may not be perfect for the case of failover

=> Agree: it would require a costly shuffle to keep the co-location after restoration, and this cost would not be balanced by the gain from co-locality (mainly avoiding network use), I think.

> for which we intend to introduce another split assignment strategy [1].
> But I do think it should be configurable to enable advanced users to decide
> which assignment strategy to use.

=> when you say the "user", I guess you mean the user of the source, not the user of the dev framework (the implementor of the source). I think it should be configurable indeed, as the user is the one who knows the repartition of the partitions of the backend data.

Best

Etienne

> Welcome other devs to share opinions.
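The memory-budget batching idea discussed above could be sketched as follows. This is only an illustration of the arithmetic, under assumed names: `maxBatchMemoryBytes` would be the user-provided configuration value and `estimatedSplitSizeBytes` a per-split size estimate the enumerator would have to supply; neither exists as such in Flink.

```java
// Sketch: derive the number of splits per batch from a user-provided memory
// budget instead of a raw split count, so the knob the user tunes maps
// directly onto task manager memory rather than an opaque count.
class BatchSizer {
    static int splitsPerBatch(long maxBatchMemoryBytes, long estimatedSplitSizeBytes) {
        if (estimatedSplitSizeBytes <= 0) {
            throw new IllegalArgumentException("split size estimate must be positive");
        }
        // At least one split per batch, otherwise the source cannot make progress
        // when a single split already exceeds the budget.
        return (int) Math.max(1, maxBatchMemoryBytes / estimatedSplitSizeBytes);
    }
}
```

For example, a 64 MB batch budget with splits estimated at 8 MB each would yield batches of 8 splits, and the floor of 1 keeps the source progressing even with an over-tight budget.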
> [1]: https://issues.apache.org/jira/browse/FLINK-31065
>
> Also as for split assigner.
>
> Best regards,
> Yuxia
>
> ----- Original Message -----
> From: "Etienne Chauchot" <[email protected]>
> To: "dev" <[email protected]>
> Cc: "Chesnay Schepler" <[email protected]>
> Sent: Thursday, March 30, 2023 10:36:39 PM
> Subject: [blog article] Howto create a batch source with the new Source framework
>
> Hi all,
>
> After creating the Cassandra source connector (thanks Chesnay for the
> review!), I wrote a blog article about how to create a batch source with
> the new Source framework [1]. It gives field feedback on how to
> implement the different components.
>
> I felt it could be useful to people interested in contributing or
> migrating connectors.
>
> => Can you give me your opinion?
>
> => I think it could be useful to post the article to the Flink official blog
> too, if you agree.
>
> => Same remark on my previous article [2]: what about publishing it to
> the Flink official blog?
>
> [1] https://echauchot.blogspot.com/2023/03/flink-howto-create-batch-source-with.html
> [2] https://echauchot.blogspot.com/2022/11/flink-howto-migrate-real-life-batch.html
>
> Best
>
> Etienne
