Hi Yuxia,
Thanks for your feedback.
Comments inline
Le 31/03/2023 à 04:21, yuxia a écrit :
Hi, Etienne.
Thanks for Etienne for sharing this article. I really like it and learn much
from it.
=> Glad it was useful, that was precisely the point :)
I'd like to raise some questions about implementing batch source. Welcome devs
to share insights about them.
The first question is how to generate splits:
As the article mentioned:
"Whenever possible, it is preferable to generate the splits lazily, meaning that
each time a reader asks the enumerator for a split, the enumerator generates one on
demand and assigns it to the reader."
I think it maybe not for all cases. In some cases, generating split may be time
counsuming, then it may be better to generate a batch of splits on demand to
amortize the expense.
But it then raises another question, how many splits should be generated in a
batch, too many maywell cause OOM, too less may not make good use of batch
generating splits.
To solve it, I think maybe we can provide a configuration to make user to
configure how many splits should be generated in a batch.
What's your opinion on it. Have you ever encountered this problem in your
implementation?
=> I agree, lazy splits is not the only way. I've mentioned in the
article that batch generation is another in case of high split
generation cost, thanks for the suggestion. During the implementation I
didn't have this problem as generating a split was not costly, the only
costly processing was the splits preparation. It was run asynchronously
and only once, then each split generation was straightforward. That
being said, during development, I had OOM risks in the size and number
of splits. For the number of splits, lazy generation solved it as no
list of splits was stored in the ennumerator apart from the splits to
reassign. For the size of split I used a user provided max split memory
size similar to what you suggest here. In the batch generation case, we
could allow the user to set a max memory size for the batch : number of
splits in batch looks more dangerous to me if we don't know the size of
a split but if we are talking about storing the split objects and not
their content then that is ok. IMO, memory size is more clear for the
user as it is linked to the memory of a task manager.
The second question is how to assign splits:
What's your split assign stratgy?
=> the naïve one: a reader asks for a split, the enumerator receives the
request, generates a split and assigns it to the demanding reader.
In flink, we provide `LocalityAwareSplitAssigner` to make use of locality to
assign split to reader.
=> well, it has interest only when the data-backend cluster nodes can be
co-localized with Flink task managers right? That would rarely be the
case as clusters seem to be separated most of the time to use the
maximum available CPU (at least for CPU-band workloads) no ?
But it may not perfert for the case of failover
=> Agree: it would require costly shuffle to keep the co-location after
restoration and this cost would not be balanced by the gain raised by
co-locality (mainly avoiding network use) I think.
for which we intend to introduce another split assign strategy[1].
But I do think it should be configurable to enable advanced user to decide
which assign stratgy to use.
=> when you say the "user" I guess you mean user of the source not user
of the dev framework (implementor of the source). I think that it should
be configurable indeed as the user is the one knowing the repartition of
the partitions of the backend data.
Best
Etienne
Welcome other devs to share opinion.
[1]: https://issues.apache.org/jira/browse/FLINK-31065
Also as for split assigner .
Best regards,
Yuxia
----- 原始邮件 -----
发件人: "Etienne Chauchot" <echauc...@apache.org>
收件人: "dev" <dev@flink.apache.org>
抄送: "Chesnay Schepler" <ches...@apache.org>
发送时间: 星期四, 2023年 3 月 30日 下午 10:36:39
主题: [blog article] Howto create a batch source with the new Source framework
Hi all,
After creating the Cassandra source connector (thanks Chesnay for the
review!), I wrote a blog article about how to create a batch source with
the new Source framework [1]. It gives field feedback on how to
implement the different components.
I felt it could be useful to people interested in contributing or
migrating connectors.
=> Can you give me your opinion ?
=> I think it could be useful to post the article to Flink official blog
also if you agree.
=> Same remark on my previous article [2]: what about publishing it to
Flink official blog ?
[1]https://echauchot.blogspot.com/2023/03/flink-howto-create-batch-source-with.html
[2]https://echauchot.blogspot.com/2022/11/flink-howto-migrate-real-life-batch.html
Best
Etienne