[ https://issues.apache.org/jira/browse/FLINK-31192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-31192:
-----------------------------------
    Labels: pull-request-available  (was: )

> dataGen takes too long to initialize under sequence
> ---------------------------------------------------
>
>                 Key: FLINK-31192
>                 URL: https://issues.apache.org/jira/browse/FLINK-31192
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.17.0, 1.15.3, 1.16.1
>            Reporter: xzw0223
>            Assignee: xzw0223
>            Priority: Major
>              Labels: pull-request-available
>
> The SequenceGenerator preloads all sequence values in open. If the
> totalElement number is too large, this takes too long.
> [https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/datagen/SequenceGenerator.java#L91]
> The reason is that the capacity of the Deque is doubled whenever the
> current capacity is full, and each expansion requires an array copy, which
> is time-consuming.
>
> Here is what I propose:
> Do not preload the full sequence. Instead, generate one value each time
> next is called, which avoids the slow initialization caused by loading
> all of the data up front.
> Record the current sequence position in the checkpoint, and after an
> abnormal restart continue sending data from the recorded position to
> ensure fault tolerance.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
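
The proposal in the issue description (generate values lazily and checkpoint only the current position) could look roughly like the sketch below. This is an illustration, not the actual Flink patch: the class LazySequence and the methods snapshotPosition and restorePosition are hypothetical names, and the sketch leaves out Flink's state API and the splitting of the range across parallel subtasks.

```java
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch (not Flink's SequenceGenerator): emits the values
 * start..end (inclusive) one at a time instead of preloading them into a Deque.
 */
final class LazySequence {
    private final long start;  // first value of the sequence (inclusive)
    private final long end;    // last value of the sequence (inclusive)
    private long nextValue;    // next value to emit; the only state worth checkpointing

    LazySequence(long start, long end) {
        if (start > end) {
            throw new IllegalArgumentException("start must not exceed end");
        }
        if (end == Long.MAX_VALUE) {
            // Simplification for this sketch: keeps nextValue from overflowing.
            throw new IllegalArgumentException("end must be below Long.MAX_VALUE");
        }
        this.start = start;
        this.end = end;
        this.nextValue = start;
    }

    /** True while there are values left to emit. */
    boolean hasNext() {
        return nextValue <= end;
    }

    /** Computes the next value on demand; nothing is preloaded, so setup is O(1). */
    long next() {
        if (!hasNext()) {
            throw new NoSuchElementException("sequence exhausted");
        }
        return nextValue++;
    }

    /** Position to store in a checkpoint: the next value that has not been emitted yet. */
    long snapshotPosition() {
        return nextValue;
    }

    /** Resumes from a previously stored position, for example after a restart. */
    void restorePosition(long position) {
        if (position < start || position > end + 1) {
            throw new IllegalArgumentException("position outside the sequence range");
        }
        this.nextValue = position;
    }
}
```

With this shape, initialization cost no longer depends on the size of the range, and the checkpointed state is a single long rather than the remaining contents of a Deque.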