mwinkels commented on PR #32: URL: https://github.com/apache/flink-connector-gcp-pubsub/pull/32#issuecomment-3062000817
Hi @clmccart , We are trying to use this code with our Flink setup and we found a couple of issues. First, we are migrating to Flink v2.0 and had to adjust the code - mostly remove the classes that are no longer supported. This is what we found: 1. the streaming pull works fine and with Flink v2.0 job restarts are much faster, which leads to better performance overall. 2. the `PubSubSplitEnumerator` could be completely stateless. GCP PubSub does not support splits that can be referenced externally, so there is no need to maintain any state on the connector side for assigning splits to readers. 3. the `SubscriptionSplit`-state is similarly not really necessary, since there is only one subscription for a source and there is no meaningful split. The current implementation is harmful, since it stores the task-id in the split and when it recovers, there is a potential of re-creating subscriptions that are not part of an active reader. We remove the `uid` field from the `SubscriptionSplitProto` to fix this issue. 4. the `PubSubSplitReader` is more complex then it needs to be. A SplitReader instance is always bound to a single source task, so it only ever needs a single subscription for the single split that it will received. The `Map<Split, Subscription>` that is currently in the `PubSubSplitReader` should only ever hold one instance. We replaced it with an instance field of type `NotifyingPullSubscriber`. I hope these findings will help to improve this MR and get it to the finish line. Best, -Maarten Winkels P.S.: I'm not sure how to contribute changes back, since they are Flink v2 based. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
