luoyuxia commented on PR #21645: URL: https://github.com/apache/flink/pull/21645#issuecomment-1471204585
Here, I would like to share my idea for it. First of all, we could pass an iterator to `SimpleSplitAssigner`; then, in `SimpleSplitAssigner#getNext`, we advance the iterator and return a `FileSourceSplit`. That way, we won't need to drain the iterator up front and keep all splits in memory from the beginning. But it would require modifying a public interface and, worse, it would make `LocalityAwareSplitAssigner` impossible: if we only pass an iterator to `LocalityAwareSplitAssigner`, then `#getNext` will always return the next split by advancing the iterator, so we can't make use of locality awareness.

Considering that locality awareness may be important and that this approach would also touch public interfaces, I think the more practical way is to modify the split enumeration/assignment logic in the Hive connector without touching any public interface. That is, we can make the following changes in the Hive connector:

We have `StaticFileSplitEnumerator`, which gets all splits at once, but that is not good when there are very many splits. So I think we can introduce `StaticPartitionFileSplitEnumerator`. In `StaticPartitionFileSplitEnumerator`, we maintain `remainingPartitions` and anything else we may need. In `handleSplitRequest`, we call `splitAssigner.getNext(hostname)` to assign a split; if it returns empty, we pop a partition from `remainingPartitions`, get the splits of that partition, create a new splitAssigner with those splits, and so on. `StaticPartitionFileSplitEnumerator` is something between `ContinuousHivePendingSplitsCheckpoint` and `StaticFileSplitEnumerator`.

Also, please remember to implement a `SplitsCheckpoint` for `StaticFileSplitEnumerator`, like `ContinuousHivePendingSplitsCheckpoint`/`PendingSplitsCheckpoint`, to make sure it can restore correctly when using `StaticPartitionFileSplitEnumerator`.
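To make the idea more concrete, here is a rough sketch of the partition-by-partition logic, written with plain Java stand-ins rather than the real Flink classes. Everything in it is an assumption for illustration: `LazyPartitionEnumeratorSketch`, the nested `Assigner` and `Enumerator` classes, the `splitLister` function, and the use of `String` for partitions and splits are all hypothetical, and the stub assigner ignores the hostname where a locality-aware assigner would use it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical, simplified stand-ins; none of these are actual Flink types.
final class LazyPartitionEnumeratorSketch {

    /** Stand-in for a split assigner that holds the splits of ONE partition. */
    static final class Assigner {
        private final Deque<String> splits;

        Assigner(List<String> splits) {
            this.splits = new ArrayDeque<>(splits);
        }

        /** A locality-aware assigner would pick by hostname; this stub ignores it. */
        Optional<String> getNext(String hostname) {
            return Optional.ofNullable(splits.pollFirst());
        }

        List<String> remainingSplits() {
            return new ArrayList<>(splits);
        }
    }

    /** Stand-in for the proposed enumerator: splits are listed one partition at a time. */
    static final class Enumerator {
        private final Deque<String> remainingPartitions;
        private final Function<String, List<String>> splitLister;
        private Assigner assigner = new Assigner(List.of());

        Enumerator(List<String> partitions, Function<String, List<String>> splitLister) {
            this.remainingPartitions = new ArrayDeque<>(partitions);
            this.splitLister = splitLister;
        }

        Optional<String> handleSplitRequest(String hostname) {
            while (true) {
                Optional<String> next = assigner.getNext(hostname);
                if (next.isPresent()) {
                    return next;
                }
                // Current assigner is exhausted: move on to the next partition, if any.
                String partition = remainingPartitions.pollFirst();
                if (partition == null) {
                    return Optional.empty(); // no more partitions, no more splits
                }
                // Only this partition's splits are materialized in memory.
                assigner = new Assigner(splitLister.apply(partition));
            }
        }

        /** State a checkpoint would need: untouched partitions ... */
        List<String> snapshotRemainingPartitions() {
            return new ArrayList<>(remainingPartitions);
        }

        /** ... plus the not-yet-assigned splits of the partition in progress. */
        List<String> snapshotPendingSplits() {
            return assigner.remainingSplits();
        }
    }
}
```

The two `snapshot...` methods hint at what the checkpoint would have to carry: the partitions not yet expanded and the unassigned splits of the current partition. Since each assigner still sees all splits of one partition at once, a locality-aware assigner could be plugged in per partition.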
For the implementation, I would advise opening a pull request first to introduce `StaticPartitionFileSplitEnumerator`, and then making HiveSource use `StaticPartitionFileSplitEnumerator`. WDYT? @WencongLiu