luoyuxia commented on PR #21645:
URL: https://github.com/apache/flink/pull/21645#issuecomment-1471204585

   Here, I would like to share my idea for it.
   First of all, we can pass an iterator to `SimpleSplitAssigner`; then, in 
method `SimpleSplitAssigner#getNext`, we advance the iterator and return a 
`FileSourceSplit`. That way, we won't need to drain the iterator and keep all 
splits in memory at the beginning.
   But it would require modifying a public interface and, worse, it would make 
`LocalityAwareSplitAssigner` impossible: if we only pass an iterator to 
`LocalityAwareSplitAssigner`, then `#getNext` will always return the next split 
by advancing the iterator, so we can't make use of locality awareness.
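   
   To illustrate the problem, here is a minimal sketch of such an 
iterator-backed assigner. `Split` and `LazySplitAssigner` are hypothetical 
stand-ins that I made up for this comment, not the real `FileSourceSplit` or 
`SimpleSplitAssigner`:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a file split: a path plus the hosts that store it. */
final class Split {
    final String path;
    final List<String> hosts;

    Split(String path, List<String> hosts) {
        this.path = path;
        this.hosts = hosts;
    }
}

/** Sketch of an assigner that pulls splits lazily from an iterator. */
final class LazySplitAssigner {
    private final Iterator<Split> splits;

    LazySplitAssigner(Iterator<Split> splits) {
        this.splits = splits;
    }

    /**
     * Returns the next split, or empty once the iterator is exhausted.
     * The hostname cannot influence the choice: without materializing the
     * remaining splits there is nothing to search for a local one.
     */
    Optional<Split> getNext(String hostname) {
        return splits.hasNext() ? Optional.of(splits.next()) : Optional.empty();
    }
}
```

   In this sketch, a request from `host-b` still receives the first split from 
the iterator, even if only a later split is stored on `host-b`.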
   
   Considering that locality awareness may be important and that this approach 
would also touch public interfaces, I think the more practical way is to modify 
the split enumeration/assignment logic in the Hive connector without touching 
any public interface.
   That is to say, we can make the following changes in the Hive connector:
   
   We have `StaticFileSplitEnumerator`, which gets all splits at once, but 
that's not good when there are very many splits. So I think we can introduce 
`StaticPartitionFileSplitEnumerator`. In `StaticPartitionFileSplitEnumerator`, 
we maintain `remainingPartitions` and anything else we may need.
   
   In method `handleSplitRequest`, we call `splitAssigner.getNext(hostname)` to 
assign a split. If it returns empty, we pop a partition from 
`remainingPartitions`, get the splits of that partition, create a new 
`splitAssigner` with those splits, and so on.
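   
   Roughly, the loop could look like the sketch below. It is only an 
illustration under assumptions: `Assigner`, `PartitionAtATimeEnumerator`, 
`loadSplits` and `newAssigner` are hypothetical names, and the real enumerator 
would also have to deal with the enumerator context, readers and checkpoints.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical minimal assigner contract, modeled on getNext(hostname). */
interface Assigner<S> {
    Optional<S> getNext(String hostname);
}

/** Sketch of the per-partition loop; P is a partition, S is a split. */
final class PartitionAtATimeEnumerator<P, S> {
    private final Deque<P> remainingPartitions;
    // Assumed helper: lists the splits of a single partition.
    private final Function<P, List<S>> loadSplits;
    // Assumed helper: builds an assigner (e.g. a locality-aware one) over those splits.
    private final Function<List<S>, Assigner<S>> newAssigner;
    // Starts out empty so that the first request triggers loading a partition.
    private Assigner<S> splitAssigner = hostname -> Optional.empty();

    PartitionAtATimeEnumerator(
            Collection<P> partitions,
            Function<P, List<S>> loadSplits,
            Function<List<S>, Assigner<S>> newAssigner) {
        this.remainingPartitions = new ArrayDeque<>(partitions);
        this.loadSplits = loadSplits;
        this.newAssigner = newAssigner;
    }

    /** Returns a split for the host, loading further partitions only when needed. */
    Optional<S> handleSplitRequest(String hostname) {
        Optional<S> next = splitAssigner.getNext(hostname);
        // Keep popping partitions until one yields a split or none are left.
        while (!next.isPresent() && !remainingPartitions.isEmpty()) {
            P partition = remainingPartitions.poll();
            splitAssigner = newAssigner.apply(loadSplits.apply(partition));
            next = splitAssigner.getNext(hostname);
        }
        return next;
    }
}
```

   Only the splits of one partition are held in memory at a time, and within 
that partition the assigner still sees all splits, so it can stay locality 
aware.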
   
   `StaticPartitionFileSplitEnumerator` is more like something between 
`ContinuousHivePendingSplitsCheckpoint` and `StaticFileSplitEnumerator`.
   
   Also, please remember to implement a `SplitsCheckpoint` for 
`StaticPartitionFileSplitEnumerator`, like 
`ContinuousHivePendingSplitsCheckpoint`/`PendingSplitsCheckpoint`, to make sure 
it can restore correctly.
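   
   As a rough idea of what such a checkpoint might need to capture, here is a 
sketch. `PartitionSplitsCheckpoint` and its fields are hypothetical; the actual 
state and its serializer would have to follow the existing checkpoint classes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Hypothetical checkpoint state; P is a partition, S is a split. */
final class PartitionSplitsCheckpoint<P, S> {
    // Partitions whose splits have not been enumerated yet.
    private final List<P> remainingPartitions;
    // Splits of the current partition that have not been assigned yet.
    private final List<S> unassignedSplits;

    PartitionSplitsCheckpoint(Collection<P> remainingPartitions, Collection<S> unassignedSplits) {
        // Defensive copies, so later assignments do not change the snapshot.
        this.remainingPartitions =
                Collections.unmodifiableList(new ArrayList<>(remainingPartitions));
        this.unassignedSplits =
                Collections.unmodifiableList(new ArrayList<>(unassignedSplits));
    }

    List<P> getRemainingPartitions() {
        return remainingPartitions;
    }

    List<S> getUnassignedSplits() {
        return unassignedSplits;
    }
}
```

   On restore, the enumerator would rebuild its assigner from the unassigned 
splits and continue with the remaining partitions.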
   
   For implementation, I would advise opening a pull request first to 
introduce `StaticPartitionFileSplitEnumerator`, and then making `HiveSource` 
use `StaticPartitionFileSplitEnumerator`.
   
   WDYT? @WencongLiu 

