JeremyXin opened a new issue, #8451: URL: https://github.com/apache/seatunnel/issues/8451
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

When reading files with HdfsFile as the source, the output log showed that some subtasks were assigned multiple files while the remaining subtasks were assigned none. As a result, some subtasks sit idle and process no file reads while others must process several, which degrades performance. The log output (with sensitive HDFS path information removed) is as follows:

```
2025-01-02 17:04:34,572 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 0 is assigned to [hdfs://xxx,hdfs://xxx,hdfs://xxx]
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 1 is assigned to []
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [2]
2025-01-02 17:04:34,574 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 2 is assigned to []
... (all assigned to [])
2025-01-02 17:04:34,577 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [9]
2025-01-02 17:04:34,577 INFO [s.c.s.f.s.BaseFileSourceReader] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=50002}] - Closed the bounded File source
2025-01-02 17:04:34,578 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 9 is assigned to [hdfs://xxx]
```

After analyzing the source code, I found that the existing allocation algorithm distributes files effectively at random, assigning each file to a subtask by the file path's hashcode modulo the parallelism. In my opinion, would it be possible to use a round-robin file allocation algorithm instead, so that the file load on each SubTask is balanced and processing performance improves?

### Usage Scenario

This feature can be used to improve file-processing performance when the connector is a file connector.

### Related issues

_No response_

### Are you willing to submit a PR?

- [X] Yes, I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
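To illustrate the difference between the two strategies, here is a minimal standalone sketch, not the actual `FileSourceSplitEnumerator` code: `assignByHash` mimics hashcode-modulo-parallelism assignment (which can leave some readers empty when hashes collide on the same bucket), while `assignRoundRobin` shows the proposed scheme, whose bucket sizes differ by at most one. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAssignmentSketch {

    // Current behavior as described in the issue: the owning subtask is
    // derived from the path's hashCode, so the distribution can be skewed.
    static List<List<String>> assignByHash(List<String> paths, int parallelism) {
        List<List<String>> buckets = newBuckets(parallelism);
        for (String path : paths) {
            int owner = (path.hashCode() & Integer.MAX_VALUE) % parallelism;
            buckets.get(owner).add(path);
        }
        return buckets;
    }

    // Proposed behavior: deal files out in order, one per subtask, so
    // every subtask receives either floor(n/p) or ceil(n/p) files.
    static List<List<String>> assignRoundRobin(List<String> paths, int parallelism) {
        List<List<String>> buckets = newBuckets(parallelism);
        for (int i = 0; i < paths.size(); i++) {
            buckets.get(i % parallelism).add(paths.get(i));
        }
        return buckets;
    }

    private static List<List<String>> newBuckets(int parallelism) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            buckets.add(new ArrayList<>());
        }
        return buckets;
    }
}
```

With round-robin assignment, the imbalance shown in the log above (one subtask holding three files while others hold none) cannot occur: for any input, the largest and smallest buckets differ by at most one file.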
