JeremyXin opened a new issue, #8451:
URL: https://github.com/apache/seatunnel/issues/8451

   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   When reading files with HdfsFile as the Source, I found from the output log 
that some subtasks were assigned multiple files while the remaining subtasks 
were assigned none. As a result, some subtasks sit idle and process no file 
reads, while others must process multiple file reads, which degrades 
performance. The log output, with sensitive HDFS path information removed, is 
as follows:
   
   ```
   2025-01-02 17:04:34,572 INFO  [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 0 is assigned to [hdfs://xxx,hdfs://xxx,hdfs://xxx]
   2025-01-02 17:04:34,573 INFO  [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader
   2025-01-02 17:04:34,573 INFO  [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 1 is assigned to []
   2025-01-02 17:04:34,573 INFO  [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [2]
   2025-01-02 17:04:34,574 INFO  [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 2 is assigned to []
   ... (all assigned to [])
   2025-01-02 17:04:34,577 INFO  [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [9]
   2025-01-02 17:04:34,577 INFO  [s.c.s.f.s.BaseFileSourceReader] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=50002}] - Closed the bounded File source
   2025-01-02 17:04:34,578 INFO  [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 9 is assigned to [hdfs://xxx]
   ```
       
       
       
       After analyzing the source code, I found that the existing allocation 
algorithm assigns each file to a subtask based on the file path's hashCode 
modulo the parallelism, which distributes files effectively at random. In my 
opinion, could a round-robin file allocation algorithm be used instead, so 
that the file load of each SubTask is balanced and processing performance 
improves?
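   To illustrate the difference, here is a minimal sketch (not SeaTunnel's 
actual `FileSourceSplitEnumerator` code; the class and method names are 
hypothetical) contrasting hashCode-modulo assignment, which can pile several 
files onto one subtask while leaving others empty, with round-robin 
assignment, where subtask loads differ by at most one file:

   ```java
   import java.util.*;

   public class SplitAssignmentDemo {

       // Resembles the hash-based behavior described above: the owner of each
       // file is derived from the path's hashCode, so multiple files can
       // collide on the same subtask while other subtasks receive nothing.
       static Map<Integer, List<String>> assignByHash(List<String> files, int parallelism) {
           Map<Integer, List<String>> owners = new HashMap<>();
           for (String file : files) {
               int owner = (file.hashCode() & Integer.MAX_VALUE) % parallelism;
               owners.computeIfAbsent(owner, k -> new ArrayList<>()).add(file);
           }
           return owners;
       }

       // The proposed round-robin strategy: files are dealt out in order, so
       // every subtask gets either floor(n/p) or ceil(n/p) files.
       static Map<Integer, List<String>> assignRoundRobin(List<String> files, int parallelism) {
           Map<Integer, List<String>> owners = new HashMap<>();
           for (int i = 0; i < files.size(); i++) {
               owners.computeIfAbsent(i % parallelism, k -> new ArrayList<>()).add(files.get(i));
           }
           return owners;
       }

       public static void main(String[] args) {
           List<String> files = Arrays.asList(
                   "hdfs://ns/part-0.parquet", "hdfs://ns/part-1.parquet",
                   "hdfs://ns/part-2.parquet", "hdfs://ns/part-3.parquet");
           System.out.println("hash-based:  " + assignByHash(files, 4));
           System.out.println("round-robin: " + assignRoundRobin(files, 4));
       }
   }
   ```

   With round-robin, a batch of n files across p subtasks always yields per-subtask loads of floor(n/p) or ceil(n/p), so no subtask is idle while another reads multiple files.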
   
   ### Usage Scenario
   
   This feature can improve file processing performance when a file-based 
connector (such as HdfsFile) is used as the source.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
