jihoonson opened a new pull request #8115: Add shuffleSegmentPusher which is a 
dataSegmentPusher used for writin…
URL: https://github.com/apache/incubator-druid/pull/8115
 
 
   This PR is for https://github.com/apache/incubator-druid/issues/8061 and 
based on https://github.com/apache/incubator-druid/pull/8114.
   
   ### Description
   
   `ShuffleDataSegmentPusher` is a `DataSegmentPusher` that writes shuffle data 
to local storage. 
   
   `ShuffleDataSegmentPusher` uses `IntermediaryDataManager` internally, which 
coordinates segment writes in a round-robin fashion per supervisor task, across 
its sub tasks. This lets shuffle fully utilize the local disk bandwidth.
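   
   A minimal sketch of the round-robin idea, for illustration only. The class 
`RoundRobinLocationSelector`, its method `nextLocation`, and the use of plain 
strings for storage locations are all hypothetical and are not the code in this 
PR; the sketch only assumes that one cursor is kept per supervisor task and is 
shared by all of its sub tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not the IntermediaryDataManager from this PR.
class RoundRobinLocationSelector
{
  private final List<String> locations;
  // One cursor per supervisor task, shared by all sub tasks of that supervisor.
  private final Map<String, AtomicInteger> cursors = new ConcurrentHashMap<>();

  RoundRobinLocationSelector(List<String> locations)
  {
    if (locations.isEmpty()) {
      throw new IllegalArgumentException("at least one location is required");
    }
    this.locations = new ArrayList<>(locations);
  }

  // Returns the location for the next segment written under the given supervisor task.
  String nextLocation(String supervisorTaskId)
  {
    AtomicInteger cursor = cursors.computeIfAbsent(supervisorTaskId, k -> new AtomicInteger());
    // floorMod keeps the index non-negative even if the counter overflows.
    int index = Math.floorMod(cursor.getAndIncrement(), locations.size());
    return locations.get(index);
  }
}
```

   With two locations, consecutive calls for the same supervisor task alternate 
between them, while a different supervisor task starts its own rotation.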
   
   Both the middleManager and the indexer can use this pusher. With the 
middleManager, however, each task uses a separate `IntermediaryDataManager` 
instance, which could result in two issues:
   
   - Shuffle segments can be distributed suboptimally across local storage 
locations.
   - `IntermediaryDataSegment` needs to smoosh segment files into larger ones 
to avoid the "too many open files" problem. This could still be an issue when 
there are a lot of tasks, since `IntermediaryDataSegment` can't smoosh files 
across tasks with the middleManager.
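   
   To make the first issue concrete, here is a rough sketch that counts writes 
per location. It rests on an assumption that is not stated above: that each 
separate instance starts its own round-robin cursor at the first location. The 
class `ShuffleDistributionSketch` and its method `writesPerLocation` are 
hypothetical names used only for this illustration.

```java
// Hypothetical illustration only; names and behavior are assumptions.
class ShuffleDistributionSketch
{
  // Counts writes per location when `tasks` tasks each write `segmentsPerTask` segments.
  // shared == true: all tasks advance a single cursor (one shared instance).
  // shared == false: every task has its own cursor starting at 0 (one instance per task).
  static int[] writesPerLocation(int numLocations, int tasks, int segmentsPerTask, boolean shared)
  {
    int[] counts = new int[numLocations];
    int sharedCursor = 0;
    for (int t = 0; t < tasks; t++) {
      int ownCursor = 0;
      for (int s = 0; s < segmentsPerTask; s++) {
        int cursor = shared ? sharedCursor++ : ownCursor++;
        counts[cursor % numLocations]++;
      }
    }
    return counts;
  }
}
```

   Under that assumption, four tasks writing one segment each over three 
locations give `[2, 1, 1]` with a shared cursor but `[4, 0, 0]` with per-task 
cursors, since every task picks the first location.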
   
   I think this would be ok for now and could be improved if required in the 
future. 
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added unit tests or modified existing tests to cover new code paths.
