wForget opened a new issue, #4780:
URL: https://github.com/apache/datafusion-comet/issues/4780

   ### What is the problem the feature request solves?
   
   ## Background
   
   Comet's shuffle partitioners currently own too much of the local shuffle 
write logic directly, including writing shuffle data files, writing index 
files, handling spill files, and finalizing partition offsets.
   
   This makes the partitioning logic tightly coupled with the local file-based 
shuffle storage implementation. It also makes it harder to introduce 
alternative shuffle storage backends, such as a remote shuffle writer, because 
each partitioner would need to be updated with backend-specific write behavior.
   
   ## Proposal
   
   Introduce a `ShufflePartitionWriter` / `PartitionWriter` abstraction for 
shuffle partition output.
   
   The partitioners should focus on producing partitioned `RecordBatch` 
streams, while the writer implementation should own the details of how shuffle 
data is stored and finalized.
   
   The initial implementation should move the existing local file-based shuffle 
write behavior into a local partition writer implementation, preserving the 
current behavior for:
   
   - single-partition shuffle
   - multi-partition shuffle
   - empty-schema shuffle
   - spill handling
   - data file and index file generation
   - shuffle write metrics
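
To make the proposed split concrete, here is a minimal sketch of what such a writer interface and a local implementation could look like. It is an illustration under stated assumptions, not Comet's actual API: the trait methods, `EncodedBatch` (a stand-in for an already-serialized `RecordBatch`), and `LocalPartitionWriter` as written here are hypothetical. It buffers partitions in memory and does not model spilling or metrics. The index is represented as `num_partitions + 1` cumulative byte offsets into the data output.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for an already-encoded `RecordBatch`.
pub struct EncodedBatch(pub Vec<u8>);

/// Hypothetical writer interface: the partitioner only calls these methods
/// and does not know where the bytes end up.
pub trait PartitionWriter {
    /// Append one encoded batch to the given output partition.
    fn write(&mut self, partition_id: usize, batch: &EncodedBatch) -> io::Result<()>;
    /// Finalize the output and return cumulative partition offsets
    /// (`num_partitions + 1` entries, starting at 0).
    fn finish(&mut self) -> io::Result<Vec<u64>>;
}

/// Sketch of a local writer: buffers each partition in memory, then writes
/// the partitions back to back into a single data sink.
pub struct LocalPartitionWriter<W: Write> {
    buffers: Vec<Vec<u8>>,
    data: W,
}

impl<W: Write> LocalPartitionWriter<W> {
    pub fn new(num_partitions: usize, data: W) -> Self {
        Self { buffers: vec![Vec::new(); num_partitions], data }
    }

    /// Give back the data sink (for example, to inspect it in a test).
    pub fn into_inner(self) -> W {
        self.data
    }
}

impl<W: Write> PartitionWriter for LocalPartitionWriter<W> {
    fn write(&mut self, partition_id: usize, batch: &EncodedBatch) -> io::Result<()> {
        let buf = self.buffers.get_mut(partition_id).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "partition id out of range")
        })?;
        buf.extend_from_slice(&batch.0);
        Ok(())
    }

    fn finish(&mut self) -> io::Result<Vec<u64>> {
        let mut offsets = Vec::with_capacity(self.buffers.len() + 1);
        let mut pos = 0u64;
        offsets.push(pos);
        for buf in &mut self.buffers {
            // Partitions are laid out contiguously in partition-id order.
            self.data.write_all(buf.as_slice())?;
            pos += buf.len() as u64;
            offsets.push(pos);
            buf.clear();
        }
        self.data.flush()?;
        Ok(offsets)
    }
}
```

In this sketch an empty partition simply repeats the previous offset, so a reader can recover each partition's byte range from two adjacent entries.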
   
   ## Benefits
   
   This refactor separates shuffle partitioning from shuffle storage, making 
the code easier to extend and maintain.
   
   It also creates a clear extension point for future remote shuffle support. A 
remote shuffle writer can later implement the same writer interface without 
requiring the shuffle partitioners to know whether the output is written to 
local files or to a remote shuffle service.
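
The extension point described above can be sketched as a partitioner that is generic over the writer. Everything here is hypothetical and simplified for illustration: the trait shape, `partition_and_write`, and `FakeRemoteWriter` are not existing Comet names, rows are plain `u64` keys rather than `RecordBatch`es, and `key % num_partitions` stands in for real hash partitioning. The point is only that the partitioning function compiles and runs unchanged against any backend that implements the trait.

```rust
/// Hypothetical writer interface with a backend-specific error type.
pub trait PartitionWriter {
    type Error;
    fn write(&mut self, partition_id: usize, rows: &[u64]) -> Result<(), Self::Error>;
    fn finish(&mut self) -> Result<(), Self::Error>;
}

/// Backend-agnostic partitioning step: groups keys by partition and hands
/// each non-empty group to whatever writer it was given.
pub fn partition_and_write<W: PartitionWriter>(
    keys: &[u64],
    num_partitions: usize,
    writer: &mut W,
) -> Result<(), W::Error> {
    assert!(num_partitions > 0, "at least one output partition is required");
    let mut groups: Vec<Vec<u64>> = vec![Vec::new(); num_partitions];
    for &k in keys {
        // Simplified stand-in for hash partitioning.
        groups[(k % num_partitions as u64) as usize].push(k);
    }
    for (partition_id, rows) in groups.iter().enumerate() {
        if !rows.is_empty() {
            writer.write(partition_id, rows)?;
        }
    }
    writer.finish()
}

/// In-memory stand-in for a remote shuffle backend: it records what would
/// have been sent and whether the output was committed.
#[derive(Default)]
pub struct FakeRemoteWriter {
    pub sent: Vec<(usize, Vec<u64>)>,
    pub committed: bool,
}

impl PartitionWriter for FakeRemoteWriter {
    type Error = String;

    fn write(&mut self, partition_id: usize, rows: &[u64]) -> Result<(), String> {
        self.sent.push((partition_id, rows.to_vec()));
        Ok(())
    }

    fn finish(&mut self) -> Result<(), String> {
        self.committed = true;
        Ok(())
    }
}
```

A local file-based writer would implement the same trait, and `partition_and_write` would not need to change.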
   
   ## Scope
   
   This issue is intended to be a refactor only. It should preserve the 
existing local shuffle behavior and prepare the codebase for future remote 
shuffle writer support.
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

