alamb commented on PR #21882:
URL: https://github.com/apache/datafusion/pull/21882#issuecomment-4412430851

   > @alamb I opened this draft PR to get early feedback on the architecture.
   > 
   > 1. The first point is around the sync read path. I introduced 
`open_sync_reader` because `SortMergeJoin` currently has synchronous, blocking 
code paths that directly open files using paths and `BufReader`, instead of 
going through the spill abstractions. Converting this to fully async would 
significantly increase the scope of this PR.
   >    
   >    * Does it make sense to keep this escape hatch for now and handle 
making these operators async in a follow-up PR?
   
   Kind of, though it seems like accumulating technical debt as we'll have APIs 
that will not be needed once we complete the work for SortMergeJoin
   
   What do you think about making a first PR to migrate SortMergeJoin to use 
the spill abstraction?
   
   > 2. The second point is regarding test failures. I have not modified the 
original 64B limit in the tests because I wanted guidance here. Currently, the 
`repartition` test in `mod.rs` is failing, and it seems related to spilling not 
being triggered correctly, the new `SpillWriteAdapter` adds slight allocation 
overhead which makes the original 64-byte memory limit too tight for the merge 
heap to initialize (~296 bytes needed), bumping up the memory limit causes the 
test to not spill anymore, I believe increasing the test data size might solve 
the issue, but am not sure.
   
   Makes sense to me
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to