adriangb commented on PR #17337:
URL: https://github.com/apache/datafusion/pull/17337#issuecomment-3426761928

   > Thank you @adriangb -- I just reviewed this PR again but I am confused.
   > 
   > Why are we introducing preferred sorts into the LogicalPlan?
   > 
   > I thought the eventual idea was to pass the preferred sorts into the 
DataSource exec so it could order files within a partition appropriately. 
However, I don't see any code to actually reorder files and I don't understand 
how it would work with information on the logical plan
   > 
   > It seems like the code that does reorder files is here in the 
ExecutionPlan:
   > 
   > 
https://github.com/apache/datafusion/blob/d19bf524e384bc24e509c70f1806b6f330829529/datafusion/datasource/src/file.rs#L93-L108
   > 
   > Won't we have to thread the preferred sort to there?
   
   That is the end goal. The ordering of files currently happens in 
`ListingTableProvider` (and I think `TableProvider` in general is a good place 
to do it anyway), so having that sorting information available there is the 
goal. Once we have this info available I will make a followup PR (just to keep 
this one smaller) where I use that information to sort the files.
   
   Some code pointers:
   
   
https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/catalog-listing/src/table.rs#L450-L476
   
   
https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/datasource/src/file_scan_config.rs#L851-L872
   
   My previous PR that started down this path but got too big and made me think 
that we should break it up: https://github.com/apache/datafusion/pull/17273
   
   The execution plan does then try to repartition again as you say but it 
passes in any existing ordering and tries to preserve it:
   
   
https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/datasource/src/source.rs#L248
   
   
https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/datasource/src/file.rs#L107
   
   I'm guessing those two things must be compatible already - otherwise the 
existing implementation would be totally broken, right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to