adriangb commented on PR #17337: URL: https://github.com/apache/datafusion/pull/17337#issuecomment-3426761928
> Thank you @adriangb -- I just reviewed this PR again but I am confused. > > Why are we introducing preferred sorts into the LogicalPlan? > > I thought the eventual idea was to pass the preferred sorts into the DataSource exec so it could order files within a partition appropriately. However, I don't see any code to actually reorder files and I don't understand how it would work with information on the logical plan > > It seems like the code that does reorder files is here in the ExecutionPlan: > > https://github.com/apache/datafusion/blob/d19bf524e384bc24e509c70f1806b6f330829529/datafusion/datasource/src/file.rs#L93-L108 > > Won't we have to thread the preferred sort to there? That is the end goal. The ordering of files currently happens in `ListingTableProvider` (and I think `TableProvider` in general is a good place to do it anyway), so having that sorting information available there is the goal. Once we have this info available I will make a followup PR (just to keep this one smaller) where I use that information to sort the files. Some code pointers: https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/catalog-listing/src/table.rs#L450-L476 https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/datasource/src/file_scan_config.rs#L851-L872 My previous PR that started down this path but got too big and made me think that we should break it up: https://github.com/apache/datafusion/pull/17273 The execution plan does then try to repartition again as you say but it passes in any existing ordering and tries to preserve it: https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/datasource/src/source.rs#L248 https://github.com/apache/datafusion/blob/b5b7f9b356a5363a71ee1d293df199c9e33206cd/datafusion/datasource/src/file.rs#L107 I'm guessing those two things must be compatible already - otherwise the existing implementation would be totally broken, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
