Re: [I] Parallelize `list_files_for_scan` [datafusion]

via GitHub Mon, 26 Jan 2026 12:32:26 -0800


Dandandan commented on issue #19971:
URL: https://github.com/apache/datafusion/issues/19971#issuecomment-3801621664


   I am using samply, a tool that does sample-based CPU profiling: 
https://github.com/apache/datafusion/blob/6524d91938d2ea6c764edd1a2bc3fd4c98cfcc9c/docs/source/library-user-guide/profiling.md#profiling-using-samply-cross-platform-profiler
   
   The screenshot shows just one query (query 5) of 
`clickbench_partitioned` (which has 100 files).
   
   I agree there is probably not much to be added regarding the listing of 
objects, but the heavy part (when running locally against a number of files) 
is actually the CPU work: deserializing, converting, and merging Parquet 
metadata and statistics, which is also done in `list_files_for_scan`.
   
   Moving this to use a number of threads should at least spread the work.
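
   A minimal sketch of the idea, using only the Rust standard library. This is not DataFusion's API: `FileStats`, `decode_stats`, `merge`, and `collect_stats_parallel` are hypothetical names, and the "decoding" is a stand-in for the real Parquet metadata work. DataFusion itself is async, so an actual change would more likely use tasks or a buffered stream rather than scoped threads.

```rust
use std::thread;

/// Hypothetical stand-in for per-file Parquet metadata + statistics.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    num_rows: u64,
    min: i64,
    max: i64,
}

/// Hypothetical CPU-bound step: pretend to decode the statistics of one file.
/// Each file "contains" 1000 rows with values in [index * 10, index * 10 + 9].
fn decode_stats(file_index: u64) -> FileStats {
    let min = file_index as i64 * 10;
    FileStats { num_rows: 1000, min, max: min + 9 }
}

/// Merge two sets of statistics into one.
fn merge(a: FileStats, b: &FileStats) -> FileStats {
    FileStats {
        num_rows: a.num_rows + b.num_rows,
        min: a.min.min(b.min),
        max: a.max.max(b.max),
    }
}

/// Split the file list into chunks, decode and merge each chunk on its own
/// thread, then merge the per-thread partial results on the calling thread.
fn collect_stats_parallel(files: &[u64], num_threads: usize) -> Option<FileStats> {
    if files.is_empty() {
        return None;
    }
    let n = num_threads.max(1);
    // Ceiling division; at least 1 because `files` is non-empty.
    let chunk_size = (files.len() + n - 1) / n;

    thread::scope(|s| {
        let handles: Vec<_> = files
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|&f| decode_stats(f))
                        .reduce(|a, b| merge(a, &b))
                })
            })
            .collect();

        handles
            .into_iter()
            .filter_map(|h| h.join().expect("worker thread panicked"))
            .reduce(|a, b| merge(a, &b))
    })
}
```

   Calling `collect_stats_parallel(&files, 8)` on 100 file indices would decode roughly 13 files per thread; only the final merge of the partial results stays on the calling thread.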


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

