Re: [I] Parallelize `infer_schema` [datafusion]

via GitHub Mon, 16 Mar 2026 07:23:25 -0700


Dandandan commented on issue #19970:
URL: https://github.com/apache/datafusion/issues/19970#issuecomment-4068033654


   > Hi [@Dandandan](https://github.com/Dandandan) 
[@alamb](https://github.com/alamb), is this still being actively pursued? We 
have a use case that would benefit from this — inferring schemas from complex 
nested JSON files can be quite slow today, especially when there are many files 
or deeply nested structures. Faster schema inference would meaningfully improve 
our workflow. Happy to contribute or help test if this is still moving forward!
   
   Go ahead, I think it's a nice issue. Did you also see this PR (perhaps you 
can help review it?):
   
   https://github.com/apache/arrow-rs/pull/9494
   
   One other thing I noticed that might be worth looking at: we always create 
more than 32 threads for metadata reading, based on the configuration. Not a 
huge problem per se, but it adds to memory usage and probably reduces locality 
a bit while running queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


