[GitHub] [arrow-datafusion] rdettai edited a comment on pull request #1347: collect table stats by default for listing table

GitBox Mon, 22 Nov 2021 01:21:40 -0800


rdettai edited a comment on pull request #1347:
URL: https://github.com/apache/arrow-datafusion/pull/1347#issuecomment-975307896



   Thanks for spotting this @houqp. I plan on working on the default values 
soon, as I find them a bit confusing. They are sourced from multiple 
places and are thus hard to document / understand. I don't mind switching the 
default for `collect_stat` back to true. I tried not to change the default 
behaviors, but it seems that this one slipped through.
   
   > For serious production use-cases, collecting stats should make a big 
different in performance for cases where it could help
   
   Sadly things are not that simple 😅. There are some cases where this wouldn't 
be true. The `collect_stats` parameter you are referring to here defines 
whether source-level statistics should be fetched **during planning** (these 
are overall statistics for the source, not file-level ones). This is meant 
primarily to enable cost-based optimizations. In a distributed setup like 
Ballista, planning happens on the scheduler node, so activating statistics 
fetching means that a single node needs to open all the files/objects to 
read the stats. This will take a very long time if there are many files. It 
might actually be faster to just get the list of files and distribute the work 
across nodes (especially if you have many nodes). If what you want is not 
necessarily the source-level statistics during planning, but only the 
file-level statistics when executing the `ExecutionPlan` to enable row group 
pruning, I think that should be a separate configuration (or maybe we don't 
even need a configuration: for formats like Parquet we should **always** get 
the file/row_group level statistics and use them).
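
   To make the trade-off above concrete, here is a rough sketch in plain Rust. 
It does not use the DataFusion or Ballista APIs: `FileStats`, 
`collect_table_stats` and `max_files_opened_per_worker` are hypothetical names 
made up for this illustration, and it assumes that opening a file is the 
dominant cost. It only compares how many files a single node has to open in 
each approach.

```rust
/// Hypothetical per-file statistics (e.g. what a Parquet footer could provide).
#[derive(Debug, Clone, Copy, PartialEq)]
struct FileStats {
    num_rows: u64,
    total_bytes: u64,
}

/// Planning-time collection: one node opens every file and folds the
/// per-file statistics into a single source-level summary.
/// Returns the summary and the number of files that had to be opened.
fn collect_table_stats(files: &[FileStats]) -> (FileStats, usize) {
    let summary = files.iter().fold(
        FileStats { num_rows: 0, total_bytes: 0 },
        |acc, f| FileStats {
            num_rows: acc.num_rows + f.num_rows,
            total_bytes: acc.total_bytes + f.total_bytes,
        },
    );
    (summary, files.len())
}

/// Execution-time alternative: files are only listed during planning and
/// then split across workers, so each worker opens only its own share.
/// Returns the largest number of files any single worker opens.
fn max_files_opened_per_worker(num_files: usize, num_workers: usize) -> usize {
    assert!(num_workers > 0, "need at least one worker");
    // Ceiling division: the busiest worker also takes part of the remainder.
    (num_files + num_workers - 1) / num_workers
}
```

   With 1000 files and 16 workers, the planning-time approach makes one node 
open all 1000 files, while in the distributed approach the busiest worker 
opens 63 of them.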


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
