alamb commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3699162074
> Could we also use a vectorized approach for distribution statistics? I think we should be able to store them as a union of structs and use UDFs to compute intersections, etc.
>
> For set statistics, at least for the `HashSet<ScalarValue>` type, we could have a simple size-based heuristic: in my experience these sorts of statistics are most useful when the sets are small. Larger sets are less useful and much more expensive to manage, i.e. a cardinality of 1 vs. 1M is useful, 1M vs. 2M less so. So maybe we cap it at 128 elements or something like that and drop it / stop building it beyond that?

I agree that the value of a set distribution is low when it has many members. Maybe we could convert the set to a min/max range after a few values.

> I imagine for larger sets, estimated set sizes and membership would be more useful, e.g. a bloom filter.

100%
