Daniël Heres created ARROW-10885:
------------------------------------

             Summary: [Rust][DataFusion] Optimize join build vs probe based on 
statistics on row number
                 Key: ARROW-10885
                 URL: https://issues.apache.org/jira/browse/ARROW-10885
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust - DataFusion
            Reporter: Daniël Heres
            Assignee: Daniël Heres


Based on number of rows in a datasource we can optimize which table should be 
part of the build phase and which part of the probe phase in a hash join. We 
should make the (approximately) smallest datasource. This can have a large 
effect on performance if one of the two tables is much bigger than the other, 
as we can skip building a large lookup table.

Recently we are adding statistics to data sources in DataFusion, so this seems 
something we can add relatively easily. We can approximate the number of rows 
based on underlying statistics in datasources, but it should at least work for 
simple cases first.
When swapping the order a left join has to be changed to a right join and vice 
versa, inner joins remain the same. Probably it is easier to start with inner 
joins and then add left / right joins.

Maybe we should also rename some internals to make clear that e.g. the left 
part is part of the build and the right part of the probe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to