jorgecarleitao opened a new pull request #8709:
URL: https://github.com/apache/arrow/pull/8709


   This PR is based on top of #8630 and contains a physical node to perform an 
inner join in DataFusion.
   
   This is still a draft, but IMO the design is here and the two tests already 
pass.
   
   This is co-authored with @andygrove , that contributed to the design on how 
to perform this operation in the context of DataFusion (see ARROW-9555 for 
details).
   
   The API used for the computation of the join at the arrow level is briefly 
discussed in [this 
document](https://docs.google.com/document/d/1KKuBvfx7uKi-x-tWOL60R1FNjDW8B790zPAv6yAlYcU/edit).
   
   There is still a lot to work on, but I I though it would be a good time to 
have a first round of discussions, and also to gauge timings wrt to the 3.0 
release.
   
   There are two main issues being addressed in this PR:
   
   * How to we perform the join at the partition level: this pr collects all 
batches from the left, and then issues a stream per part on the right. Each 
batch on that stream joins itself with all the ones from the left (N) via a 
hash. This allow us to only require computing the hash of a row once (first all 
the left, then one by one on the right).
   
   * How do we build an array from `N (left)` arrays and a set of indices 
(matching the hash from the right): this is done using the `MutableArrayData` 
being worked on #8630, which incrementally memcopies slots from each of the N 
arrays based on the index. This implementation is useful because it works for 
all array types and does not require casting anything to rust native types 
(i.e. it operates on `ArrayData`, not specific implementations).
   
   There are still some steps left to have a join in SQL, most notably the 
whole logical planning, the output_partition logic, the bindings to SQL and 
DataFrame API, update the optimizers to handle nodes with 2 children,  and a 
whole battery of tests.
   
   There is also a natural path for the other joins, as it will be a matter of 
incorporating the work already on PR #8689 that introduces the option to extend 
the `MutableArrayData` with nulls, the operation required for left and right 
joins.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to