[ https://issues.apache.org/jira/browse/ARROW-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Grove updated ARROW-9555: ------------------------------ Description: Here is an overview of how I think we should implement support for equijoins, at least for the initial implementation. * Read all batches from the left-side of the join into a single Vec<RecordBatch> * Create a map something like HashMap<Vec<ScalarValue>, Vec<(usize,usize)>> to map keys to batch/row indices * Iterate over this Vec<RecordBatch> and create an entry in a hash map, mapping the join keys to the index of the batch and row in the Vec<RecordBatch> * For each input partition on the right-side of the join, return an output partition that is an iterator/stream that: ** For each input row, evaluate the join keys ** Look up those join keys in the hash map ** If a match is found: *** For each (batch, row) index create an output row which has the values from both the left and right row and emit it ** If no match is found: *** Do not emit a row > Add inner (hash) join physical plan > ----------------------------------- > > Key: ARROW-9555 > URL: https://issues.apache.org/jira/browse/ARROW-9555 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion > Reporter: Jorge Leitão > Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Here is an overview of how I think we should implement support for equijoins, > at least for the initial implementation. > * Read all batches from the left-side of the join into a single > Vec<RecordBatch> > * Create a map something like HashMap<Vec<ScalarValue>, Vec<(usize,usize)>> > to map keys to batch/row indices > * Iterate over this Vec<RecordBatch> and create an entry in a hash map, > mapping the join keys to the index of the batch and row in the > Vec<RecordBatch> > * For each input partition on the right-side of the join, return an output > partition that is an iterator/stream that: > ** For each input row, evaluate the join keys > ** Look up those join keys in the hash map > ** If a match is found: > *** For each (batch, row) index create an output row which has the values > from both the left and right row and emit it > ** If no match is found: > *** Do not emit a row -- This message was sent by Atlassian Jira (v8.3.4#803005)