alamb commented on a change in pull request #1291:
URL: https://github.com/apache/arrow-datafusion/pull/1291#discussion_r764051498
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -909,31 +913,16 @@ fn equal_rows(
// Produces a batch for left-side rows that have/have not been matched during the whole join
fn produce_from_matched(
- visited_left_side: &[bool],
+ visited_left_side: &BooleanBufferBuilder,
schema: &SchemaRef,
column_indices: &[ColumnIndex],
left_data: &JoinLeftData,
unmatched: bool,
) -> ArrowResult<RecordBatch> {
- // Find indices which didn't match any right row (are false)
- let indices = if unmatched {
- UInt64Array::from_iter_values(
- visited_left_side
- .iter()
- .enumerate()
- .filter(|&(_, &value)| !value)
- .map(|(index, _)| index as u64),
- )
- } else {
- // produce those that did match
- UInt64Array::from_iter_values(
- visited_left_side
- .iter()
- .enumerate()
- .filter(|&(_, &value)| value)
- .map(|(index, _)| index as u64),
- )
- };
+ let indices =
+ UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
+ (unmatched ^ visited_left_side.get_bit(v)).then(|| v as u64)
+ }));
Review comment:
I ran the performance tests against this branch (merged to master) and
against master.
Setup: q13 (which has a `left join`) at TPCH Scale Factor (SF) 10 (around
10GB of data), on a GCP machine with 16 cores and 64GB RAM, running Ubuntu 20.04.
My results show no measurable difference.
I converted the data to parquet via:
```shell
cargo run --release --bin tpch -- convert --input ./data --output ./tpch-parquet --format parquet
```
Then I benchmarked using:
```shell
cargo run --release --bin tpch -- benchmark datafusion --mem-table --format parquet --path ./tpch-parquet --query 13
```
On master:
```
Query 13 iteration 0 took 10017.1 ms
Query 13 iteration 1 took 10638.8 ms
Query 13 iteration 2 took 10010.3 ms
Query 13 avg time: 10222.08 ms
```
On this branch merged to master:
```shell
git checkout boazberman/master
git merge origin/master
```
```
Query 13 iteration 0 took 10438.6 ms
Query 13 iteration 1 took 10409.1 ms
Query 13 iteration 2 took 10030.8 ms
Query 13 avg time: 10292.82 ms
```
When I ran the same test again a few times, the reported `avg` times varied
significantly:
```
...
Query 13 avg time: 10750.95 ms
...
Query 13 avg time: 10325.13 ms
...
Query 13 avg time: 10460.80 ms
```
This leads me to conclude the very small reported difference is noise.
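To make the arithmetic behind that conclusion explicit, here is a minimal sketch that recomputes the averages from the iteration times quoted above and compares the branch-vs-master gap with the spread of the repeated `avg` values. The helper names `mean` and `spread` are made up for illustration and are not part of the benchmark binary. With these numbers the gap between the two averages is roughly 71 ms (about 0.7%), while the repeated averages alone span roughly 426 ms.

```rust
/// Arithmetic mean of a slice of timings in milliseconds.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Difference between the largest and smallest value in the slice.
fn spread(xs: &[f64]) -> f64 {
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = xs.iter().cloned().fold(f64::INFINITY, f64::min);
    max - min
}

fn main() {
    // Iteration times copied from the runs quoted in this comment.
    let master = [10017.1, 10638.8, 10010.3];
    let branch = [10438.6, 10409.1, 10030.8];
    // The `avg` values reported by the repeated runs.
    let rerun_avgs = [10750.95, 10325.13, 10460.80];

    let delta = mean(&branch) - mean(&master);
    println!("master avg: {:.2} ms", mean(&master));
    println!("branch avg: {:.2} ms", mean(&branch));
    println!("branch - master: {:.2} ms", delta);
    println!("spread of repeated avgs: {:.2} ms", spread(&rerun_avgs));

    // The gap between the branches is well inside the run-to-run variation.
    assert!(delta.abs() < spread(&rerun_avgs));
}
```

Three iterations per run is a small sample, so this is only a sanity check, not a statistical test.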
Note that Q13 is:
```sql
select
c_count,
count(*) as custdist
from
(
select
c_custkey,
count(o_orderkey)
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
) as c_orders (c_custkey, c_count)
group by
c_count
order by
custdist desc,
c_count desc;
```
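Since Q13 is a left outer join, it exercises the `unmatched == true` path of `produce_from_matched`. As a rough illustration of why the XOR form in the hunk selects the same rows as the removed two-branch code, here is a standalone sketch that uses a plain `&[bool]` in place of Arrow's `BooleanBufferBuilder`. The function names `select_indices` and `select_indices_branching` are hypothetical and exist only for this example.

```rust
/// XOR form, mirroring the new code: an index is kept when
/// `unmatched` and the visited bit differ.
fn select_indices(visited: &[bool], unmatched: bool) -> Vec<u64> {
    (0..visited.len())
        .filter_map(|i| (unmatched ^ visited[i]).then(|| i as u64))
        .collect()
}

/// Two-branch form, mirroring the removed code.
fn select_indices_branching(visited: &[bool], unmatched: bool) -> Vec<u64> {
    if unmatched {
        // Indices that did not match any right row (bit is false).
        visited
            .iter()
            .enumerate()
            .filter(|&(_, &value)| !value)
            .map(|(index, _)| index as u64)
            .collect()
    } else {
        // Indices that did match (bit is true).
        visited
            .iter()
            .enumerate()
            .filter(|&(_, &value)| value)
            .map(|(index, _)| index as u64)
            .collect()
    }
}

fn main() {
    let visited = [true, false, false, true, false];
    for &unmatched in &[true, false] {
        // Both forms agree for either value of the flag.
        assert_eq!(
            select_indices(&visited, unmatched),
            select_indices_branching(&visited, unmatched)
        );
    }
    println!("unmatched rows: {:?}", select_indices(&visited, true));
    println!("matched rows:   {:?}", select_indices(&visited, false));
}
```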