[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1291: Left join could use bitmap for left join instead of Vec

GitBox Tue, 07 Dec 2021 06:32:50 -0800


alamb commented on a change in pull request #1291:
URL: https://github.com/apache/arrow-datafusion/pull/1291#discussion_r764051498




##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -909,31 +913,16 @@ fn equal_rows(
 
 // Produces a batch for left-side rows that have/have not been matched during the whole join
 fn produce_from_matched(
-    visited_left_side: &[bool],
+    visited_left_side: &BooleanBufferBuilder,
     schema: &SchemaRef,
     column_indices: &[ColumnIndex],
     left_data: &JoinLeftData,
     unmatched: bool,
 ) -> ArrowResult<RecordBatch> {
-    // Find indices which didn't match any right row (are false)
-    let indices = if unmatched {
-        UInt64Array::from_iter_values(
-            visited_left_side
-                .iter()
-                .enumerate()
-                .filter(|&(_, &value)| !value)
-                .map(|(index, _)| index as u64),
-        )
-    } else {
-        // produce those that did match
-        UInt64Array::from_iter_values(
-            visited_left_side
-                .iter()
-                .enumerate()
-                .filter(|&(_, &value)| value)
-                .map(|(index, _)| index as u64),
-        )
-    };
+    let indices =
+        UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
+            (unmatched ^ visited_left_side.get_bit(v)).then(|| v as u64)
+        }));

Review comment:
   I ran the performance tests against master and against this branch (merged with master).
   
   Setup: q13 (which has a `left join`) at TPC-H Scale Factor (SF) 10 (roughly 10 GB of data), on a 16-core, 64 GB RAM GCP machine running Ubuntu 20.04.
   
   My results show no measurable difference.
   
   
   
   
   Converted to parquet via:
   
   ```shell
   cargo run --release --bin tpch -- convert --input ./data --output ./tpch-parquet --format parquet
   ```
   
   Then benchmarked using:
   ```shell
   cargo run --release --bin tpch -- benchmark datafusion --mem-table --format parquet --path ./tpch-parquet --query 13
   ```
   
   On master:
   
   ```
   Query 13 iteration 0 took 10017.1 ms
   Query 13 iteration 1 took 10638.8 ms
   Query 13 iteration 2 took 10010.3 ms
   Query 13 avg time: 10222.08 ms
   
   ```
   
   On this branch merged to master:
   ```shell
   git checkout boazberman/master
   git merge origin/master
   ```
   
   ```
   Query 13 iteration 0 took 10438.6 ms
   Query 13 iteration 1 took 10409.1 ms
   Query 13 iteration 2 took 10030.8 ms
   Query 13 avg time: 10292.82 ms
   ```
   
   When I ran the same test again a few times, the reported `avg` times varied significantly:
   ```
   ...
   Query 13 avg time: 10750.95 ms
   ...
   Query 13 avg time: 10325.13 ms
   ...
   Query 13 avg time: 10460.80 ms
   ```
   
   This leads me to conclude the very small reported difference is noise.
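   
   To make the noise argument concrete, here is a minimal sketch of the arithmetic. The helper names (`mean`, `spread`, `within_noise`) are hypothetical and not part of the benchmark tool, and comparing the gap between averages to the max-minus-min spread of the reruns is only a rough heuristic, not a statistical test.
   
   ```rust
   /// Arithmetic mean of a slice of timings in milliseconds.
   fn mean(xs: &[f64]) -> f64 {
       xs.iter().sum::<f64>() / xs.len() as f64
   }
   
   /// Spread (max minus min) of a slice of timings in milliseconds.
   fn spread(xs: &[f64]) -> f64 {
       let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
       let min = xs.iter().cloned().fold(f64::INFINITY, f64::min);
       max - min
   }
   
   /// True when the gap between two averages is smaller than the
   /// spread of the repeated-run averages (hypothetical heuristic).
   fn within_noise(avg_a: f64, avg_b: f64, rerun_avgs: &[f64]) -> bool {
       (avg_a - avg_b).abs() < spread(rerun_avgs)
   }
   ```
   
   With the numbers above, the two averages differ by roughly 71 ms (under 1%), while the three repeated `avg` values span roughly 426 ms, so `within_noise` returns `true`.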
   
   
   Note that Q13 is:
   ```sql
   
   select
       c_count,
       count(*) as custdist
   from
       (
           select
               c_custkey,
               count(o_orderkey)
           from
               customer left outer join orders on
                           c_custkey = o_custkey
                       and o_comment not like '%special%requests%'
           group by
               c_custkey
       ) as c_orders (c_custkey, c_count)
   group by
       c_count
   order by
       custdist desc,
       c_count desc;
   ```
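   
   For reference, the single-expression selection in the diff (`unmatched ^ visited_left_side.get_bit(v)`) should pick the same rows as the two-branch version it replaces. A minimal sketch of that equivalence, using a plain `&[bool]` as a stand-in for `BooleanBufferBuilder` so it runs without the arrow crate (the function names are hypothetical):
   
   ```rust
   /// Two-branch selection, mirroring the removed code.
   fn select_two_branch(visited: &[bool], unmatched: bool) -> Vec<u64> {
       if unmatched {
           // Rows that did not match any right row (bit is false).
           visited
               .iter()
               .enumerate()
               .filter(|&(_, &value)| !value)
               .map(|(index, _)| index as u64)
               .collect()
       } else {
           // Rows that did match (bit is true).
           visited
               .iter()
               .enumerate()
               .filter(|&(_, &value)| value)
               .map(|(index, _)| index as u64)
               .collect()
       }
   }
   
   /// Single-pass selection using XOR, mirroring the added code.
   /// `unmatched ^ bit` is true for unset bits when `unmatched` is true,
   /// and for set bits when `unmatched` is false.
   fn select_xor(visited: &[bool], unmatched: bool) -> Vec<u64> {
       (0..visited.len())
           .filter_map(|v| (unmatched ^ visited[v]).then(|| v as u64))
           .collect()
   }
   ```
   
   Both functions return the same indices for either value of `unmatched`; the XOR form just avoids duplicating the iterator chain.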
   




