hhhizzz commented on PR #10141:
URL: https://github.com/apache/arrow-rs/pull/10141#issuecomment-4732734563

   I did a quick TPC-DS follow-up on this PR to check whether the bitmap-backed
   `RowSelection` change shows any broad regression in DataFusion. This is 
**not**
   a formal 10-round benchmark result; it is a 3-round/debug signal plus 
targeted
   rechecks.
   
   Setup:
   
   - TPC-DS SF10, all 99 queries
   - 3 full rounds, 1 iteration per query
   - CPUSET `0-23`
   - Parquet pushdown filters / reorder filters / pruning all enabled
   - Arrow baseline: PR base `be664614ab`
   - Arrow current: this PR head `7f40e67a`
   - DataFusion base: `ddc157d10e8`, with only Cargo patch wiring changed 
between the two sides
   
   Full 3-round result was effectively neutral:
   
   ```text
   main total:        175786.07 ms
   current total:     175010.49 ms
   geomean current/main: 0.993443
   runner-reported current speedup: 0.660%
   failures: 0
   ```
   
   The round-level direction was not stable:
   
   ```text
   round 1: +2.18% current speedup
   round 2: +6.44% current speedup
   round 3: -6.22% current speedup
   ```
   
   Some queries initially looked slower in the full run, so I rechecked the 
largest
   apparent regressions in isolation with 5 rounds each:
   
   | query | isolated geomean current/main | bucket | range |
   |---:|---:|---|---:|
   | q39 | 1.014365 | neutral | 0.991326-1.057961 |
   | q23 | 1.045529 | neutral | 0.980200-1.171810 |
   | q28 | 0.991207 | neutral | 0.953509-1.051072 |
   | q47 | 1.011234 | neutral | 0.989303-1.030131 |
   | q87 | 0.989210 | neutral | 0.957960-1.017851 |
   
   I also ran `dfbench tpcds --debug` for those queries and compared
   `DataSourceExec` metrics. I did not see evidence of row-pushdown getting 
worse
   on the PR side. For example, summed task metrics showed:
   
   | query/table | processing main -> current | scanning main -> current | 
row_pushdown main -> current |
   |---|---:|---:|---:|
   | q39 `inventory` | 752.4 -> 764.4 ms | 7770.0 -> 7660.0 ms | 74.9 -> 74.5 
ms |
   | q23 `store_sales` | 23310.0 -> 22520.0 ms | 87870.0 -> 84350.0 ms | 
13157.2 -> 12881.9 ms |
   | q28 `store_sales` | 15470.0 -> 14710.0 ms | 81720.0 -> 78300.0 ms | 247.2 
-> 243.8 ms |
   
   Those metrics are summed across tasks, not wall-clock time, but they are 
useful
   for attribution. The main takeaway is that the large per-query slowdowns from
   the 3-round full run did not reproduce in isolated runs and do not appear to
   come from row-pushdown evaluation.
   
   My interpretation is that standard TPC-DS is mostly exercising the existing
   selector-backed `RowSelection::from_filters` path, not the new "caller 
already
   has a `BooleanBuffer`" path. So near-neutral TPC-DS is expected and should 
not
   block this PR by itself.
   
   A few suggestions:
   
   1. Keep the PR focused on first-class bitmap-backed `RowSelection`; this is 
the
      right shape for external indexes / FTS / runtime filters that already 
produce
      a bitmap.
   2. Add or keep direct benchmarks for the existing selector-backed operations
      (`from_filters`, `and_then`, `intersection`, `split_off`, `limit`, 
`offset`)
      because the PR changes the common representation and iterator surface even
      when callers do not use the new bitmap constructor.
   3. Add an end-to-end benchmark where the access plan starts as a 
`BooleanBuffer`,
      with both fragmented and clustered masks. Standard TPC-DS is not a strong
      workload for that new path.
   4. For `Auto` with mask-backed input, consider documenting the current 
"preserve
      mask" behavior or adding a shape gate. The raw representation break-even 
is
      roughly `selector_count > row_count / 128`; fragmented masks strongly 
favor
      bitmap, but clustered masks can still be better as selectors.
   
   Overall, I see the TPC-DS signal as neutral, while the synthetic
   fragmented-selection benchmark still strongly supports the API direction.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to