RyanJamesStewart opened a new pull request, #9967:
URL: https://github.com/apache/arrow-rs/pull/9967

   ## Which issue does this PR close?
   
   Contributes to #9731.
   
   ## Rationale for this change
   
   When writing a nullable leaf (primitive) Arrow array, `write_leaf` builds 
the definition-level buffer one element at a time, mapping each null bit to a 
level. For columns that are mostly null this does ~`num_rows` of branchy work 
and allocates a `num_rows`-element level buffer even though almost every 
produced level is the same value. #9954 adds an O(1) fast path for the 
*entirely* null case; this PR covers the *sparse* (mostly-but-not-entirely 
null) case it doesn't handle — the literal subject of #9731 ("a column that is 
99% null … ~100x more work than necessary").
   
   ## What changes are included in this PR?
   
   A single popcount pass over the null mask (`Buffer::count_set_bits_offset`, 
O(`num_rows`/64)) counts the valid values in the range. When the slice is 
majority-null, the definition-level buffer is bulk-filled with the null level 
(a vectorized `Vec::resize` memset) and only the non-null positions (from 
`BitIndexIterator`) are overwritten. The existing per-row path is kept for 
non-majority-null slices, so balanced and null-light columns are unaffected. 
Output is byte-identical — the level *values* are unchanged, just produced via 
memset+scatter instead of a per-row loop. The non-null index reservation is 
right-sized (`len - valid` instead of `len`). One file, +32/−6.
   
   ## Are these changes tested?
   
   Existing tests cover this path: `cargo test -p parquet --features arrow 
--lib arrow_writer` — 136 tests, full of nulls and roundtrips — green; full 
`cargo test -p parquet --features arrow` green modulo the pre-existing 
`PARQUET_TEST_DATA` submodule failures (unrelated — same on `main`). `cargo 
clippy -p parquet --features arrow --lib` clean. The `unsafe get_unchecked_mut` 
in the scatter loop is guarded by a `debug_assert!` of the bounds invariant.
   
   ## Are there any user-facing changes?
   
   None.
   
   ## Benchmarks
   
   `cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable 
primitive columns:
   
   ```
   primitive_sparse_99pct_null/default   11.88 ms -> 9.13 ms   (-23%)   ← the 
case #9731 calls out
   primitive_all_null/default             5.65 ms -> 2.33 ms   (-59%)   
(subsumed by #9954's O(1) path if that lands first)
   struct_sparse_99pct_null/default       5.67 ms -> 5.32 ms   (-6%)
   struct_all_null/default                1.52 ms -> 1.31 ms   (-14%)
   list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, 
bool, string:  within noise (no regression)
   ```
   
   Microbench of the definition-level fill in isolation: 10.3× @ 100%-null, 
8.6× @ 99%, 5.2× @ 90%, 1.9× @ 50%, 0.93× @ 10%, 0.81× @ 0% — crossover ≈ 
12–15% null, clean win above ~25%; the `>= 50% null` guard is conservative.
   
   This is the *materialization*-cost half of #9731 (~30% of the 99%-null 
write); the *walk*-cost half — a run-length input to the level encoder so the 
column writer doesn't even iterate all `num_rows` levels — is the larger 
structural change #9653 is heading toward, and this is deliberately small and 
isolated so it lands independently of and rebases cleanly under that work.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to