RyanJamesStewart opened a new pull request, #9967:
URL: https://github.com/apache/arrow-rs/pull/9967
## Which issue does this PR close?
Contributes to #9731.
## Rationale for this change
When writing a nullable leaf (primitive) Arrow array, `write_leaf` builds
the definition-level buffer one element at a time, mapping each null bit to a
level. For columns that are mostly null this does ~`num_rows` of branchy work
and allocates a `num_rows`-element level buffer even though almost every
produced level is the same value. #9954 adds an O(1) fast path for the
*entirely* null case; this PR covers the *sparse* (mostly-but-not-entirely
null) case it doesn't handle — the literal subject of #9731 ("a column that is
99% null … ~100x more work than necessary").
## What changes are included in this PR?
A single popcount pass over the null mask (`Buffer::count_set_bits_offset`,
O(`num_rows`/64)) counts the valid values in the range. When the slice is
majority-null, the definition-level buffer is bulk-filled with the null level
(a vectorized `Vec::resize` memset) and only the non-null positions (from
`BitIndexIterator`) are overwritten. The existing per-row path is kept for
non-majority-null slices, so balanced and null-light columns are unaffected.
Output is byte-identical — the level *values* are unchanged, just produced via
memset+scatter instead of a per-row loop. The non-null index reservation is
right-sized (`len - valid` instead of `len`). One file, +32/−6.
## Are these changes tested?
Existing tests cover this path: `cargo test -p parquet --features arrow
--lib arrow_writer` — 136 tests, full of nulls and roundtrips — green; full
`cargo test -p parquet --features arrow` green modulo the pre-existing
`PARQUET_TEST_DATA` submodule failures (unrelated — same on `main`). `cargo
clippy -p parquet --features arrow --lib` clean. The `unsafe get_unchecked_mut`
in the scatter loop is guarded by a `debug_assert!` of the bounds invariant.
## Are there any user-facing changes?
None.
## Benchmarks
`cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable
primitive columns:
```
primitive_sparse_99pct_null/default 11.88 ms -> 9.13 ms (-23%) ← the
case #9731 calls out
primitive_all_null/default 5.65 ms -> 2.33 ms (-59%)
(subsumed by #9954's O(1) path if that lands first)
struct_sparse_99pct_null/default 5.67 ms -> 5.32 ms (-6%)
struct_all_null/default 1.52 ms -> 1.31 ms (-14%)
list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null,
bool, string: within noise (no regression)
```
Microbench of the definition-level fill in isolation: 10.3× @ 100%-null,
8.6× @ 99%, 5.2× @ 90%, 1.9× @ 50%, 0.93× @ 10%, 0.81× @ 0% — crossover ≈
12–15% null, clean win above ~25%; the `>= 50% null` guard is conservative.
This is the *materialization*-cost half of #9731 (~30% of the 99%-null
write); the *walk*-cost half — a run-length input to the level encoder so the
column writer doesn't even iterate all `num_rows` levels — is the larger
structural change #9653 is heading toward, and this is deliberately small and
isolated so it lands independently of and rebases cleanly under that work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]