adriangb opened a new pull request, #10068:
URL: https://github.com/apache/arrow-rs/pull/10068

   # Which issue does this PR close?
   
   Follow-up to benchmark noise observed on the criterion bench bot (e.g. on 
#9972), where `string/parquet_2` reported a ~1.75x "regression" that was not 
reproducible and not present in instruction counts.
   
   # Rationale for this change
   
   The `arrow_writer` benchmarks build a fresh `ArrowWriter` every criterion 
iteration, so the writer's internal encode buffers are allocated and freed on 
each iteration. With a page-decaying allocator (glibc default, jemalloc 
default), those buffers are served from fresh, un-faulted pages whenever 
earlier benchmarks in the same process have churned the heap, so each 
iteration pays a **minor page fault** on the first write to every page of 
those buffers.
   
   That fault tax roughly doubles the measured time for the byte-array writers 
and makes the result **depend on benchmark order**. On the same hardware as the 
bench bot (Neoverse-V2), the *same* `main` binary produces:
   
   | `string/parquet_2` | time |
   |---|---|
   | run in isolation | ~106 ms |
   | run after the `primitive` group | ~187 ms |
   
   This is the source of the spurious bench-bot deltas: a `main`-vs-`main` 
control run (identical code on both sides) reproduced an **18%** difference on 
`string/parquet_2`, and a larger draw produced the original ~1.75x. The work 
done is identical (instruction count differs by ~0.25% for the change that 
triggered the investigation) — only the page-fault state differs.
   
   Diagnosis details: the slow basin shows ~5M minor faults vs ~763K in the 
fast basin; forcing every buffer onto fresh pages (by setting 
`MALLOC_MMAP_THRESHOLD_` low) pins the benchmark in the slow basin, and 
disabling page decay pins it in the fast one.
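
The minor-fault comparison above can be reproduced without a profiler. The following is a minimal, Linux-only sketch that reads the process's `minflt` counter from `/proc/self/stat` before and after a workload. It is not the tooling used for this PR, and the helper names (`parse_minflt`, `minor_faults`, `faults_during`) are made up for illustration.

```rust
use std::fs;

/// Extract `minflt` (field 10 of `/proc/<pid>/stat`) from one stat line.
fn parse_minflt(stat: &str) -> Option<u64> {
    // Field 2 (comm) is parenthesised and may itself contain spaces or
    // parentheses, so only parse what follows the last ')'.
    let rest = &stat[stat.rfind(')')? + 1..];
    // `rest` starts at field 3 (state), so field 10 is at index 7.
    rest.split_whitespace().nth(7)?.parse().ok()
}

/// Minor faults taken by this process so far (None if /proc is unavailable).
fn minor_faults() -> Option<u64> {
    parse_minflt(&fs::read_to_string("/proc/self/stat").ok()?)
}

/// Run `f` and return how many minor faults the process took while it ran.
fn faults_during<F: FnOnce()>(f: F) -> Option<u64> {
    let before = minor_faults()?;
    f();
    Some(minor_faults()?.saturating_sub(before))
}
```

Wrapping a single benchmark iteration in `faults_during` once in isolation and once after another group has run should show whether the two timings differ in fault count rather than in work done.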
   
   # What changes are included in this PR?
   
   Use jemalloc as the `arrow_writer` bench's global allocator with page decay 
disabled (`dirty_decay_ms:-1,muzzy_decay_ms:-1`), so freed pages stay mapped 
and are reused warm instead of being returned to the OS. This removes the 
per-iteration fault tax and collapses the order-dependent bimodality:
   
   | `string/parquet_2` | isolated | after `primitive` | after `string` group |
   |---|---|---|---|
   | before (system alloc) | ~106 ms | ~187 ms | ~106 ms |
   | after (this PR) | ~106 ms | ~107 ms | ~106 ms |
   
   Notes on robustness (this came up in review):
   
   - The decay policy is **pinned by the benchmark**, not left to an allocator 
default — via a compiled-in `malloc_conf` symbol — so it does not silently 
change if the allocator updates its defaults.
   - jemalloc only reads the *unprefixed* `malloc_conf` symbol when built with 
`unprefixed_malloc_on_supported_platforms`; without it the symbol is silently 
ignored. To make that failure mode loud, `assert_page_decay_disabled()` reads 
`opt.dirty_decay_ms` / `opt.muzzy_decay_ms` at startup (via 
`tikv-jemalloc-ctl`) and panics if the policy is not actually `-1`, with a 
hint. This was verified to fire when the feature is removed.
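
The "make the failure loud" check in the second bullet can be sketched as below. This is an illustration, not the code in this PR: `check_decay_policy` is a hypothetical helper, and the two values are passed in as arguments so the sketch has no dependencies. In the benchmark they would be read from jemalloc's `opt.dirty_decay_ms` and `opt.muzzy_decay_ms` through `tikv-jemalloc-ctl`.

```rust
/// Value jemalloc reports for a decay option when decay is disabled.
const DECAY_DISABLED: isize = -1;

/// Validate the decay policy the allocator reports it is actually using.
/// Hypothetical helper: the arguments stand in for the values read from
/// `opt.dirty_decay_ms` and `opt.muzzy_decay_ms` at startup.
fn check_decay_policy(dirty_decay_ms: isize, muzzy_decay_ms: isize) -> Result<(), String> {
    if dirty_decay_ms == DECAY_DISABLED && muzzy_decay_ms == DECAY_DISABLED {
        return Ok(());
    }
    // Name both observed values and the most likely cause, so a silently
    // ignored `malloc_conf` symbol is easy to diagnose.
    Err(format!(
        "page decay is still enabled (opt.dirty_decay_ms={dirty_decay_ms}, \
         opt.muzzy_decay_ms={muzzy_decay_ms}); expected -1 for both. \
         Hint: the unprefixed `malloc_conf` symbol is only read when jemalloc \
         is built with `unprefixed_malloc_on_supported_platforms`."
    ))
}
```

A startup guard would call this once before any benchmark runs and panic on `Err`, for example with `check_decay_policy(dirty, muzzy).unwrap_or_else(|msg| panic!("{msg}"))`, so a misconfigured build fails immediately instead of producing order-dependent timings.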
   
   Scope: the allocator only affects the `arrow_writer` benchmark binary; no 
library code changes.
   
   # Are there any user-facing changes?
   
   No. Benchmark-only change (dev-dependencies + the `arrow_writer` bench).
   

