On Thu, 29 Jan 2026 at 15:54 +0000, Jonathan Wakely wrote:
On Wed, 28 Jan 2026 at 11:27 -0500, Patrick Palka wrote:
Changes in v7:
- Make fallback_next and fallback_rep_once_more code paths fall through
to the next / rep_once_more code paths instead of pushing a frame
that gets immediately popped.
Changes in v6:
- Use an inline namespace to give the _Executor class a different
mangled name since its implementation is no longer compatible with
the previous recursive one
- Explicitly cast _M_subexpr to _StateIdT in _M_handle_subexpr_begin/end
to self-document that what we're storing in _ExecutorFrame::_M_state_id
isn't really a state id
- I experimented with using std::deque instead of std::vector for the
frames stack.
Upsides:
- Reduces peak memory use from 50MB to 33MB for the microbenchmark
regex_match(string(200000, 'a'), "(a|b|c)*") (sketched below),
because there is less unused capacity (vector's 2x growth factor
means that 50% of the time there will be at least 25% unused capacity)
- Seems to be as fast as vector
- Automatically reclaims unused capacity as the stack unwinds
- No expensive reallocation needed when growing the stack in the
case of non-trivially-copyable input iterators
Downsides:
- deque will allocate/free a chunk of memory whenever crossing a
chunk boundary (512 elements) in either direction, which could
degrade performance in rare cases where our stack size hovers
near a multiple of the chunk boundary.
- the fixed chunk size means small inputs will use at least
512*24 bytes=12kB of memory
Ultimately it seems std::deque is better for very large inputs and
std::vector is slightly better for small inputs. I'm not sure which
container to go with; for now I've stuck with std::vector.
(N.B. we seem to be 2x faster than libc++'s std::regex and use 60%
less memory for the above microbenchmark)
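For reference, a self-contained version of that microbenchmark could look
roughly like the following; the timing harness here is only illustrative
and is not part of the patch or its testsuite:

  #include <chrono>
  #include <iostream>
  #include <regex>
  #include <string>

  int main()
  {
    // The input and pattern from the microbenchmark quoted above.
    const std::string input(200000, 'a');
    const std::regex re("(a|b|c)*");

    auto t0 = std::chrono::steady_clock::now();
    bool matched = std::regex_match(input, re);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "matched: " << matched << ", time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
  }

Peak memory for a run like this can be observed externally, e.g. via
/usr/bin/time -v (maximum resident set size).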
Another option would be to stick with vector, but manage the capacity
semi-manually. So when emplacing N new frames (where N == 1 or N == 2),
check whether we have enough spare capacity, and if not call
_M_frames.reserve(_M_frames.size() * 3 / 2 + N);
This grows by a factor of 1.5, which results in less overhead when we
actually only need a tiny bit more capacity, and 1.5 is claimed to be a
better growth factor anyway:
https://github.com/facebook/folly/blob/main/folly/docs/FBVector.md#memory-handling
Maybe put that inside a _M_reserve_frames member function, so that any
function which wants to call _M_frames.emplace_back would first call
_M_reserve_frames(1) or _M_reserve_frames(2), and that member function
would do the check to see whether it needs to increase the capacity.
That can be a follow-up patch though; using vector is fine for this
patch.
This _M_reserve_frames function could also end with something like:
if ((_M_frames.capacity() - _M_frames.size()) < __n)
  __builtin_unreachable();
which might help the emplace_back calls that follow it always take the
fast path (because they never need to reallocate), and so allow those
emplace_back calls to be inlined.
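Putting those pieces together, a rough sketch of what that helper could
look like (the exact signature and its place in _Executor are my
assumptions, not part of the patch):

  void
  _M_reserve_frames(size_t __n)
  {
    if (_M_frames.capacity() - _M_frames.size() < __n)
      // Cold path: grow by ~1.5x plus room for the frames about to be
      // pushed, as suggested above.
      _M_frames.reserve(_M_frames.size() * 3 / 2 + __n);

    // Let the compiler assume the emplace_back calls that follow never
    // need to reallocate, so they can take the fast path and be inlined.
    if (_M_frames.capacity() - _M_frames.size() < __n)
      __builtin_unreachable();
  }

A caller that pushes two frames would then do _M_reserve_frames(2)
followed by two plain _M_frames.emplace_back(...) calls.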
Of course now you've just moved the cold path into _M_reserve_frames
instead, which calls _M_frames.reserve(...). But maybe the combination
of one call to reserve that always has to relocate the contents to new
storage and two calls to emplace_back that never have to relocate is
more inlinable than two calls to emplace_back which might *both* need
to relocate, depending on how much available capacity there is before
the first call.