On Fri, Jul 8, 2022 at 7:15 PM Ivan Zhakov <[email protected]> wrote:
>
> On Fri, 8 Jul 2022 at 19:58, Yann Ylavic <[email protected]> wrote:
> >
> > The main downside is:
> >
> > > +#ifndef AP_PCRE_STACKBUF_SIZE
> > > +#define AP_PCRE_STACKBUF_SIZE (256)
> > >  #endif
> > []
> > >  AP_DECLARE(int) ap_regexec(const ap_regex_t *preg, const char *string,
> > >                             apr_size_t nmatch, ap_regmatch_t *pmatch,
> > >                             int eflags)
> > > -    int small_vector[POSIX_MALLOC_THRESHOLD * 3];
> > > +    /* Use apr_uint64_t to get proper alignment. */
> > > +    apr_uint64_t stack_buf[(AP_PCRE_STACKBUF_SIZE + sizeof(apr_uint64_t) 
> > > - 1) / sizeof(apr_uint64_t)];
> >
> > So, if I count correctly, that's >2KB stack usage (regardless of PCRE1
> > or PCRE2) compared to the 120B we had previously (for PCRE1 only)?
> >
> AP_PCRE_STACKBUF_SIZE is in bytes. And by default is 256 bytes. So
> code allocates exactly 256 bytes on stack.

Argh, so I can't count either.

Let me explain (maybe), I thought that PCRE2's match_data were not
only containing the captures vector but also all the frames needed for
matching not-too-complex regexes (at least the initial ones since they
will be reallocated on need). So the (supposed) 2KB looked necessary
to me for a few stack frames.
But the actual 256 bytes only made me look at the PCRE2 code again
(better this time) for how that can possibly work, and guess what: the
initial frames are allocated on the stack ([1]) still in PCRE2, and
that's 20KiB ([2]) of stack space for each call to pcre2_match().

On Alpine Linux for instance (quite used in docker envs) there is only
128KiB stack per thread, so I've always configured
--no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
vector cache, with custom pcre_malloc/free() to avoid performances
overhead and be able to run more complex regexes without exploding,
and I can't do that now with PCRE2 and without requiring 20KB of stack
:/
So I created a pull request ([3]).

> >
> > Finally, I don't see how it's better than one allocation (or two) for
> > the whole lifetime of a thread, given that thread_local is available
> > on all the platforms (where performances matter), neither do I see
> > what are your grievances against thread_local..
> >
> 1. tls storage is growing to maximum captures count. So it requires
> more than one or two allocations.

For the lifetime of a thread and the number of ap_regexec() it may
call, that's all negligible.

> 2. tls storage doesn't support dynamic MPM that uses native threads

httpd-2.4 (and APR 1.8 unless reverted) provide the way to apr-ize a
native thread so that ap_thread_current() is available.

> 3. access to tls storage requires hash lookup.

Hmm no, util_pcre uses no hash lookup for instance
If one wants to handle TLS data with apr_thread_data_get(&data,
ap_current_thread()) then they will be stored in an apr_hash_t, but
there's nothing implicit. And by the way, using thread_data hashtable
won't be less efficient than registering as much native thread data
(e.g. pthread_key_t), I already mentioned that in another thread.

> 4. Windows build is currently broken due to tls storage API changes.

Some Windows dev to fix the trivial compilation error (which by the
way passed the 2.4.54 release process)?
I'm not, though I am trying to update the code of all platforms when possible..

> 5. Using tls reserves the memory, even if pcre is not being used
> afterwards. This can be thought as of implicitly increasing the
> per-thread stack size.

Which memory is reserved exactly for TLS?

> 6. With new code there will be no actual allocations in real-world cases.

Well, there will be no allocation but the 20KB stack space, though
that's true for the thread_local implementation too.
Now if PCRE2 stops using stack frames by allowing users to reserve
that space in match_data (like I propose in [3]), the thread_local
implementation clearly wins thanks to the ~almost~ single allocation
per thread, while your implementation would either use 20KB of stack
or malloc()ate them for every pcre2_match() call.
That'd be great if PCRE2 would allow for that, not to show that my
implementation is better but because 20KB of stack for a single
function in a lib is just insane (IMHO).


Regards;
Yann.

[1] https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_match.c#L6362
[2] https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_internal.h#L231
[3] https://github.com/PCRE2Project/pcre2/pull/134

Reply via email to