On 28/04/14 10:15 -0400, Tim Shen wrote:
On Mon, Apr 28, 2014 at 8:40 AM, Jonathan Wakely <jwak...@redhat.com> wrote:
I've been looking through the regex code and have a few ideas for
simplifications or optimisations that I'd like to share.

Thanks :)

This first patch is for _BracketMatcher. We only use std::bitset when
is_same<_CharT, char> so 8 * sizeof(_CharT) should be __CHAR_BIT__
instead. We also only user _UnsignedCharT when is_same<_CharT, char>
so that can just be simplified to unsigned char.

Yes, since _UnsignedCharT is just used as indexes, we can always use a
larger unsigned integer instead. Maybe "size_t" is a better choice?

There is a well-defined mapping from every unsigned char in the range
[0,255] to char and back, so conversions between char and unsigned
char are fine.  If we used a larger type then we would get the wrong result
when char is signed, because (size_t)(char)-1 != (unsigned char)(char)-1

I'm not sure if we'll have a wchar_t cache (bitset<65536>) in the future ;)

sizeof(wchar_t) is 4 on unix-like systems, but a cache for char16_t
wouldn't be totally crazy.

If you prefer I will not remove the uses of _CharT and _UnsignedCharT
and won't assume the cache is only used for char. There are still some
simplifications we can make.

The contents of _BracketMatcher::_M_char_set are not sorted and can
contain duplicates in the current code. Making that a sorted, unique
list in _BracketMatcher::_M_ready() allows a binary search instead of
linear search. This improves worst case performance for pathological
regular expressions like std::wregex('['+std::wstring(1000, 'a')+"b]")
but I'm not sure if it helps in the common case.

Trust me, in common case, even it's a bit slower, it won't be
observable. So it's just OK.

Finally, in the non-char case the _CacheT member is an unused empty
object, so having that as the first member requires 7 bytes of
padding. Re-ordering the members reduces the size of a non-char
_BracketMatcher by 8 bytes (but it's still a whopping 96 bytes).

(For a char _BracketMatcher the bitset cache makes it 128 bytes,
this patch doesn't change that).

This is OK too.

Thanks, I'll clean the patch up and commit it soon.

Reply via email to