[pcre-dev] Support invalid UTF subject strings by PCRE2-JIT

Zoltán Herczeg Mon, 17 Sep 2018 12:27:53 -0700

Dear PCRE2 users,

now that PCRE2 10.32 has been released, it is time to announce a new major 
feature for PCRE2-JIT: support for invalid UTF subject strings. This feature is 
enabled by passing the PCRE2_JIT_INVALID_UTF option to pcre2_jit_compile(). It 
is recommended to use pcre2_jit_match() after the pattern is compiled.


Regular expressions (whether they are traditional automata-based engines or 
newer pattern-matching script languages) are designed to search for character 
sequences in textual input, where each character has a list of attributes and 
these attributes can be used to control the matching process. For example, 
patterns can be constructed to search for lowercase Greek words or full Latin 
sentences.

Currently Unicode is the most popular standard for encoding written text. 
Unicode characters are called code points, and the Unicode standard provides a 
long list of attributes for each code point. The UTF (Unicode Transformation 
Format) encodings have been created to encode these code points as byte 
sequences. However, these encodings do not use all possible byte values and 
sequences, so a random binary input may contain bytes which cannot be decoded 
as code points. When the PCRE2_JIT_INVALID_UTF option is enabled, the generated 
code can detect these bytes. Since they are not valid code points, nothing 
matches them, not even a dot with the PCRE2_DOTALL option or \p{Any}. 
Zero-width assertions require valid code points as well, e.g. a word boundary 
check (\b) fails if either side is not a valid UTF character. Therefore the 
result of a successful match is always a valid UTF string, whether or not the 
PCRE2_JIT_INVALID_UTF option is enabled.
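
To illustrate which bytes "cannot be decoded as code points" in the UTF-8 case, 
here is a small stand-alone sketch of a well-formedness check. It is not the 
code the JIT generates; the function name utf8_sequence_length() is made up for 
this example, and the rules it applies are the usual UTF-8 ones (no stray 
continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>

/* Illustrative helper (hypothetical, not PCRE2 code): returns the length
   (1 to 4) of the well-formed UTF-8 sequence starting at p, or 0 if the
   bytes at p do not encode a code point. `avail` is the number of bytes
   readable from p. */
size_t utf8_sequence_length(const unsigned char *p, size_t avail)
{
    unsigned long cp;
    size_t len, i;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;                                   /* plain ASCII */
    else if (p[0] >= 0xC2 && p[0] <= 0xDF) { len = 2; cp = p[0] & 0x1F; }
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) { len = 3; cp = p[0] & 0x0F; }
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) { len = 4; cp = p[0] & 0x07; }
    else
        return 0;       /* 0x80-0xC1 and 0xF5-0xFF never start a sequence */

    if (avail < len)
        return 0;                                   /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (unsigned long)(p[i] & 0x3F);
    }

    if (len == 3 && cp < 0x800)
        return 0;                                   /* overlong encoding */
    if (len == 4 && cp < 0x10000)
        return 0;                                   /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                                   /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
        return 0;                                   /* beyond Unicode range */
    return len;
}
```

A position where this function returns 0 corresponds to what is described 
above as an invalid byte: with PCRE2_JIT_INVALID_UTF enabled, no pattern item 
can match there.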

While enabling the PCRE2_JIT_INVALID_UTF option has a performance overhead, it 
might still be faster than converting binary data to valid UTF first, 
especially if a match is found at the beginning of a sizable input. Even when 
this option is enabled, the UTF code units must still be aligned: a UTF-16/32 
subject string must be uint16_t/uint32_t aligned.
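
The alignment requirement can be checked before handing a buffer to the 16-bit 
or 32-bit library. The following is only a sketch with made-up helper names; it 
assumes a C11 compiler (for _Alignof) and does not call PCRE2 at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers (not part of PCRE2): report whether `ptr` may be
   used as the start of a UTF-16 or UTF-32 subject string. */
int is_aligned_for_utf16(const void *ptr)
{
    return ((uintptr_t)ptr % _Alignof(uint16_t)) == 0;
}

int is_aligned_for_utf32(const void *ptr)
{
    return ((uintptr_t)ptr % _Alignof(uint32_t)) == 0;
}
```

If the data starts at an odd offset inside a larger binary blob, one option is 
to copy it into a properly aligned buffer first (for example with memcpy() into 
a uint16_t array) and match against the copy.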

This is a JIT-only feature; there are no plans to support it in the PCRE2 
interpreter because of the increased runtime overhead. Furthermore, a large 
amount of new code has been added, so if you are interested and have some time, 
please try it. The latest code is available in the svn repository:

svn co svn://vcs.exim.org/pcre2/code/trunk pcre2

Any feedback is welcome.

Regards,
Zoltan
 
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
