On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> On 07/09/2016 05:12 PM, p...@cpan.org wrote:
> > Hi! As we know, utf8::encode() does not provide correct UTF-8
> > encoding and Encode::encode("UTF-8", ...) should be used instead.
> > Also, opening a file should be done with the :encoding(UTF-8) layer
> > instead of :utf8.
> >
> > But the strict UTF-8 implementation in the Encode module is horribly
> > slow compared to utf8::encode(). It is implemented in the Encode.xs
> > file, and for benchmarking this XS implementation can be called
> > directly by:
> >
> >   use Encode;
> >   my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input);
> >
> > (without the overhead of the Encode module...)
> >
> > Here are my results on a 160-byte input string:
> >
> >   Encode::utf8::encode_xs({strict_utf8 => 1}, ...):
> >     8 wallclock secs ( 8.56 usr + 0.00 sys = 8.56 CPU) @ 467289.72/s (n=4000000)
> >   Encode::utf8::encode_xs({strict_utf8 => 0}, ...):
> >     1 wallclock secs ( 1.66 usr + 0.00 sys = 1.66 CPU) @ 2409638.55/s (n=4000000)
> >   utf8::encode:
> >     1 wallclock secs ( 0.39 usr + 0.00 sys = 0.39 CPU) @ 10256410.26/s (n=4000000)
> >
> > I found two bottlenecks (the slow sv_catpv* and utf8n_to_uvuni
> > functions) and did some optimizations. The final results are:
> >
> >   Encode::utf8::encode_xs({strict_utf8 => 1}, ...):
> >     2 wallclock secs ( 3.27 usr + 0.00 sys = 3.27 CPU) @ 1223241.59/s (n=4000000)
> >   Encode::utf8::encode_xs({strict_utf8 => 0}, ...):
> >     1 wallclock secs ( 1.68 usr + 0.00 sys = 1.68 CPU) @ 2380952.38/s (n=4000000)
> >   utf8::encode:
> >     1 wallclock secs ( 0.40 usr + 0.00 sys = 0.40 CPU) @ 10000000.00/s (n=4000000)
> >
> > The patches are on GitHub in this pull request:
> > https://github.com/dankogai/p5-encode/pull/56
> >
> > I would be glad if somebody could review my patches and tell me
> > whether this is the right way to optimize...
>
> I'm sorry that this slipped off my radar until I saw it in the new
> Encode release.
>
> There are a couple of things I see wrong with your patch.
>
> 1) It does not catch the malformation of an overlong sequence. This is
> a serious malformation which has been used for attacks. Basically,
> after you get the result, you need to check that it is the expected
> length for that result. For example, \xC2\x80 will have an input
> length of 2 and evaluates to \x00, whose expected length is 1, and so
> the input is overlong. In modern perls, you can just do an
> OFFUNISKIP(uv) and compare that with the passed-in length. This can be
> rewritten for perls back to 5.8 using UNI_SKIP and UNI_TO_NATIVE.
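If I understand point 1) correctly, the check you propose would look
roughly like this (an untested sketch; uv is the code point just
decoded and len the number of input bytes consumed; the croak() is
only illustrative):

    /* Reject overlong sequences: an overlong form uses more bytes
     * than the shortest encoding of uv requires.  OFFUNISKIP() is
     * the modern-perl spelling; older perls would go through the
     * UNI_SKIP and UNI_TO_NATIVE route mentioned above. */
    if (OFFUNISKIP(uv) != len) {
        /* e.g. "\xC2\x80" decodes to uv == 0x00 with len == 2,
         * but OFFUNISKIP(0x00) == 1, so the input is overlong */
        croak("Malformed UTF-8 character (overlong sequence)");
    }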
Still, I do not see where the problem can be. At least I think my
patches are compatible with the previous implementation in Encode.xs...
First UTF8_IS_INVARIANT is checked and one character is processed.
Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get the
length of the sequence, and then the following len-1 bytes are checked
against UTF8_IS_CONTINUATION. If there are too few continuation bytes,
the next byte fails UTF8_IS_CONTINUATION and an error is reported. If
there are too many, the next iteration of the loop starts on the extra
byte, which fails both the UTF8_IS_CONTINUATION and UTF8_IS_START
checks. (The whole walk is sketched in code at the end of this mail.)
Can you describe in detail what you think is wrong and how that attack
would work?

> 2) It does not work on EBCDIC platforms. The NATIVE_TO_UTF() call is
> a good start, but the result uv needs to be transformed back to
> native, using UNI_TO_NATIVE(uv).

uv is used just to check whether it is a valid Unicode code point. Its
actual value is used only for the error/warning message. The previous
implementation used utf8n_to_uvuni, which converts the return value
with NATIVE_TO_UNI.

> 3) The assumptions the subroutine runs under need to be documented
> for future maintainers and code readers. For example, it assumes that
> there is enough space in the input to hold all the bytes.

The process_utf8 function does not assume that. It calls SvGROW to
increase the buffer size when needed.

> Other than that, it looks ok to me. But, to be sure, I think you
> should run it on the tests included in the core t/op/utf8decode.t,
> which came from an internet repository of edge cases.

How can I use and run that test with Encode?
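For reference, here is the walk I described above, in outline (a
simplified and untested sketch, not the literal code from Encode.xs;
the real process_utf8 also decodes, copies, and grows the output
buffer as it goes, and the function name here is just for
illustration):

    #include "EXTERN.h"
    #include "perl.h"

    /* Returns TRUE if the bytes in [u, e) pass the structural checks
     * described above; the real code reports an error instead of
     * just returning. */
    static bool
    is_wellformed_utf8(const U8 *u, const U8 *e)
    {
        while (u < e) {
            if (UTF8_IS_INVARIANT(*u)) {
                u++;                       /* one-byte character */
            }
            else if (UTF8_IS_START(*u)) {
                STRLEN len = UTF8SKIP(u);  /* expected sequence length */
                STRLEN i;
                for (i = 1; i < len; i++) {
                    /* a truncated sequence fails here: the next
                     * byte is missing or not a continuation byte */
                    if (u + i >= e || !UTF8_IS_CONTINUATION(u[i]))
                        return FALSE;
                }
                u += len;                  /* sequence accepted */
            }
            else {
                /* neither invariant nor a start byte: a stray
                 * continuation byte lands here, including the extra
                 * bytes of a too-long sequence on the next iteration */
                return FALSE;
            }
        }
        return TRUE;
    }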