On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> On 07/09/2016 05:12 PM, p...@cpan.org wrote:
> > Hi! As we know, utf8::encode() does not provide correct UTF-8
> > encoding and Encode::encode("UTF-8", ...) should be used instead.
> > Also, opening a file should be done with the :encoding(UTF-8) layer
> > instead of :utf8.
> >
> > But the strict UTF-8 implementation in the Encode module is horribly
> > slow compared to utf8::encode(). It is implemented in the Encode.xs
> > file, and for benchmarking this XS implementation can be called
> > directly by:
> >
> >   use Encode;
> >   my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input);
> >
> > (without the overhead of the Encode module...)
> >
> > Here are my results on a 160-byte input string:
> >
> >   Encode::utf8::encode_xs({strict_utf8 => 1}, ...):
> >     8 wallclock secs ( 8.56 usr + 0.00 sys = 8.56 CPU) @ 467289.72/s (n=4000000)
> >   Encode::utf8::encode_xs({strict_utf8 => 0}, ...):
> >     1 wallclock secs ( 1.66 usr + 0.00 sys = 1.66 CPU) @ 2409638.55/s (n=4000000)
> >   utf8::encode:
> >     1 wallclock secs ( 0.39 usr + 0.00 sys = 0.39 CPU) @ 10256410.26/s (n=4000000)
> >
> > I found two bottlenecks (the slow sv_catpv* and utf8n_to_uvuni
> > functions) and did some optimizations. The final results are:
> >
> >   Encode::utf8::encode_xs({strict_utf8 => 1}, ...):
> >     2 wallclock secs ( 3.27 usr + 0.00 sys = 3.27 CPU) @ 1223241.59/s (n=4000000)
> >   Encode::utf8::encode_xs({strict_utf8 => 0}, ...):
> >     1 wallclock secs ( 1.68 usr + 0.00 sys = 1.68 CPU) @ 2380952.38/s (n=4000000)
> >   utf8::encode:
> >     1 wallclock secs ( 0.40 usr + 0.00 sys = 0.40 CPU) @ 10000000.00/s (n=4000000)
> >
> > The patches are on GitHub in this pull request:
> > https://github.com/dankogai/p5-encode/pull/56
> >
> > I would be glad if somebody could review my patches and tell me
> > whether this is the right way to optimize...
>
> I'm sorry that this slipped off my radar until I saw it in the new
> Encode release.
>
> There are a couple of things I see wrong with your patch.
>
> 1) It does not catch the malformation of an overlong sequence. This is
> a serious malformation which has been used for attacks. Basically,
> after you get the result, you need to check that it is the expected
> length for that result. For example, \xC2\x80 will have an input
> length of 2 and evaluates to \x00, whose expected length is 1, and so
> the input is overlong. In modern perls, you can just do an
> OFFUNISKIP(uv) and compare that with the passed-in length. This can be
> rewritten for perls back to 5.8 using UNI_SKIP and UNI_TO_NATIVE.
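If I understand point 1) correctly, the check you propose would look
roughly like this (an untested sketch; uv is the code point just
decoded and len the number of input bytes consumed; the croak() is
only illustrative):

    /* Reject overlong sequences: an overlong form uses more bytes
     * than the shortest encoding of uv requires.  OFFUNISKIP() is
     * the modern-perl spelling; older perls would go through the
     * UNI_SKIP and UNI_TO_NATIVE route mentioned above. */
    if (OFFUNISKIP(uv) != len) {
        /* e.g. "\xC2\x80" decodes to uv == 0x00 with len == 2,
         * but OFFUNISKIP(0x00) == 1, so the input is overlong */
        croak("Malformed UTF-8 character (overlong sequence)");
    }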
Still, I do not see where the problem can be. At least I think my
patches are compatible with the previous implementation in Encode.xs...
First UTF8_IS_INVARIANT is checked and one character is processed.
Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get the
length of the sequence, and then the following len-1 bytes are checked
against UTF8_IS_CONTINUATION. If there are too few continuation bytes,
the next byte fails UTF8_IS_CONTINUATION and an error is reported. If
there are too many, the next iteration of the loop starts on the extra
byte, which fails both the UTF8_IS_CONTINUATION and UTF8_IS_START
checks. (The whole walk is sketched in code at the end of this mail.)
Can you describe in detail what you think is wrong and how that attack
would work?

> 2) It does not work on EBCDIC platforms. The NATIVE_TO_UTF() call is
> a good start, but the result uv needs to be transformed back to
> native, using UNI_TO_NATIVE(uv).

uv is used just to check whether it is a valid Unicode code point. Its
actual value is used only for the error/warning message. The previous
implementation used utf8n_to_uvuni, which converts the return value
with NATIVE_TO_UNI.

> 3) The assumptions the subroutine runs under need to be documented
> for future maintainers and code readers. For example, it assumes that
> there is enough space in the input to hold all the bytes.

The process_utf8 function does not assume that. It calls SvGROW to
increase the buffer size when needed.

> Other than that, it looks ok to me. But, to be sure, I think you
> should run it on the tests included in the core t/op/utf8decode.t,
> which came from an internet repository of edge cases.

How can I use and run that test with Encode?
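For reference, here is the walk I described above, in outline (a
simplified and untested sketch, not the literal code from Encode.xs;
the real process_utf8 also decodes, copies, and grows the output
buffer as it goes, and the function name here is just for
illustration):

    #include "EXTERN.h"
    #include "perl.h"

    /* Returns TRUE if the bytes in [u, e) pass the structural checks
     * described above; the real code reports an error instead of
     * just returning. */
    static bool
    is_wellformed_utf8(const U8 *u, const U8 *e)
    {
        while (u < e) {
            if (UTF8_IS_INVARIANT(*u)) {
                u++;                       /* one-byte character */
            }
            else if (UTF8_IS_START(*u)) {
                STRLEN len = UTF8SKIP(u);  /* expected sequence length */
                STRLEN i;
                for (i = 1; i < len; i++) {
                    /* a truncated sequence fails here: the next
                     * byte is missing or not a continuation byte */
                    if (u + i >= e || !UTF8_IS_CONTINUATION(u[i]))
                        return FALSE;
                }
                u += len;                  /* sequence accepted */
            }
            else {
                /* neither invariant nor a start byte: a stray
                 * continuation byte lands here, including the extra
                 * bytes of a too-long sequence on the next iteration */
                return FALSE;
            }
        }
        return TRUE;
    }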