On Monday 22 August 2016 21:43:59 Karl Williamson wrote:
> On 08/22/2016 07:05 AM, [email protected] wrote:
> > On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> >> On 08/21/2016 02:34 AM, [email protected] wrote:
> >>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>>> Top posting.
> >>>>
> >>>> Attached is my alternative patch. It effectively uses a different
> >>>> algorithm to avoid decoding the input into code points, and to copy
> >>>> all spans of valid input at once, instead of a character at a time.
> >>>>
> >>>> And it uses only currently available functions.
> >>>
> >>> And that's the problem. As I already wrote in a previous email, a call
> >>> to a function in a shared library cannot be optimized as heavily as an
> >>> inlined function, and it causes a slowdown. You are calling
> >>> is_utf8_string_loc() for non-strict mode; it is not inlined, so
> >>> encode/decode in non-strict mode will be slower...
> >>>
> >>> Also, in is_strict_utf8_string_loc() you are calling isUTF8_CHAR(),
> >>> which calls _is_utf8_char_slow(), which in turn calls utf8n_to_uvchr();
> >>> those cannot be inlined either...
> >>>
> >>> Therefore I think this is not a good approach...
> >>>
> >>
> >> Then you should run your benchmarks to find out the performance.
> >
> > You are right, benchmarks are needed to show final results.
> >
> >> On valid input, is_utf8_string_loc() is called once per string. The
> >> function call overhead and non-inlining should not be noticeable.
> >
> > Ah right, I misread it; I thought it was called once per valid sequence,
> > not once for the whole string. You are right.
>
> It is called once per valid sequence. See below.
>
> >
> >> On valid input, is_utf8_char_slow() is never called. The parts that are
> >> used can be inlined.
> >
> > Yes, but this function is there to be called primarily on unknown input,
> > which does not have to be valid. If I know that the input is valid, then
> > utf8::encode/decode is enough :-)
>
> What process_utf8() does is to copy the alleged UTF-8 input to the
> output, verifying along the way that it actually is legal UTF-8 (with 2
> levels of strictness, depending on the input parameter), and taking some
> actions (exactly what depends on other input parameters) if and when it
> finds invalid UTF-8.
>
> The way it works after my patch is like an instruction pipeline. You
> start it up, and it stays in the pipeline as long as the next character
> in the input is legal, or until it reaches the end. When it finds illegal
> input, it drops out of the pipeline, handles that, and starts up the
> pipeline to process any remaining input. If the entire input string is
> valid, a single instance of the pipeline is all that gets invoked.
Yes, I figured out how it works.
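In pseudo-C, the loop is something like this (my sketch only, not the
actual patch; s, send, and d are the usual input/end/output pointers,
is_utf8_string_loc() and Copy() are existing perl APIs, and
handle_malformation() is a hypothetical placeholder for whatever action
the caller requested):

    while (s < send) {
        const U8 *ep;
        bool ok = is_utf8_string_loc(s, send - s, &ep);

        Copy(s, d, ep - s, U8);  /* copy the whole valid span at once */
        d += ep - s;
        s = ep;

        if (ok)
            break;               /* end of input: everything was valid */

        /* An invalid sequence starts at s: take the requested action
         * (croak, substitute U+FFFD, ...), skip past it, and restart
         * the pipeline on the remaining input. */
        s = handle_malformation(s, send, &d);
    }

On fully valid input, the loop body runs exactly once.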
> The use-case I envision is that the input is supposed to be valid UTF-8,
> and the purpose of process_utf8() is to verify that that is in fact
> true, and to take specified actions when it isn't.
Right!
> Under that use-case,
> taking longer to deal with invalid input is not a problem. If that is
> not your use-case, please explain what yours is.
Basically Encode::decode("UTF-8", $input) is used for converting "untrusted"
input
sequence (e.g. from network or local file) to perl Unicode scalar. And if input
contains something invalid, then Encode::decode do anything needed to return
valid
Unicode string (= replace invalid subsequences by Unicode replacement
character). So
Encode::decode("UTF-8", $input) is there for processing any input sequences,
not only
valid, also broken or totally invalid.
> And I think you misunderstand when is_utf8_char_slow() is called. It is
> called only when the next byte in the input indicates that the only
> legal UTF-8 that might follow would be for a code point that is at least
> U+200000, almost twice as high as the highest legal Unicode code point.
> It is a Perl extension to handle such code points, unlike other
> languages. But the Perl core is not optimized for them, nor will it be.
> My point is that is_utf8_char_slow() will only be called in very
> specialized cases, and we need not make those cases have as good a
> performance as normal ones.
In strict mode there is absolutely no need to call is_utf8_char_slow(),
because in strict mode such a sequence must always be invalid (it is
above the last valid Unicode character). This is what I was trying to
say.

Currently, is_strict_utf8_string_loc() first calls isUTF8_CHAR() (which
could call is_utf8_char_slow()) and only afterwards checks
UTF8_IS_SUPER(). So maybe it would make sense to provide a "strict"
version of the isUTF8_CHAR() macro, since such a strict version does not
have to call is_utf8_char_slow() at all.
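On ASCII platforms, a sketch of such a macro could look like this
(isSTRICT_UTF8_CHAR is just an illustrative name; the UTF8_IS_SUPER()
and surrogate checks that is_strict_utf8_string_loc() already performs
would still follow it):

    /* Start bytes 0xF8 and above can only begin sequences for code
     * points of at least U+200000, which are never valid strict UTF-8,
     * so reject them outright; the is_utf8_char_slow() path of
     * isUTF8_CHAR() is then never reached. */
    #define isSTRICT_UTF8_CHAR(s, e)                \
        (UNLIKELY(*(const U8 *)(s) >= 0xF8)         \
            ? 0                                     \
            : isUTF8_CHAR((s), (e)))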
> >> On invalid input, performance should be a minor consideration.
> >
> > See below...
>
> See above. :)
>
> >
> >> The inner loop is much tighter in both functions; likely it can be held in
> >> the cache. The algorithm avoids a bunch of work compared to the previous
> >> one.
> >
> > Right, for valid input the algorithm is really faster. Whether it is
> > because of fewer cache misses... maybe... I can play with perf or
> > another tool to see where the bottleneck is now.
> >
> >> I doubt that it will be slower than that. The only way to know in any
> >> performance situation is to actually test. And know that things will be
> >> different depending on the underlying hardware, so only large differences
> >> are really significant.
> >
> > So, here are my test results. You can say that they are subjective, but
> > I would be happy if somebody provided better input for performance tests.
> >
> > Abbreviations:
> > strict = Encode::encode/decode "UTF-8"
> > lax = Encode::encode/decode "utf8"
> > int = utf8::encode/decode
> > orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
> > your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
> > my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d
> >
> > Test cases:
> > all = join "", map { chr } 0 .. 0x10FFFF
> > short = "žluťoučký kůň pěl ďábelské ódy " x 45
> > long = $short x 1000
> > invalid-short = "\xA0" x 1000
> > invalid-long = "\xA0" x 1000000
> >
> > Encoding was called on a string with the Encode::_utf8_on() flag set.
> >
> >
> > Rates:
> >
> > encode:
> >                      all       short     long  invalid-short  invalid-long
> > orig - strict       41/s    124533/s    132/s       115197/s         172/s
> > your - strict      176/s    411523/s    427/s        54813/s          66/s
> > my   - strict       80/s    172712/s    186/s       113787/s         138/s
> >
> > orig - lax        1010/s   3225806/s   6250/s       546800/s        5151/s
> > your - lax         952/s   3225806/s   5882/s       519325/s        4919/s
> > my   - lax        1060/s   3125000/s   6250/s       645119/s        5009/s
> >
> > orig - int     8154604/s  10000000/s    infty      9787566/s     9748151/s
> > your - int     9135243/s  11111111/s    infty      8922821/s     9737657/s
> > my   - int     9779395/s  10000000/s    infty      9822046/s     8949861/s
> >
> >
> > decode:
> >                      all       short     long  invalid-short  invalid-long
> > orig - strict       39/s    119048/s    131/s       108574/s         171/s
> > your - strict      173/s    353357/s    442/s        42440/s          55/s
> > my   - strict       69/s    166667/s    182/s       117291/s         135/s
> >
> > orig - lax          39/s    123609/s    137/s       127302/s         172/s
> > your - lax         230/s    393701/s    495/s        37346/s          65/s
> > my   - lax          79/s    158983/s    180/s       121456/s         138/s
> >
> > orig - int         274/s    546448/s    565/s      8219513/s       12357/s
> > your - int         273/s    540541/s    562/s      7226066/s       12948/s
> > my   - int         274/s    543478/s    562/s      8502902/s       12421/s
> >
> >
> > int is there just as a sanity check of the tests, as the
> > utf8::encode/decode functions were not changed.
> >
> > The results: your patch is faster for valid sequences (as you wrote
> > above), but slower for invalid ones (in some cases radically).
> >
> > So I would propose two optimizations:
> >
> > 1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with a
> >    new one which does not call utf8n_to_uvchr(). That call is not
> >    needed, as in that case the sequence is already invalid.
>
> And an optimizing compiler should figure that out and omit the call, so
> we shouldn't have to do it manually. In the next few months I will be working
> on some fixes to the :utf8 layer. That could lead to a core function
> similar to this, but without relying on the compiler to figure things out.
Great! But the :utf8 layer cannot be used for reading "untrusted"
arbitrary files...
> > 2) Try to make an inline version of is_utf8_string_loc(). Maybe merge
> >    it with is_strict_utf8_string_loc()? That should boost the non-strict
> >    decoder for invalid sequences (it is now slower than the strict one).
>
> I'll look at which functions in utf8.c should be inlined. This is a
> candidate. But I doubt that that is why this is slower in this case.
> Read blead's perlhack.pod for performance testing tool hints.
>
> Your comments caused me to realize that the call to utf8n_to_uvchr()
> when the input is syntactically valid, but is an illegal code point
> (like a surrogate, etc) could be replaced by the faster
> valid_utf8_to_uvchr(), which has not been in the public API. I have
> pushed to
>
> http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode
>
> a version of blead which has this function made public, and inlined, in
> case you want to test with it.
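Nice. If I understand it correctly, the swap amounts to roughly this (my
sketch only; valid_utf8_to_uvchr() may be called only on bytes already
known to be well formed):

    /* isUTF8_CHAR() has already verified the sequence at s, so the
     * code point can be extracted without re-validating the bytes. */
    STRLEN retlen;
    UV uv = valid_utf8_to_uvchr(s, &retlen);

    if (UNICODE_IS_SURROGATE(uv) || UNICODE_IS_SUPER(uv)) {
        /* syntactically valid, but rejected in strict mode */
    }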
>
> There are some things in the error handling code that could be improved
> upon. For example, memcpy() of a single character is slower than just
> copying a byte at a time, as the function call overhead dwarfs the savings.
This one memcpy is OK. It is called on a constant string, and I checked
that gcc 4.6 with -O2 on x86_64 already replaces the call with inline
instructions; no memcpy function call is emitted.
But maybe it could be rewritten... I have no idea whether other compilers
optimize strlen/strcpy/memcpy calls on constant strings when the length
is known.
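For illustration, the call is of this shape (FBCHAR_UTF8 is Encode's
define for the replacement-character bytes on ASCII platforms); both the
source and the length are compile-time constants, so the compiler can
expand it inline:

    #define FBCHAR_UTF8 "\xEF\xBF\xBD"  /* U+FFFD REPLACEMENT CHARACTER */

    /* the length is the constant 3, so gcc -O2 emits a few inline
     * moves instead of an actual call to memcpy() */
    memcpy(d, FBCHAR_UTF8, sizeof(FBCHAR_UTF8) - 1);
    d += sizeof(FBCHAR_UTF8) - 1;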
> >
> > And maybe it would make sense to make all the needed functions part of
> > the public API.
>
> Yes
>
> >
> > Are you going to prepare a pull request for the Encode module?
>
> I will, after everything is settled. This may include changes to
> Devel::PPPort to make sure this works on all perls that Encode supports.
>
> However, I think it would be good to get the extra tests of malformed
> inputs into the distribution. (I haven't looked at your PR yet for
> flaws. I'll do that, and hopefully will find none, so will recommend
> your PR be pulled.)
I added some tests for overlong sequences, but only for ASCII platforms;
tests for EBCDIC are missing (sorry, I do not have access to any EBCDIC
platform for testing).
> > Anyway, how does it behave on EBCDIC platforms? And another question:
> > what should Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
> > UTF-8 or to UTF-EBCDIC?
>
> It works fine on EBCDIC platforms. There are other bugs in Encode on
> EBCDIC that I plan on investigating as time permits. Doing this has
> fixed some of these for free. The uvuni() functions should in almost
> all instances be uvchr(), and my patch does that.
Now I'm wondering whether the FBCHAR_UTF8 define also works on EBCDIC...
I think it needs to be different for UTF-EBCDIC.
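One portable possibility (a sketch only, untested on EBCDIC) would be to
build the bytes at runtime with the public uvchr_to_utf8(), which
produces the native encoding, i.e. UTF-EBCDIC on EBCDIC platforms:

    /* Native "UTF-8" (UTF-EBCDIC on EBCDIC) encoding of U+FFFD, built
     * at runtime instead of hard-coding the ASCII-platform bytes. */
    U8 fbchar[UTF8_MAXBYTES + 1];
    STRLEN fbchar_len = uvchr_to_utf8(fbchar, 0xFFFD) - fbchar;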
> On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if
> you prefer), so $str will effectively be in the version of UTF-EBCDIC
> valid for the platform it is running on (there are differences depending
> on the platform's underlying code page).
So does that mean that on EBCDIC platforms you cannot process a file
which is encoded in (ASCII-platform) UTF-8? As I understand it,
Encode::decode("UTF-8", $str) there expects $str to be in UTF-EBCDIC,
not in UTF-8.