Re: Encode UTF-8 optimizations
On Monday 22 August 2016 23:38:05 Karl Williamson wrote:
> And, I'd rather not tweak it to call UTF8_IS_SUPER first,
> because that relies on knowing what the current internal
> implementation is.

Then maybe add a new macro, isUTF8_CHAR_STRICT, which only checks whether a character is strictly valid UTF-8?  I think such a macro could be useful...
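As a rough illustration, such a macro might be composed from the macros already quoted in this thread (isUTF8_CHAR, isUTF8_POSSIBLY_PROBLEMATIC, UTF8_IS_SURROGATE, UTF8_IS_SUPER, UTF8_IS_NONCHAR); the name and the exact set of exclusions are only a sketch, not a settled API:

    /* Sketch: true iff the next character is well-formed UTF-8 AND an
     * interchangeable Unicode code point (not a surrogate, not above
     * U+10FFFF, not a non-character).  A real version would also skip
     * isUTF8_CHAR()'s is_utf8_char_slow() fallback, which a strictly
     * valid character never needs. */
    #define isUTF8_CHAR_STRICT(s, e)                          \
        (isUTF8_CHAR(s, e)                                    \
         && ! (   isUTF8_POSSIBLY_PROBLEMATIC(*(s))           \
               && (   UTF8_IS_SURROGATE(s, e)                 \
                   || UTF8_IS_SUPER(s, e)                     \
                   || UTF8_IS_NONCHAR(s, e))))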
Re: Encode UTF-8 optimizations
(this only applies to strict UTF-8)

On Monday 22 August 2016 23:19:51 Karl Williamson wrote:
> The code could be tweaked to call UTF8_IS_SUPER first, but I'm
> asserting that an optimizing compiler will see that any call to
> is_utf8_char_slow() is pointless, and will optimize it out.

Such an optimization cannot be done; the compiler cannot know any such thing...  You have this code:

+        const STRLEN char_len = isUTF8_CHAR(x, send);
+
+        if (UNLIKELY(! char_len)
+            || (   UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*x))
+                && (   UNLIKELY(UTF8_IS_SURROGATE(x, send))
+                    || UNLIKELY(UTF8_IS_SUPER(x, send))
+                    || UNLIKELY(UTF8_IS_NONCHAR(x, send)))))
+        {
+            *ep = x;
+            return FALSE;
+        }

Here the isUTF8_CHAR() macro will call the function is_utf8_char_slow() whenever the condition IS_UTF8_CHAR_FAST(UTF8SKIP(x)) is false.  And because is_utf8_char_slow() is an external library function, the compiler has absolutely no idea what that function is doing.  C is not a purely functional language; such a function could have side effects, etc., so the compiler really cannot eliminate that call.

Moving UTF8_IS_SUPER before isUTF8_CHAR might help, but I'm skeptical that gcc can really propagate constants from the PL_utf8skip[] array back and prove that IS_UTF8_CHAR_FAST must always be true whenever UTF8_IS_SUPER is true too...

Rather, add an IS_UTF8_CHAR_FAST(UTF8SKIP(s)) check (or similar) before the isUTF8_CHAR() call.  That should completely eliminate any generated code that calls the is_utf8_char_slow() function.  With UTF8_IS_SUPER there would still be a branch in the binary code that is never taken.
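The guard I mean could look something like this (sketched against the macro names used above, not a finished patch).  Because the guard repeats the test that the isUTF8_CHAR() expansion performs internally, the optimizer can merge the two and the slow-path call site disappears:

    /* Start bytes needing the slow path can only encode code points
     * above Unicode, so in strict mode just treat them as invalid: */
    const STRLEN char_len = IS_UTF8_CHAR_FAST(UTF8SKIP(x))
                            ? isUTF8_CHAR(x, send)
                            : 0;

    if (UNLIKELY(! char_len)) {
        *ep = x;           /* report where the invalid sequence starts */
        return FALSE;
    }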
Re: Encode UTF-8 optimizations
On 08/22/2016 03:19 PM, Karl Williamson wrote:
> On 08/22/2016 02:47 PM, p...@cpan.org wrote:
>>> And I think you misunderstand when is_utf8_char_slow() is called.  It
>>> is called only when the next byte in the input indicates that the
>>> only legal UTF-8 that might follow would be for a code point that is
>>> at least U+200000, almost twice as high as the highest legal Unicode
>>> code point.  It is a Perl extension to handle such code points,
>>> unlike other languages.  But the Perl core is not optimized for them,
>>> nor will it be.  My point is that is_utf8_char_slow() will only be
>>> called in very specialized cases, and we need not make those cases
>>> have as good a performance as normal ones.
>>
>> In strict mode, there is absolutely no need to call
>> is_utf8_char_slow(), as in strict mode such a sequence must always be
>> invalid (it is above the last valid Unicode character).  This is what
>> I tried to tell you.  And currently is_strict_utf8_string_loc() first
>> calls isUTF8_CHAR() (which could call is_utf8_char_slow()) and only
>> after that checks UTF8_IS_SUPER().
>
> I only have time to respond to this portion just now.
>
> The code could be tweaked to call UTF8_IS_SUPER first, but I'm
> asserting that an optimizing compiler will see that any call to
> is_utf8_char_slow() is pointless, and will optimize it out.

Now I'm realizing I'm wrong.  It can't be optimized out by the compiler, because is_utf8_char_slow() is not declared (nor can it be) to be a pure function.

And I'd rather not tweak it to call UTF8_IS_SUPER first, because that relies on knowing what the current internal implementation is.  But I still argue that it is fine the way it is.  It will only get called for code points much higher than Unicode, and the performance on those should not affect our decisions in any way.
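For readers following along, "pure" here is meant in the compiler-attribute sense: only when a function is known to be free of side effects may an unused call to it be deleted.  A tiny illustration with hypothetical names (GCC/Clang syntax), not code from the patch:

    extern int check_opaque(const unsigned char *s, int len);
    extern int check_pure  (const unsigned char *s, int len)
               __attribute__((pure));  /* promises: result only, no effects */

    void demo(const unsigned char *s, int len)
    {
        check_opaque(s, len);   /* must stay: might do anything */
        check_pure(s, len);     /* result unused: may be removed */
    }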
Re: Encode UTF-8 optimizations
On 08/22/2016 02:47 PM, p...@cpan.org wrote:
>> And I think you misunderstand when is_utf8_char_slow() is called.  It
>> is called only when the next byte in the input indicates that the only
>> legal UTF-8 that might follow would be for a code point that is at
>> least U+200000, almost twice as high as the highest legal Unicode code
>> point.  It is a Perl extension to handle such code points, unlike
>> other languages.  But the Perl core is not optimized for them, nor
>> will it be.  My point is that is_utf8_char_slow() will only be called
>> in very specialized cases, and we need not make those cases have as
>> good a performance as normal ones.
>
> In strict mode, there is absolutely no need to call
> is_utf8_char_slow(), as in strict mode such a sequence must always be
> invalid (it is above the last valid Unicode character).  This is what I
> tried to tell you.  And currently is_strict_utf8_string_loc() first
> calls isUTF8_CHAR() (which could call is_utf8_char_slow()) and only
> after that checks UTF8_IS_SUPER().

I only have time to respond to this portion just now.

The code could be tweaked to call UTF8_IS_SUPER first, but I'm asserting that an optimizing compiler will see that any call to is_utf8_char_slow() is pointless, and will optimize it out.
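For concreteness, the "UTF8_IS_SUPER first" tweak would reorder the checks of the patch hunk quoted earlier in the thread, roughly like this (illustrative only; as the follow-up message concludes, the compiler still cannot prove the slow path dead):

    /* Reject above-Unicode sequences before isUTF8_CHAR() runs: */
    if (UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*x))
        && UNLIKELY(UTF8_IS_SUPER(x, send)))
    {
        *ep = x;
        return FALSE;            /* strict mode: nothing above U+10FFFF */
    }

    const STRLEN char_len = isUTF8_CHAR(x, send);
    /* is_utf8_char_slow() is now reachable only for non-super start
     * bytes, but that fact is invisible to the compiler, so the call
     * remains in the generated code. */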
Re: Encode UTF-8 optimizations
On Monday 22 August 2016 21:43:59 Karl Williamson wrote:
> On 08/22/2016 07:05 AM, p...@cpan.org wrote:
> > On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> >> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
> >>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>>> Top posting.
> >>>>
> >>>> Attached is my alternative patch.  It effectively uses a different
> >>>> algorithm to avoid decoding the input into code points, and to
> >>>> copy all spans of valid input at once, instead of character at a
> >>>> time.
> >>>>
> >>>> And it uses only currently available functions.
> >>>
> >>> And that's the problem.  As I already wrote in a previous email,
> >>> calling a function from a shared library cannot be heavily
> >>> optimized the way an inlined function can, and causes a slowdown.
> >>> You are calling is_utf8_string_loc for non-strict mode, which is
> >>> not inlined, and so encode/decode in non-strict mode will be
> >>> slower...
> >>>
> >>> And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR,
> >>> which calls _is_utf8_char_slow, which calls utf8n_to_uvchr, which
> >>> cannot be inlined either...
> >>>
> >>> Therefore I think this is not a good approach...
> >>
> >> Then you should run your benchmarks to find out the performance.
> >
> > You are right, benchmarks are needed to show final results.
> >
> >> On valid input, is_utf8_string_loc() is called once per string.
> >> The function call overhead and non-inlining should not be
> >> noticeable.
> >
> > Ah right, I misread it as being called once per valid sequence, not
> > once for the whole string.  You are right.
>
> It is called once per valid sequence.  See below.
>
> >> On valid input, is_utf8_char_slow() is never called.  The used
> >> parts can be inlined.
> >
> > Yes, but this function is there to be called primarily on unknown
> > input, which does not have to be valid.  If I know that the input is
> > valid, then utf8::encode/decode is enough :-)
>
> What process_utf8() does is to copy the alleged UTF-8 input to the
> output, verifying along the way that it actually is legal UTF-8 (with
> 2 levels of strictness, depending on the input parameter), and taking
> some actions (exactly what depends on other input parameters) if and
> when it finds invalid UTF-8.
>
> The way it works after my patch is like an instruction pipeline.  You
> start it up, and it stays in the pipeline as long as the next
> character in the input is legal, or until it reaches the end.  When it
> finds illegal input, it drops out of the pipeline, handles that, and
> starts up the pipeline to process any remaining input.  If the entire
> input string is valid, a single instance of the pipeline is all that
> gets invoked.

Yes, I figured out how it works.

> The use-case I envision is that the input is supposed to be valid
> UTF-8, and the purpose of process_utf8() is to verify that that is in
> fact true, and to take specified actions when it isn't.

Right!

> Under that use-case, taking longer to deal with invalid input is not a
> problem.  If that is not your use-case, please explain what yours is.

Basically, Encode::decode("UTF-8", $input) is used for converting an "untrusted" input sequence (e.g. from the network or a local file) to a perl Unicode scalar.  And if the input contains something invalid, then Encode::decode does whatever is needed to return a valid Unicode string (= replaces invalid subsequences with the Unicode replacement character).

So Encode::decode("UTF-8", $input) is there for processing any input sequence, not only valid ones, but also broken or totally invalid ones.

> And I think you misunderstand when is_utf8_char_slow() is called.
> It is called only when the next byte in the input indicates that the
> only legal UTF-8 that might follow would be for a code point that is
> at least U+200000, almost twice as high as the highest legal Unicode
> code point.  It is a Perl extension to handle such code points, unlike
> other languages.  But the Perl core is not optimized for them, nor
> will it be.  My point is that is_utf8_char_slow() will only be called
> in very specialized cases, and we need not make those cases have as
> good a performance as normal ones.

In strict mode, there is absolutely no need to call is_utf8_char_slow(), as in strict mode such a sequence must always be invalid (it is above the last valid Unicode character).  This is what I tried to tell you.

And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR() (which could call is_utf8_char_slow()) and only after that checks UTF8_IS_SUPER().  So maybe it would make sense to provide some "strict" version of the isUTF8_CHAR() macro, as such a strict version does not have to call is_utf8_char_slow().

> >> On invalid input, performance should be a minor consideration.
> >
> > See below...
>
> See above.  :)
>
> >> The inner loop is much tighter in both functions; likely it can be
> >> held in the cache.  The algorithm avoids a bunch of work compared
> >> to the previous one.
> >
> > Right, for valid input algo
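At the byte level, the replacement behaviour described above amounts to a fragment like this (illustrative only; Encode's real behaviour depends on its CHECK flags, and its resync policy is more involved than skipping a single byte):

    /* U+FFFD REPLACEMENT CHARACTER, encoded in UTF-8 */
    static const unsigned char REPLACEMENT[3] = { 0xEF, 0xBF, 0xBD };

    /* on an invalid sequence at s: substitute U+FFFD in the output,
     * advance past the offending byte, and resume validating */
    memcpy(out, REPLACEMENT, sizeof REPLACEMENT);
    out += sizeof REPLACEMENT;
    s   += 1;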
Re: Encode UTF-8 optimizations
On 08/22/2016 07:05 AM, p...@cpan.org wrote:
> On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
>> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
>>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
>>>> Top posting.
>>>>
>>>> Attached is my alternative patch.  It effectively uses a different
>>>> algorithm to avoid decoding the input into code points, and to copy
>>>> all spans of valid input at once, instead of character at a time.
>>>>
>>>> And it uses only currently available functions.
>>>
>>> And that's the problem.  As I already wrote in a previous email,
>>> calling a function from a shared library cannot be heavily optimized
>>> the way an inlined function can, and causes a slowdown.  You are
>>> calling is_utf8_string_loc for non-strict mode, which is not inlined,
>>> and so encode/decode in non-strict mode will be slower...
>>>
>>> And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR,
>>> which calls _is_utf8_char_slow, which calls utf8n_to_uvchr, which
>>> cannot be inlined either...
>>>
>>> Therefore I think this is not a good approach...
>>
>> Then you should run your benchmarks to find out the performance.
>
> You are right, benchmarks are needed to show final results.
>
>> On valid input, is_utf8_string_loc() is called once per string.  The
>> function call overhead and non-inlining should not be noticeable.
>
> Ah right, I misread it as being called once per valid sequence, not
> once for the whole string.  You are right.

It is called once per valid sequence.  See below.

>> On valid input, is_utf8_char_slow() is never called.  The used parts
>> can be inlined.
>
> Yes, but this function is there to be called primarily on unknown
> input, which does not have to be valid.  If I know that the input is
> valid, then utf8::encode/decode is enough :-)

What process_utf8() does is to copy the alleged UTF-8 input to the output, verifying along the way that it actually is legal UTF-8 (with 2 levels of strictness, depending on the input parameter), and taking some actions (exactly what depends on other input parameters) if and when it finds invalid UTF-8.

The way it works after my patch is like an instruction pipeline.  You start it up, and it stays in the pipeline as long as the next character in the input is legal, or until it reaches the end.  When it finds illegal input, it drops out of the pipeline, handles that, and starts up the pipeline to process any remaining input.  If the entire input string is valid, a single instance of the pipeline is all that gets invoked.

The use-case I envision is that the input is supposed to be valid UTF-8, and the purpose of process_utf8() is to verify that that is in fact true, and to take specified actions when it isn't.  Under that use-case, taking longer to deal with invalid input is not a problem.  If that is not your use-case, please explain what yours is.

And I think you misunderstand when is_utf8_char_slow() is called.  It is called only when the next byte in the input indicates that the only legal UTF-8 that might follow would be for a code point that is at least U+200000, almost twice as high as the highest legal Unicode code point.  It is a Perl extension to handle such code points, unlike other languages.  But the Perl core is not optimized for them, nor will it be.  My point is that is_utf8_char_slow() will only be called in very specialized cases, and we need not make those cases have as good a performance as normal ones.

>> On invalid input, performance should be a minor consideration.
>
> See below...

See above.  :)

>> The inner loop is much tighter in both functions; likely it can be
>> held in the cache.  The algorithm avoids a bunch of work compared to
>> the previous one.
> Right, for valid input the algorithm is really faster.  If it is
> because of fewer cache misses... maybe...  I can play with perf or
> another tool to look at what the bottleneck is now.
>
>> I doubt that it will be slower than that.  The only way to know in
>> any performance situation is to actually test.  And know that things
>> will be different depending on the underlying hardware, so only large
>> differences are really significant.
>
> So, here are my test results.  You can say that they are subjective,
> but I would be happy if somebody provided better input for performance
> tests.
>
> Abbreviations:
> strict = Encode::encode/decode "UTF-8"
> lax    = Encode::encode/decode "utf8"
> int    = utf8::encode/decode
>
> orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
> your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
> my   = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d
>
> Test cases:
> all           = join "", map { chr } 0 .. 0x10FFFF
> short         = "žluťoučký kůň pěl ďábelské ódy " x 45
> long          = $short x 1000
> invalid-short = "\xA0" x 1000
> invalid-long  = "\xA0" x 100
>
> Encoding was called on a string with the Encode::_utf8_on() flag set.
>
> Rates:
>
> encode:        all    short     long   invalid-short  invalid-long
> orig - strict  41/s   124533/s  132/s  115197/s       172/s
> your - strict  176/s  411523/s  427/s  54813/s        66/s
> my - strict    80/s   172712/s  186/s  11378
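Karl's pipeline description above can be made concrete with a standalone sketch (invented names, simplified strict validation, and a naive skip-one-byte resync; the real process_utf8() lives in Encode.xs and uses the core macros discussed in this thread):

    #include <stddef.h>
    #include <string.h>

    /* one_char_len() plays the role of isUTF8_CHAR(): byte length of one
     * strictly valid UTF-8 character at s, or 0 if invalid/truncated. */
    static size_t one_char_len(const unsigned char *s, const unsigned char *e)
    {
        const size_t left = (size_t)(e - s);
        if (left == 0)   return 0;
        if (s[0] < 0x80) return 1;                     /* ASCII */
        if (s[0] < 0xC2) return 0;                     /* continuation / overlong */
        if (s[0] < 0xE0)                               /* 2-byte form */
            return (left >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
        if (s[0] < 0xF0) {                             /* 3-byte form */
            if (left < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
                return 0;
            if (s[0] == 0xE0 && s[1] < 0xA0) return 0; /* overlong */
            if (s[0] == 0xED && s[1] > 0x9F) return 0; /* UTF-16 surrogate */
            return 3;
        }
        if (s[0] <= 0xF4) {                            /* 4-byte form */
            if (left < 4 || (s[1] & 0xC0) != 0x80
                         || (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
                return 0;
            if (s[0] == 0xF0 && s[1] < 0x90) return 0; /* overlong */
            if (s[0] == 0xF4 && s[1] > 0x8F) return 0; /* above U+10FFFF */
            return 4;
        }
        return 0;                                      /* 0xF5..0xFF */
    }

    /* The "pipeline": stay in the inner loop while characters are valid,
     * copy each maximal valid span with a single memcpy, substitute
     * U+FFFD for each invalid byte, restart.  Returns bytes written;
     * out must have room for the worst case (3 * len). */
    static size_t copy_validating(const unsigned char *s, size_t len,
                                  unsigned char *out)
    {
        const unsigned char *e = s + len;
        unsigned char *o = out;
        while (s < e) {
            const unsigned char *span = s;
            size_t n;
            while (s < e && (n = one_char_len(s, e)) != 0)
                s += n;                            /* fast path */
            memcpy(o, span, (size_t)(s - span));   /* whole span at once */
            o += s - span;
            if (s < e) {                           /* invalid byte found */
                memcpy(o, "\xEF\xBF\xBD", 3);      /* U+FFFD */
                o += 3;
                s += 1;                            /* simple resync policy */
            }
        }
        return (size_t)(o - out);
    }

The point of the design is visible in the two loops: on fully valid input the outer loop runs once and the data is moved with one bulk copy, which is why the benchmarks below reward valid input and penalize input that is invalid at every byte.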
Re: Encode utf8 warnings
On Saturday 13 August 2016 19:41:46 p...@cpan.org wrote:
> Hello, I see that there is one big mess in utf8 warnings for Encode.

Per request, this discussion was moved to the perl5-port...@perl.org mailing list:
http://www.nntp.perl.org/group/perl.perl5.porters/2016/08/msg239061.html
Re: Encode UTF-8 optimizations
On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
> > On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >> Top posting.
> >>
> >> Attached is my alternative patch.  It effectively uses a different
> >> algorithm to avoid decoding the input into code points, and to copy
> >> all spans of valid input at once, instead of character at a time.
> >>
> >> And it uses only currently available functions.
> >
> > And that's the problem.  As I already wrote in a previous email,
> > calling a function from a shared library cannot be heavily optimized
> > the way an inlined function can, and causes a slowdown.  You are
> > calling is_utf8_string_loc for non-strict mode, which is not inlined,
> > and so encode/decode in non-strict mode will be slower...
> >
> > And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR,
> > which calls _is_utf8_char_slow, which calls utf8n_to_uvchr, which
> > cannot be inlined either...
> >
> > Therefore I think this is not a good approach...
>
> Then you should run your benchmarks to find out the performance.

You are right, benchmarks are needed to show final results.

> On valid input, is_utf8_string_loc() is called once per string.  The
> function call overhead and non-inlining should not be noticeable.

Ah right, I misread it as being called once per valid sequence, not once for the whole string.  You are right.

> On valid input, is_utf8_char_slow() is never called.  The used parts
> can be inlined.

Yes, but this function is there to be called primarily on unknown input, which does not have to be valid.  If I know that the input is valid, then utf8::encode/decode is enough :-)

> On invalid input, performance should be a minor consideration.

See below...

> The inner loop is much tighter in both functions; likely it can be
> held in the cache.  The algorithm avoids a bunch of work compared to
> the previous one.

Right, for valid input the algorithm is really faster.  If it is because of fewer cache misses... maybe...  I can play with perf or another tool to look at what the bottleneck is now.

> I doubt that it will be slower than that.  The only way to know in any
> performance situation is to actually test.  And know that things will
> be different depending on the underlying hardware, so only large
> differences are really significant.

So, here are my test results.  You can say that they are subjective, but I would be happy if somebody provided better input for performance tests.

Abbreviations:
strict = Encode::encode/decode "UTF-8"
lax    = Encode::encode/decode "utf8"
int    = utf8::encode/decode

orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
my   = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d

Test cases:
all           = join "", map { chr } 0 .. 0x10FFFF
short         = "žluťoučký kůň pěl ďábelské ódy " x 45
long          = $short x 1000
invalid-short = "\xA0" x 1000
invalid-long  = "\xA0" x 100

Encoding was called on a string with the Encode::_utf8_on() flag set.
Rates:

encode:        all        short      long    invalid-short  invalid-long
orig - strict  41/s       124533/s   132/s   115197/s       172/s
your - strict  176/s      411523/s   427/s   54813/s        66/s
my - strict    80/s       172712/s   186/s   113787/s       138/s
orig - lax     1010/s     3225806/s  6250/s  546800/s       5151/s
your - lax     952/s      3225806/s  5882/s  519325/s       4919/s
my - lax       1060/s     3125000/s  6250/s  645119/s       5009/s
orig - int     8154604/s  1000/s     infty   9787566/s      9748151/s
your - int     9135243/s  /s         infty   8922821/s      9737657/s
my - int       9779395/s  1000/s     infty   9822046/s      8949861/s

decode:        all        short      long    invalid-short  invalid-long
orig - strict  39/s       119048/s   131/s   108574/s       171/s
your - strict  173/s      353357/s   442/s   42440/s        55/s
my - strict    69/s       17/s       182/s   117291/s       135/s
orig - lax     39/s       123609/s   137/s   127302/s       172/s
your - lax     230/s      393701/s   495/s   37346/s        65/s
my - lax       79/s       158983/s   180/s   121456/s       138/s
orig - int     274/s      546448/s   565/s   8219513/s      12357/s
your - int     273/s      540541/s   562/s   7226066/s      12948/s
my - int       274/s      543478/s   562/s   8502902/s      12421/s

int is there just as a check on the tests, as the utf8::encode/decode functions were not changed.

The results: your patch is faster for valid sequences (as you wrote above), but slower for invalid ones (in some cases radically so).

So I would propose two optimizations:

1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with some new macro which does not call utf8n_to_uvchr.  That call is not needed, as in that case the sequence is already invalid.

2) Try to make an inline version of the function is_utf8_string_loc().  Maybe merge with is_st
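A sketch of what proposal 2) might look like (assuming the usual perl.h types U8/STRLEN and the isUTF8_CHAR() macro discussed above; the name is hypothetical).  Defined static inline in a header, the body is compiled directly into Encode.xs, where the optimizer can see through it, instead of every call crossing into libperl.so:

    static inline bool
    my_is_utf8_string_loc(const U8 *s, STRLEN len, const U8 **ep)
    {
        const U8 * const e = s + len;
        while (s < e) {
            const STRLEN n = isUTF8_CHAR(s, e);  /* core macro, inlinable */
            if (! n) {
                *ep = s;          /* location of the first invalid byte */
                return FALSE;
            }
            s += n;
        }
        *ep = s;                  /* whole string valid: points at end */
        return TRUE;
    }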