On Monday 22 August 2016 21:43:59 Karl Williamson wrote:
> On 08/22/2016 07:05 AM, p...@cpan.org wrote:
> > On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> >> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
> >>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>>> Top posting.
> >>>>
> >>>> Attached is my alternative patch.  It effectively uses a different
> >>>> algorithm to avoid decoding the input into code points, and to copy
> >>>> all spans of valid input at once, instead of character at a time.
> >>>>
> >>>> And it uses only currently available functions.
> >>>
> >>> And that's the problem.  As I already wrote in a previous email, a
> >>> call to a function in a shared library cannot be optimized as heavily
> >>> as an inlined function and causes a slowdown.  You are calling
> >>> is_utf8_string_loc() for non-strict mode, which is not inlined, so
> >>> encode/decode in non-strict mode will be slower...
> >>>
> >>> And in is_strict_utf8_string_loc() you are calling isUTF8_CHAR, which
> >>> calls _is_utf8_char_slow, which calls utf8n_to_uvchr, which cannot be
> >>> inlined either...
> >>>
> >>> Therefore I think this is not a good approach...
> >>
> >> Then you should run your benchmarks to find out the performance.
> >
> > You are right, benchmarks are needed to show the final results.
>
> >> On valid input, is_utf8_string_loc() is called once per string.  The
> >> function call overhead and non-inlining should not be noticeable.
> >
> > Ah right, I misread it as being called once per valid sequence, not
> > once for the whole string.  You are right.
>
> It is called once per valid sequence.  See below.
>
> >> On valid input, is_utf8_char_slow() is never called.  The used parts
> >> can be inlined.
> >
> > Yes, but this function is there to be called primarily on unknown
> > input, which does not have to be valid.
> > If I know that the input is valid, then utf8::encode/decode is
> > enough :-)
>
> What process_utf8() does is to copy the alleged UTF-8 input to the
> output, verifying along the way that it actually is legal UTF-8 (with 2
> levels of strictness, depending on the input parameter), and taking some
> actions (exactly what depends on other input parameters) if and when it
> finds invalid UTF-8.
>
> The way it works after my patch is like an instruction pipeline.  You
> start it up, and it stays in the pipeline as long as the next character
> in the input is legal, or until it reaches the end.  When it finds
> illegal input, it drops out of the pipeline, handles that, and starts
> the pipeline up again to process any remaining input.  If the entire
> input string is valid, a single instance of the pipeline is all that
> gets invoked.
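For readers following along, the span-copying "pipeline" described above can be sketched roughly like this.  This is a toy illustration, not the actual patch: is_valid_utf8_char() and copy_utf8_spans() are made-up names standing in for the core's isUTF8_CHAR()-style checks, and the simplified validator here accepts only ASCII and 2-byte sequences.

```c
#include <stddef.h>
#include <string.h>

/* Return the length of the valid UTF-8 character starting at s, or 0 if
 * invalid.  Toy version: accepts only ASCII and 2-byte sequences. */
static size_t is_valid_utf8_char(const unsigned char *s,
                                 const unsigned char *end)
{
    if (s[0] < 0x80)
        return 1;                           /* ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF       /* 2-byte lead, not overlong */
        && end - s >= 2 && (s[1] & 0xC0) == 0x80)
        return 2;
    return 0;                               /* invalid (or unsupported here) */
}

/* Copy len bytes from src to dst, replacing each invalid byte with '?'.
 * Valid runs are copied as whole spans, not character at a time.
 * Returns the number of bytes written. */
static size_t copy_utf8_spans(unsigned char *dst,
                              const unsigned char *src, size_t len)
{
    const unsigned char *s = src;
    const unsigned char *end = src + len;
    unsigned char *d = dst;
    size_t n;

    while (s < end) {
        const unsigned char *span = s;
        /* Stay in the "pipeline" while input is legal. */
        while (s < end && (n = is_valid_utf8_char(s, end)) > 0)
            s += n;
        /* Copy the whole valid span at once. */
        memcpy(d, span, (size_t)(s - span));
        d += s - span;
        if (s < end) {      /* dropped out: handle the invalid byte */
            *d++ = '?';
            s++;            /* skip it and restart the pipeline */
        }
    }
    return (size_t)(d - dst);
}
```

If the whole input is valid, the inner loop runs once over the string and a single memcpy() is done, which is where the speedup for valid input comes from.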
Yes, I figured out how it works.

> The use-case I envision is that the input is supposed to be valid UTF-8,
> and the purpose of process_utf8() is to verify that that is in fact
> true, and to take specified actions when it isn't.

Right!

> Under that use-case, taking longer to deal with invalid input is not a
> problem.  If that is not your use-case, please explain what yours is.

Basically, Encode::decode("UTF-8", $input) is used for converting an
"untrusted" input sequence (e.g. from the network or a local file) to a
perl Unicode scalar.  If the input contains something invalid, then
Encode::decode does whatever is needed to return a valid Unicode string
(= replaces invalid subsequences with the Unicode replacement
character).  So Encode::decode("UTF-8", $input) is there for processing
any input sequence: not only valid ones, but also broken or totally
invalid ones.

> And I think you misunderstand when is_utf8_char_slow() is called.  It is
> called only when the next byte in the input indicates that the only
> legal UTF-8 that might follow would be for a code point that is at least
> U+200000, almost twice as high as the highest legal Unicode code point.
> It is a Perl extension to handle such code points, unlike other
> languages.  But the Perl core is not optimized for them, nor will it be.
> My point is that is_utf8_char_slow() will only be called in very
> specialized cases, and we need not make those cases have as good a
> performance as normal ones.

In strict mode, there is absolutely no need to call is_utf8_char_slow(),
as in strict mode such a sequence must always be invalid (it is above
the last valid Unicode character).  This is what I was trying to say.
And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR()
(which could call is_utf8_char_slow()) and only after that checks
UTF8_IS_SUPER().  So maybe it would make sense to provide a "strict"
version of the isUTF8_CHAR() macro, as such a strict version does not
have to call is_utf8_char_slow().
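For illustration, the observation above can be reduced to a tiny lead-byte pre-check, assuming an ASCII platform.  strict_utf8_lead_ok() is a hypothetical name, not an existing core macro; the point is only that in strict (Unicode-only) mode, any lead byte of 0xF5..0xFF can be rejected outright, since it can only begin a sequence above U+10FFFF, so the extended-length slow path never needs to run.

```c
/* Hypothetical strict-mode lead-byte check (ASCII platform assumed).
 * Returns nonzero if the byte could begin a well-formed sequence for a
 * code point in U+0000..U+10FFFF. */
static int strict_utf8_lead_ok(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* ASCII */
    if (lead < 0xC2) return 0;   /* continuation byte or overlong lead */
    if (lead <= 0xF4) return 1;  /* may start U+0080 .. U+10FFFF */
    return 0;                    /* 0xF5..0xFF: only above U+10FFFF */
}
```

(0xF4 still needs the usual follow-up checks, since F4 90..BF sequences land above U+10FFFF; but nothing starting with 0xF5 or higher can ever be strict-valid.)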
> >> On invalid input, performance should be a minor consideration.
> >
> > See below...
>
> See above. :)
>
> >> The inner loop is much tighter in both functions; likely it can be
> >> held in the cache.  The algorithm avoids a bunch of work compared to
> >> the previous one.
> >
> > Right, for valid input the algorithm is really faster.  Whether it is
> > because of fewer cache misses... maybe... I can play with perf or
> > another tool to see where the bottleneck is now.
>
> >> I doubt that it will be slower than that.  The only way to know in
> >> any performance situation is to actually test.  And know that things
> >> will be different depending on the underlying hardware, so only large
> >> differences are really significant.
> >
> > So, here are my test results.  You can say that they are subjective,
> > but I would be happy if somebody provided better input for performance
> > tests.
> >
> > Abbreviations:
> >   strict = Encode::encode/decode "UTF-8"
> >   lax    = Encode::encode/decode "utf8"
> >   int    = utf8::encode/decode
> >   orig   = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
> >   your   = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
> >   my     = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d
> >
> > Test cases:
> >   all           = join "", map { chr } 0 .. 0x10FFFF
> >   short         = "žluťoučký kůň pěl ďábelské ódy " x 45
> >   long          = $short x 1000
> >   invalid-short = "\xA0" x 1000
> >   invalid-long  = "\xA0" x 1000000
> >
> > Encoding was called on a string with the Encode::_utf8_on() flag set.
> > Rates:
> >
> > encode:
> >                       all      short    long  invalid-short  invalid-long
> > orig - strict        41/s   124533/s   132/s       115197/s         172/s
> > your - strict       176/s   411523/s   427/s        54813/s          66/s
> > my   - strict        80/s   172712/s   186/s       113787/s         138/s
> >
> > orig - lax         1010/s  3225806/s  6250/s       546800/s        5151/s
> > your - lax          952/s  3225806/s  5882/s       519325/s        4919/s
> > my   - lax         1060/s  3125000/s  6250/s       645119/s        5009/s
> >
> > orig - int      8154604/s 10000000/s  infty       9787566/s     9748151/s
> > your - int      9135243/s 11111111/s  infty       8922821/s     9737657/s
> > my   - int      9779395/s 10000000/s  infty       9822046/s     8949861/s
> >
> > decode:
> >                       all      short    long  invalid-short  invalid-long
> > orig - strict        39/s   119048/s   131/s       108574/s         171/s
> > your - strict       173/s   353357/s   442/s        42440/s          55/s
> > my   - strict        69/s   166667/s   182/s       117291/s         135/s
> >
> > orig - lax           39/s   123609/s   137/s       127302/s         172/s
> > your - lax          230/s   393701/s   495/s        37346/s          65/s
> > my   - lax           79/s   158983/s   180/s       121456/s         138/s
> >
> > orig - int          274/s   546448/s   565/s      8219513/s       12357/s
> > your - int          273/s   540541/s   562/s      7226066/s       12948/s
> > my   - int          274/s   543478/s   562/s      8502902/s       12421/s
> >
> > int is there just for verification of the tests, as the
> > utf8::encode/decode functions were not changed.
> >
> > Results: your patch is faster for valid sequences (as you wrote
> > above), but slower for invalid ones (in some cases radically so).
> >
> > So I would propose two optimizations:
> >
> > 1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with a
> >    new one which does not call utf8n_to_uvchr.  That call is not
> >    needed, as in that case the sequence is already invalid.
>
> And an optimizing compiler should figure that out and omit the call, so
> we shouldn't have to manually.  In the next few months I will be working
> on some fixes to the :utf8 layer.  That could lead to a core function
> similar to this, but without relying on the compiler to figure things
> out.

Great!  But the :utf8 layer cannot be used for reading "untrusted"
arbitrary files...
> > 2) Try to make an inline version of is_utf8_string_loc().  Maybe merge
> >    it with is_strict_utf8_string_loc()?  That should boost the
> >    non-strict decoder for invalid sequences (now it is slower than the
> >    strict one).
>
> I'll look at which functions in utf8.c should be inlined.  This is a
> candidate.  But I doubt that that is why this is slower in this case.
> Read blead's perlhack.pod for performance testing tool hints.
>
> Your comments caused me to realize that the call to utf8n_to_uvchr()
> when the input is syntactically valid, but is an illegal code point
> (like a surrogate, etc.) could be replaced by the faster
> valid_utf8_to_uvchr(), which has not been in the public API.  I have
> pushed to
>
> http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode
>
> a version of blead which has this function made public, and inlined, in
> case you want to test with it.
>
> There are some things in the error handling code that could be improved
> upon.  For example, memcpy() of a single character is slower than just
> copying a byte at a time, as the function call overhead dwarfs the
> savings.

This one memcpy is OK.  It is called on a constant string, and I checked
that gcc 4.6 with -O2 on x86_64 already replaces that call with inline
instructions; no memcpy function call is emitted.  But maybe it could be
rewritten... I have no idea whether other compilers have such
optimizations for strlen/strcpy/memcpy calls on constant strings when
the length is known.

> > And maybe it would make sense to make all the needed functions part of
> > the public API.
>
> Yes
>
> > Are you going to prepare a pull request for the Encode module?
>
> I will, after everything is settled.  This may include changes to
> Devel::PPPort to make sure this works on all perls that Encode supports.
>
> However, I think it would be good to get the extra tests of malformed
> inputs into the distribution.  (I haven't looked at your PR yet for
> flaws.
> I'll do that, and hopefully will find none, so will recommend your PR
> be pulled.)

I added some tests for overlong sequences.  Only for ASCII platforms;
tests for EBCDIC are missing (sorry, I do not have access to any EBCDIC
platform for testing).

> > Anyway, how does it behave on EBCDIC platforms?  And maybe another
> > question: what should Encode::encode('UTF-8', $str) do on EBCDIC?
> > Encode $str to UTF-8 or to UTF-EBCDIC?
>
> It works fine on EBCDIC platforms.  There are other bugs in Encode on
> EBCDIC that I plan on investigating as time permits.  Doing this has
> fixed some of these for free.  The uvuni() functions should in almost
> all instances be uvchr(), and my patch does that.

Now I'm wondering whether the FBCHAR_UTF8 define also works on EBCDIC...
I think it should be different for UTF-EBCDIC.

> On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa
> if you prefer), so $str will effectively be in the version of
> UTF-EBCDIC valid for the platform it is running on (there are
> differences depending on the platform's underlying code page).

So does it mean that on EBCDIC platforms you cannot process a file which
is encoded in UTF-8?  As Encode::decode("UTF-8", $str) expects $str to
be in UTF-EBCDIC and not in UTF-8 (as I understood it).
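As a side note on the constant-string memcpy() point earlier in the thread: with a compile-time-constant length, gcc and other modern compilers typically expand memcpy() into plain store instructions, so it ends up as the same code as manual byte-at-a-time copies.  A minimal sketch, assuming an ASCII platform (where the replacement character U+FFFD is the three bytes EF BF BD, which is what FBCHAR_UTF8 holds there); the helper names are made up for illustration:

```c
#include <string.h>

/* Write U+FFFD (UTF-8: EF BF BD) via memcpy with a constant length.
 * At -O2, gcc expands this inline; no library call is emitted. */
static void put_replacement_memcpy(unsigned char *d)
{
    memcpy(d, "\xEF\xBF\xBD", 3);
}

/* The same thing written as explicit byte stores; compiles to
 * essentially identical code. */
static void put_replacement_bytes(unsigned char *d)
{
    d[0] = 0xEF;
    d[1] = 0xBF;
    d[2] = 0xBD;
}
```

Whether every compiler Encode supports performs this expansion is, as noted above, an open question; the generated code can be checked with `gcc -O2 -S`.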