On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
> >On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>Top posting.
> >>
> >>Attached is my alternative patch. It effectively uses a different
> >>algorithm to avoid decoding the input into code points, and to copy
> >>all spans of valid input at once, instead of character at a time.
> >>
> >>And it uses only currently available functions.
> >
> >And that's the problem. As already wrote in previous email, calling
> >function from shared library cannot be heavy optimized as inlined
> >function and cause slow down. You are calling is_utf8_string_loc for
> >non-strict mode which is not inlined and so encode/decode of non-strict
> >mode will be slower...
> >
> >And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR which
> >is calling _is_utf8_char_slow and which is calling utf8n_to_uvchr which
> >cannot be inlined too...
> >
> >Therefore I think this is not good approach...
> >
>
> Then you should run your benchmarks to find out the performance.

You are right, benchmarks are needed to show the final results.

> On valid input, is_utf8_string_loc() is called once per string. The
> function call overhead and non-inlining should be not noticeable.

Ah right, I misread it as being called once per valid sequence, not
once for the whole string. You are right.

> On valid input, is_utf8_char_slow() is never called. The used-parts can be
> inlined.

Yes, but this function is there to be called primarily on unknown
input, which does not have to be valid. If I know that the input is
valid, then utf8::encode/decode is enough :-)

> On invalid input, performance should be a minor consideration.

See below...

> The inner loop is much tighter in both functions; likely it can be held in
> the cache. The algorithm avoids a bunch of work compared to the previous
> one.

Right, for valid input the algorithm is really faster. Whether that is
because of fewer cache misses... maybe... I can play with perf or
another tool to see where the bottleneck is now.

> I doubt that it will be slower than that. The only way to know in any
> performance situation is to actually test. And know that things will be
> different depending on the underlying hardware, so only large differences
> are really significant.

So, here are my test results. You can say that they are subjective, but
I would be happy if somebody provided better input for performance
tests.

Abbreviations:
strict = Encode::encode/decode "UTF-8"
lax    = Encode::encode/decode "utf8"
int    = utf8::encode/decode
orig   = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
your   = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
my     = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d

Test cases:
all           = join "", map { chr } 0 .. 0x10FFFF
short         = "žluťoučký kůň pěl ďábelské ódy " x 45
long          = $short x 1000
invalid-short = "\xA0" x 1000
invalid-long  = "\xA0" x 1000000

Encoding was called on strings that had the UTF8 flag turned on with
Encode::_utf8_on().

Rates:

encode:              all       short    long  invalid-short  invalid-long
orig - strict       41/s    124533/s   132/s       115197/s         172/s
your - strict      176/s    411523/s   427/s        54813/s          66/s
my - strict         80/s    172712/s   186/s       113787/s         138/s
orig - lax        1010/s   3225806/s  6250/s       546800/s        5151/s
your - lax         952/s   3225806/s  5882/s       519325/s        4919/s
my - lax          1060/s   3125000/s  6250/s       645119/s        5009/s
orig - int     8154604/s  10000000/s   infty      9787566/s     9748151/s
your - int     9135243/s  11111111/s   infty      8922821/s     9737657/s
my - int       9779395/s  10000000/s   infty      9822046/s     8949861/s

decode:              all       short    long  invalid-short  invalid-long
orig - strict       39/s    119048/s   131/s       108574/s         171/s
your - strict      173/s    353357/s   442/s        42440/s          55/s
my - strict         69/s    166667/s   182/s       117291/s         135/s
orig - lax          39/s    123609/s   137/s       127302/s         172/s
your - lax         230/s    393701/s   495/s        37346/s          65/s
my - lax            79/s    158983/s   180/s       121456/s         138/s
orig - int         274/s    546448/s   565/s      8219513/s       12357/s
your - int         273/s    540541/s   562/s      7226066/s       12948/s
my - int           274/s    543478/s   562/s      8502902/s       12421/s

int is there just to verify the tests, as the utf8::encode/decode
functions were not changed.

The results are: your patch is faster for valid sequences (as you wrote
above), but slower for invalid ones, in some cases radically (for
example, strict decode of invalid-long drops from 171/s to 55/s).

So I would propose two optimizations:

1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with a
   new one that does not call utf8n_to_uvchr. That call is not needed,
   as in that case the sequence is already invalid.

2) Try to make an inline version of is_utf8_string_loc(). Maybe merge
   it with is_strict_utf8_string_loc()? That should boost the
   non-strict decoder for invalid sequences (now it is slower than the
   strict one).

And maybe it would make sense to make all the needed functions part of
the public API.

Are you going to prepare a pull request for the Encode module?

Anyway, how does it behave on EBCDIC platforms? And maybe another
question: what should Encode::encode('UTF-8', $str) do on EBCDIC?
Encode $str to UTF-8 or to UTF-EBCDIC?
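
To make the "copy whole valid spans" idea and optimization 1 above more concrete, here is a minimal standalone sketch in C. It is not the code from either patch and it uses none of Perl's API: the names utf8_seq_len, valid_prefix_end and sketch_copy_valid_spans are hypothetical. The validity rules are plain Unicode well-formedness (no overlongs, no surrogates, nothing above U+10FFFF), which may differ from what Encode's strict and lax modes accept. Replacing each offending byte with one U+FFFD is also just an assumption for the illustration. The point is only that a sequence can be rejected by looking at byte ranges, without ever decoding a code point, and that valid runs can be copied with a single memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length (1..4) of the well-formed UTF-8 sequence starting at s, or 0 if
 * the bytes at s do not start one. Requires s < end. Only byte ranges are
 * compared; no code point is computed. */
static size_t utf8_seq_len(const unsigned char *s, const unsigned char *end)
{
    size_t avail = (size_t)(end - s);
    unsigned char c = s[0];
    unsigned char lo, hi;

    if (c < 0x80)
        return 1;
    if (c >= 0xC2 && c <= 0xDF)              /* C0, C1 would be overlong */
        return (avail >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    if (c >= 0xE0 && c <= 0xEF) {
        lo = (c == 0xE0) ? 0xA0 : 0x80;      /* E0 80..9F is overlong */
        hi = (c == 0xED) ? 0x9F : 0xBF;      /* ED A0..BF is a surrogate */
        if (avail < 3 || s[1] < lo || s[1] > hi || (s[2] & 0xC0) != 0x80)
            return 0;
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        lo = (c == 0xF0) ? 0x90 : 0x80;      /* F0 80..8F is overlong */
        hi = (c == 0xF4) ? 0x8F : 0xBF;      /* F4 90..BF is above U+10FFFF */
        if (avail < 4 || s[1] < lo || s[1] > hi ||
            (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80)
            return 0;
        return 4;
    }
    return 0;                                /* stray continuation, C0, C1, F5..FF */
}

/* Pointer just past the longest well-formed prefix of [s, end). */
static const unsigned char *valid_prefix_end(const unsigned char *s,
                                             const unsigned char *end)
{
    while (s < end) {
        size_t n = utf8_seq_len(s, end);
        if (n == 0)
            break;
        s += n;
    }
    return s;
}

/* Copy src to dst, one valid span per memcpy. Each byte that does not
 * start a well-formed sequence becomes U+FFFD (EF BF BD). dst must have
 * room for 3 * len bytes. Returns the number of bytes written. */
size_t sketch_copy_valid_spans(const unsigned char *src, size_t len,
                               unsigned char *dst)
{
    static const unsigned char fffd[3] = { 0xEF, 0xBF, 0xBD };
    const unsigned char *s = src;
    const unsigned char *end = src + len;
    unsigned char *d = dst;

    assert(src != NULL && dst != NULL);
    while (s < end) {
        const unsigned char *stop = valid_prefix_end(s, end);
        size_t span = (size_t)(stop - s);

        memcpy(d, s, span);                  /* whole valid span at once */
        d += span;
        s = stop;
        if (s < end) {                       /* stopped on a bad byte */
            memcpy(d, fffd, sizeof fffd);
            d += sizeof fffd;
            s++;
        }
    }
    return (size_t)(d - dst);
}
```

With this shape, fully valid input costs one scan and one memcpy per string. Input like "\xA0" x 1000000 falls back to one loop iteration per bad byte, which is the case where the cost of the per-sequence check matters most.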