On Monday 22 August 2016 21:43:59 Karl Williamson wrote:
> On 08/22/2016 07:05 AM, [email protected] wrote:
> > On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> >> On 08/21/2016 02:34 AM, [email protected] wrote:
> >>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>>> Top posting.
> >>>>
> >>>> Attached is my alternative patch. It effectively uses a different
> >>>> algorithm to avoid decoding the input into code points, and to copy
> >>>> all spans of valid input at once, instead of a character at a time.
> >>>>
> >>>> And it uses only currently available functions.
> >>>
> >>> And that's the problem. As I already wrote in a previous email, a call
> >>> to a function in a shared library cannot be optimized as heavily as an
> >>> inlined function, and it causes a slowdown. You are calling
> >>> is_utf8_string_loc() for non-strict mode; it is not inlined, so
> >>> encode/decode in non-strict mode will be slower...
> >>>
> >>> Also, in is_strict_utf8_string_loc() you are calling isUTF8_CHAR(),
> >>> which calls _is_utf8_char_slow(), which in turn calls utf8n_to_uvchr();
> >>> those cannot be inlined either...
> >>>
> >>> Therefore I think this is not a good approach...
> >>>
> >>
> >> Then you should run your benchmarks to find out the performance.
> >
> > You are right, benchmarks are needed to show final results.
> >
> >> On valid input, is_utf8_string_loc() is called once per string. The
> >> function call overhead and non-inlining should not be noticeable.
> >
> > Ah right, I misread it; I thought it was called once per valid sequence,
> > not once for the whole string. You are right.
>
> It is called once per valid sequence. See below.
>
> >
> >> On valid input, is_utf8_char_slow() is never called. The parts that are
> >> used can be inlined.
> >
> > Yes, but this function is there to be called primarily on unknown input,
> > which does not have to be valid. If I know that the input is valid, then
> > utf8::encode/decode is enough :-)
>
> What process_utf8() does is to copy the alleged UTF-8 input to the
> output, verifying along the way that it actually is legal UTF-8 (with 2
> levels of strictness, depending on the input parameter), and taking some
> actions (exactly what depends on other input parameters) if and when it
> finds invalid UTF-8.
>
> The way it works after my patch is like an instruction pipeline. You
> start it up, and it stays in the pipeline as long as the next character
> in the input is legal, or until it reaches the end. When it finds illegal
> input, it drops out of the pipeline, handles that, and starts up the
> pipeline to process any remaining input. If the entire input string is
> valid, a single instance of the pipeline is all that gets invoked.
Yes, I figured out how it works.
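In pseudo-C, the loop is something like this (my sketch only, not the
actual patch; s, send, and d are the usual input/end/output pointers,
is_utf8_string_loc() and Copy() are existing perl APIs, and
handle_malformation() is a hypothetical placeholder for whatever action
the caller requested):

    while (s < send) {
        const U8 *ep;
        bool ok = is_utf8_string_loc(s, send - s, &ep);

        Copy(s, d, ep - s, U8);  /* copy the whole valid span at once */
        d += ep - s;
        s = ep;

        if (ok)
            break;               /* end of input: everything was valid */

        /* An invalid sequence starts at s: take the requested action
         * (croak, substitute U+FFFD, ...), skip past it, and restart
         * the pipeline on the remaining input. */
        s = handle_malformation(s, send, &d);
    }

On fully valid input, the loop body runs exactly once.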
> The use-case I envision is that the input is supposed to be valid UTF-8,
> and the purpose of process_utf8() is to verify that that is in fact
> true, and to take specified actions when it isn't.
Right!
> Under that use-case,
> taking longer to deal with invalid input is not a problem. If that is
> not your use-case, please explain what yours is.
Basically Encode::decode("UTF-8", $input) is used for converting "untrusted"
input
sequence (e.g. from network or local file) to perl Unicode scalar. And if input
contains something invalid, then Encode::decode do anything needed to return
valid
Unicode string (= replace invalid subsequences by Unicode replacement
character). So
Encode::decode("UTF-8", $input) is there for processing any input sequences,
not only
valid, also broken or totally invalid.
> And I think you misunderstand when is_utf8_char_slow() is called. It is
> called only when the next byte in the input indicates that the only
> legal UTF-8 that might follow would be for a code point that is at least
> U+200000, almost twice as high as the highest legal Unicode code point.
> It is a Perl extension to handle such code points, unlike other
> languages. But the Perl core is not optimized for them, nor will it be.
> My point is that is_utf8_char_slow() will only be called in very
> specialized cases, and we need not make those cases have as good a
> performance as normal ones.
In strict mode there is absolutely no need to call is_utf8_char_slow(),
because in strict mode such a sequence must always be invalid (it is
above the last valid Unicode character). This is what I was trying to
say.

Currently, is_strict_utf8_string_loc() first calls isUTF8_CHAR() (which
could call is_utf8_char_slow()) and only afterwards checks
UTF8_IS_SUPER(). So maybe it would make sense to provide a "strict"
version of the isUTF8_CHAR() macro, since such a strict version does not
have to call is_utf8_char_slow() at all.
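On ASCII platforms, a sketch of such a macro could look like this
(isSTRICT_UTF8_CHAR is just an illustrative name; the UTF8_IS_SUPER()
and surrogate checks that is_strict_utf8_string_loc() already performs
would still follow it):

    /* Start bytes 0xF8 and above can only begin sequences for code
     * points of at least U+200000, which are never valid strict UTF-8,
     * so reject them outright; the is_utf8_char_slow() path of
     * isUTF8_CHAR() is then never reached. */
    #define isSTRICT_UTF8_CHAR(s, e)                \
        (UNLIKELY(*(const U8 *)(s) >= 0xF8)         \
            ? 0                                     \
            : isUTF8_CHAR((s), (e)))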
> >> On invalid input, performance should be a minor consideration.
> >
> > See below...
>
> See above. :)
>
> >
> >> The inner loop is much tighter in both functions; likely it can be held in
> >> the cache. The algorithm avoids a bunch of work compared to the previous
> >> one.
> >
> > Right, for valid input the algorithm is really faster. Whether it is
> > because of fewer cache misses... maybe... I can play with perf or
> > another tool to see where the bottleneck is now.
> >
> >> I doubt that it will be slower than that. The only way to know in any
> >> performance situation is to actually test. And know that things will be
> >> different depending on the underlying hardware, so only large differences
> >> are really significant.
> >
> > So, here are my test results. You can say that they are subjective, but
> > I would be happy if somebody provided better input for performance tests.
> >
> > Abbreviations:
> > strict = Encode::encode/decode "UTF-8"
> > lax = Encode::encode/decode "utf8"
> > int = utf8::encode/decode
> > orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
> > your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
> > my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d
> >
> > Test cases:
> > all = join "", map { chr } 0 .. 0x10FFFF
> > short = "žluťoučký kůň pěl ďábelské ódy " x 45
> > long = $short x 1000
> > invalid-short = "\xA0" x 1000
> > invalid-long = "\xA0" x 1000000
> >
> > Encoding was called on a string with the Encode::_utf8_on() flag set.
> >
> >
> > Rates:
> >
> > encode:
> >                      all       short     long  invalid-short  invalid-long
> > orig - strict       41/s    124533/s    132/s       115197/s         172/s
> > your - strict      176/s    411523/s    427/s        54813/s          66/s
> > my   - strict       80/s    172712/s    186/s       113787/s         138/s
> >
> > orig - lax        1010/s   3225806/s   6250/s       546800/s        5151/s
> > your - lax         952/s   3225806/s   5882/s       519325/s        4919/s
> > my   - lax        1060/s   3125000/s   6250/s       645119/s        5009/s
> >
> > orig - int     8154604/s  10000000/s    infty      9787566/s     9748151/s
> > your - int     9135243/s  11111111/s    infty      8922821/s     9737657/s
> > my   - int     9779395/s  10000000/s    infty      9822046/s     8949861/s
> >
> >
> > decode:
> >                      all       short     long  invalid-short  invalid-long
> > orig - strict       39/s    119048/s    131/s       108574/s         171/s
> > your - strict      173/s    353357/s    442/s        42440/s          55/s
> > my   - strict       69/s    166667/s    182/s       117291/s         135/s
> >
> > orig - lax          39/s    123609/s    137/s       127302/s         172/s
> > your - lax         230/s    393701/s    495/s        37346/s          65/s
> > my   - lax          79/s    158983/s    180/s       121456/s         138/s
> >
> > orig - int         274/s    546448/s    565/s      8219513/s       12357/s
> > your - int         273/s    540541/s    562/s      7226066/s       12948/s
> > my   - int         274/s    543478/s    562/s      8502902/s       12421/s
> >
> >
> > int is there just as a sanity check of the tests, as the
> > utf8::encode/decode functions were not changed.
> >
> > The results: your patch is faster for valid sequences (as you wrote
> > above), but slower for invalid ones (in some cases radically).
> >
> > So I would propose two optimizations:
> >
> > 1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with a
> >    new one which does not call utf8n_to_uvchr(). That call is not
> >    needed, as in that case the sequence is already invalid.
>
> And an optimizing compiler should figure that out and omit the call, so
> we shouldn't have to do it manually. In the next few months I will be working
> on some fixes to the :utf8 layer. That could lead to a core function
> similar to this, but without relying on the compiler to figure things out.
Great! But the :utf8 layer cannot be used for reading "untrusted"
arbitrary files...
> > 2) Try to make an inline version of is_utf8_string_loc(). Maybe merge
> >    it with is_strict_utf8_string_loc()? That should boost the non-strict
> >    decoder for invalid sequences (it is now slower than the strict one).
>
> I'll look at which functions in utf8.c should be inlined. This is a
> candidate. But I doubt that that is why this is slower in this case.
> Read blead's perlhack.pod for performance testing tool hints.
>
> Your comments caused me to realize that the call to utf8n_to_uvchr()
> when the input is syntactically valid, but is an illegal code point
> (like a surrogate, etc) could be replaced by the faster
> valid_utf8_to_uvchr(), which has not been in the public API. I have
> pushed to
>
> http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode
>
> a version of blead which has this function made public, and inlined, in
> case you want to test with it.
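Nice. If I understand it correctly, the swap amounts to roughly this (my
sketch only; valid_utf8_to_uvchr() may be called only on bytes already
known to be well formed):

    /* isUTF8_CHAR() has already verified the sequence at s, so the
     * code point can be extracted without re-validating the bytes. */
    STRLEN retlen;
    UV uv = valid_utf8_to_uvchr(s, &retlen);

    if (UNICODE_IS_SURROGATE(uv) || UNICODE_IS_SUPER(uv)) {
        /* syntactically valid, but rejected in strict mode */
    }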
>
> There are some things in the error handling code that could be improved
> upon. For example, memcpy() of a single character is slower than just
> copying a byte at a time, as the function call overhead dwarfs the savings.
This one memcpy is OK. It is called on a constant string, and I checked
that gcc 4.6 with -O2 on x86_64 already replaces the call with inline
instructions; no memcpy function call is emitted.
But maybe it could be rewritten... I have no idea whether other compilers
optimize strlen/strcpy/memcpy calls on constant strings when the length
is known.
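For illustration, the call is of this shape (FBCHAR_UTF8 is Encode's
define for the replacement-character bytes on ASCII platforms); both the
source and the length are compile-time constants, so the compiler can
expand it inline:

    #define FBCHAR_UTF8 "\xEF\xBF\xBD"  /* U+FFFD REPLACEMENT CHARACTER */

    /* the length is the constant 3, so gcc -O2 emits a few inline
     * moves instead of an actual call to memcpy() */
    memcpy(d, FBCHAR_UTF8, sizeof(FBCHAR_UTF8) - 1);
    d += sizeof(FBCHAR_UTF8) - 1;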
> >
> > And maybe it would make sense to make all the needed functions part of
> > the public API.
>
> Yes
>
> >
> > Are you going to prepare a pull request for the Encode module?
>
> I will, after everything is settled. This may include changes to
> Devel::PPPort to make sure this works on all perls that Encode supports.
>
> However, I think it would be good to get the extra tests of malformed
> inputs into the distribution. (I haven't looked at your PR yet for
> flaws. I'll do that, and hopefully will find none, so will recommend
> your PR be pulled.)
I added some tests for overlong sequences, but only for ASCII platforms;
tests for EBCDIC are missing (sorry, I do not have access to any EBCDIC
platform for testing).
> > Anyway, how does it behave on EBCDIC platforms? And another question:
> > what should Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
> > UTF-8 or to UTF-EBCDIC?
>
> It works fine on EBCDIC platforms. There are other bugs in Encode on
> EBCDIC that I plan on investigating as time permits. Doing this has
> fixed some of these for free. The uvuni() functions should in almost
> all instances be uvchr(), and my patch does that.
Now I'm wondering whether the FBCHAR_UTF8 define also works on EBCDIC...
I think it needs to be different for UTF-EBCDIC.
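One portable possibility (a sketch only, untested on EBCDIC) would be to
build the bytes at runtime with the public uvchr_to_utf8(), which
produces the native encoding, i.e. UTF-EBCDIC on EBCDIC platforms:

    /* Native "UTF-8" (UTF-EBCDIC on EBCDIC) encoding of U+FFFD, built
     * at runtime instead of hard-coding the ASCII-platform bytes. */
    U8 fbchar[UTF8_MAXBYTES + 1];
    STRLEN fbchar_len = uvchr_to_utf8(fbchar, 0xFFFD) - fbchar;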
> On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if
> you prefer), so $str will effectively be in the version of UTF-EBCDIC
> valid for the platform it is running on (there are differences depending
> on the platform's underlying code page).
So does that mean that on EBCDIC platforms you cannot process a file
which is encoded in (ASCII-platform) UTF-8? As I understand it,
Encode::decode("UTF-8", $str) there expects $str to be in UTF-EBCDIC,
not in UTF-8.