In perl.git, the branch smoke-me/khw-encode has been created <http://perl5.git.perl.org/perl.git/commitdiff/655eab69b102d09f7e6643cbb78877097b23d994?hp=0000000000000000000000000000000000000000>
at 655eab69b102d09f7e6643cbb78877097b23d994 (commit)

- Log -----------------------------------------------------------------
commit 655eab69b102d09f7e6643cbb78877097b23d994
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Aug 20 15:16:06 2016 -0600

    Speed up Encode UTF-8 validation checking

    This replaces the current scheme for checking UTF-8 validity by one
    in which normal processing doesn't require having to decode the
    UTF-8 into code points. The copying of characters individually from
    the input to the output is changed to be a single operation for each
    entire span of valid input at once. Thus in the normal case, what
    ends up happening is a tight loop to check the validity, and then a
    memmove of the entire input to the output, then return.

    If an error is found, it copies all the valid input before the
    error, then handles the character in error, then positions to the
    next input position, and repeats the whole process starting from
    there.

    It uses the functionality available from the Perl 5 core to look at
    just the bytes that comprise the UTF-8 to make the determination,
    converting to code points only those that are defective somehow, in
    order to display them in warnings and error messages. Thus, this
    does not need to know about the intricacies of UTF-8 malformations,
    relying on the core to handle this.

    This cannot be pushed to CPAN until Devel::PPPort has been updated
    to implement all the functions now needed.

M cpan/Encode/Encode.pm
M cpan/Encode/Encode.xs
M inline.h
M t/porting/customized.dat

commit 9ccc3ecd1119ccdb64e91b1f03376916aa8cc6f7
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 22:13:38 2016 -0600

    XXX Experimental: Unroll loop in valid_utf8_to_uvchr

    Doing something like this didn't speed things up before, but now
    that the function is inline, it could. Needs to be tested.
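[Editorial illustration, not part of the commit.] The span-at-a-time copying described in commit 655eab6 above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the Encode.xs code: utf8_seq_len() and copy_valid_spans() are hypothetical names, only standard UTF-8 on ASCII platforms is handled (no Perl extended UTF-8, no EBCDIC), and each offending byte is simply replaced by '?', whereas Encode's real error handling depends on its check flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the well-formed UTF-8 character
 * starting at s, or 0 if it is malformed or runs past end.  It looks at
 * bytes only and never builds a code point. */
static size_t utf8_seq_len(const unsigned char *s, const unsigned char *end)
{
    size_t avail = (size_t)(end - s);
    size_t need, i;
    unsigned char b0;

    if (avail == 0) return 0;
    b0 = s[0];
    if (b0 < 0x80) return 1;                        /* ASCII */
    else if (b0 >= 0xC2 && b0 <= 0xDF) need = 2;
    else if (b0 >= 0xE0 && b0 <= 0xEF) need = 3;
    else if (b0 >= 0xF0 && b0 <= 0xF4) need = 4;
    else return 0;          /* stray continuation, C0/C1, or F5..FF */

    if (avail < need) return 0;                     /* truncated */
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;        /* not a continuation */

    if (b0 == 0xE0 && s[1] < 0xA0) return 0;        /* overlong */
    if (b0 == 0xED && s[1] > 0x9F) return 0;        /* surrogate */
    if (b0 == 0xF0 && s[1] < 0x90) return 0;        /* overlong */
    if (b0 == 0xF4 && s[1] > 0x8F) return 0;        /* above U+10FFFF */
    return need;
}

/* Copy src to dst one valid span at a time.  Each byte that starts a
 * malformed sequence is replaced by '?' (an assumption of this sketch).
 * dst must have room for len bytes.  Returns the bytes written. */
static size_t copy_valid_spans(unsigned char *dst,
                               const unsigned char *src, size_t len)
{
    const unsigned char *s = src;
    const unsigned char *end = src + len;
    unsigned char *d = dst;

    assert(dst != NULL && src != NULL);
    while (s < end) {
        const unsigned char *span = s;
        size_t n;

        /* Tight validity loop: advance over well-formed characters. */
        while (s < end && (n = utf8_seq_len(s, end)) != 0)
            s += n;

        /* One bulk copy for the whole valid span. */
        memmove(d, span, (size_t)(s - span));
        d += s - span;

        /* Handle the character in error, then resume after it. */
        if (s < end) {
            *d++ = '?';
            s++;
        }
    }
    return (size_t)(d - dst);
}
```

For fully valid input the outer loop runs once: one validation pass followed by a single memmove, which is the normal case the commit message describes.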
M inline.h

commit b65e9a52d8b428146ee554d724b9274f8e77286c
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 22:11:49 2016 -0600

    XXX Experimental: Check validity and bypass in utf8n_to_uvchr

    This may speed this up, but performance needs to be checked. A flag
    can be created if the input is known to be malformed, so that the
    validity check can be skipped.

M utf8.c

commit b19206d64b88d47c6e4dd294a15063d28bf8e7bf
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 22:04:16 2016 -0600

    Use new is_utf8_valid_partial_char()

    This new function can be used in the implementation of the file
    test operators, -B and -T, to see if the whole fixed length buffer
    is valid UTF-8. Previously, if all bytes were UTF-8 except the bytes
    at the end that could have been a partial character, it assumed the
    whole thing was UTF-8. This improves the prediction slightly.

M pp_sys.c

commit 152374e094dd110f6d04e9a9214b406036ce249b
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 10:54:13 2016 -0600

    Add is_utf8_valid_partial_char()

    This new function can test some purported UTF-8 to see if it is
    well-formed as far as it goes. That is, there aren't enough bytes
    for the character they start, but what is there is legal so far.
    This can be useful in a fixed width buffer, where the final
    character is split in the middle, and we want to test without
    waiting for the next read that the entire buffer is valid.

M embed.fnc
M embed.h
M inline.h
M proto.h
M utf8.c

commit 08beb688995f6bdc35c45ddfb4782f13abbbf8f4
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Aug 27 21:17:49 2016 -0600

    XXX (): Add C macros for UTF-8 for BOM and REPLACEMENT CHARACTER

    This makes it easy for module authors to write XS code that can use
    these characters, and be automatically portable to EBCDIC systems.
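[Editorial illustration, not part of the commit.] A rough idea of how string-literal macros for these two characters might be used. The macro names SKETCH_BOM_UTF8 and SKETCH_REPLACEMENT_UTF8 and the helper skip_bom() are hypothetical, chosen for this sketch; the commit message does not give the real names. The byte values shown are the ASCII-platform UTF-8 encodings of U+FEFF and U+FFFD; the point of having the core generate such macros is that the bytes differ on EBCDIC, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; UTF-8 bytes as seen on ASCII platforms only. */
#define SKETCH_BOM_UTF8          "\xEF\xBB\xBF"   /* U+FEFF */
#define SKETCH_REPLACEMENT_UTF8  "\xEF\xBF\xBD"   /* U+FFFD */

/* Return a pointer just past a leading BOM, or s itself if there is
 * none.  len is the number of bytes available at s. */
static const char *skip_bom(const char *s, size_t len)
{
    const size_t bom_len = sizeof(SKETCH_BOM_UTF8) - 1;

    assert(s != NULL);
    if (len >= bom_len && memcmp(s, SKETCH_BOM_UTF8, bom_len) == 0)
        return s + bom_len;
    return s;
}
```

Because the macros are plain string literals, their length is available at compile time via sizeof, and code written against them needs no platform-specific byte values of its own.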
M regen/unicode_constants.pl
M unicode_constants.h

commit 2065eaf909fa52c31e7e07a18dda5310abfb5ea1
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Aug 27 20:08:52 2016 -0600

    Make 3 UTF-8 macros API

    These may be useful to various module writers. They certainly are
    useful for Encode. This makes public API macros to determine if the
    input UTF-8 represents (one macro for each category)
        a) a surrogate code point
        b) a non-character code point
        c) a code point that is above Unicode's legal maximum.

    The macros are machine generated. In making them public, I am now
    using the string end location parameter to guard against running
    off the end of the input. Previously this parameter was ignored, as
    their use in the core could be tightly controlled so that we already
    knew that the string was long enough when calling these macros. But
    this can't be guaranteed in the public API. An optimizing compiler
    should be able to remove redundant length checks.

M regcharclass.h
M regen/regcharclass.pl
M utf8.h

commit cd9e51ea74aa33f0916bbf283a3d2bcfff40a7b6
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:47:32 2016 -0600

    utf8.c: Add comments

M utf8.c

commit c93869382792418a240c00b2dca8e46b374f9b00
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:53:00 2016 -0600

    is_utf8_string() is now a pure function

    as of the previous commit

M embed.fnc
M inline.h
M proto.h

commit bdcb4496f8693b74dbca6984049997190125eb1f
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:29:54 2016 -0600

    Move isUTF8_CHAR helper function, and reimplement it

    The macro isUTF8_CHAR calls a helper function for code points higher
    than it can handle. That function had been an inlined wrapper around
    utf8n_to_uvchr().

    The function has been rewritten to not call utf8n_to_uvchr(), so it
    is now too big to be effectively inlined. Instead, it implements a
    faster method of checking the validity of the UTF-8 without having
    to decode it.
    It just checks for valid syntax and now knows where the few
    discontinuities are in UTF-8 where overlongs can occur, and uses a
    string compare to verify that overflow won't occur.

    As a result this is now a pure function.

    This also means that a deprecation warning that was previously
    generated no longer is, because printing UTF-8 no longer requires
    converting it to internal form. I could add a check for that, but I
    think it's best not to. If you manipulated what is getting printed
    in any way, the deprecation message will already have been raised.

    This commit also fleshes out the documentation of isUTF8_CHAR.

M embed.fnc
M embed.h
M inline.h
M proto.h
M t/lib/warnings/utf8
M utf8.c
M utf8.h

commit 33de6cc8a77534112d767db5cba089e95bd2cd33
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:23:24 2016 -0600

    Add #defines for UTF-8 of highest representable code point

    This will allow the next commit to not have to actually try to
    decode the UTF-8 string in order to see if it overflows the
    platform.

M utf8.h
M utfebcdic.h

commit fbfd68864e6774db2839402550b69fb3045d981c
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:21:25 2016 -0600

    utf8.h: Add some LIKELY() to help branch prediction

    This macro gives the legal UTF-8 byte sequences. Almost always, the
    input will be legal, so help compiler branch prediction for that.
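[Editorial illustration, not part of the commit.] A small sketch of the kind of branch hint the LIKELY() commit describes. SKETCH_LIKELY and is_two_byte_char() are hypothetical names for this example; the sketch assumes a GCC- or Clang-compatible compiler for __builtin_expect and falls back to a plain expression elsewhere. It covers only two-byte standard UTF-8 sequences, not the full set the core macro handles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a LIKELY() style hint. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_LIKELY(c)  __builtin_expect(!!(c), 1)
#else
#  define SKETCH_LIKELY(c)  (c)
#endif

/* True if [s, e) begins with a well-formed two-byte UTF-8 character.
 * Each test is marked as likely to succeed, on the assumption that
 * almost all input is legal. */
static int is_two_byte_char(const unsigned char *s, const unsigned char *e)
{
    assert(s != NULL && e >= s);
    return SKETCH_LIKELY(e - s >= 2)
        && SKETCH_LIKELY(s[0] >= 0xC2 && s[0] <= 0xDF)
        && SKETCH_LIKELY((s[1] & 0xC0) == 0x80);
}
```

The hint changes only how the compiler lays out the code; the result of the test is the same with or without it, so any gain has to be confirmed by measurement.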
M utf8.h

commit 14adab0723d1d634acbe8f1b1b829aafba429d41
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:07:22 2016 -0600

    utf8.h, utfebcdic.h: Add comments, align white space

M utf8.h
M utfebcdic.h

commit 4b130cec628f36cbaeafa50e069b2c5383b51dff
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 15:53:36 2016 -0600

    Inline is_utf8_string() and is_utf8_stringloclen()

M embed.fnc
M inline.h
M proto.h
M utf8.c

commit d9eaab0c5967e85253b7a51bba6692c238d673ec
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 15:03:52 2016 -0600

    Inline utf8_distance(), utf8_hop()

M embed.fnc
M inline.h
M proto.h
M utf8.c

commit a134bfd292800799138c8ddaaa12b164afcd0ae7
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 14:47:17 2016 -0600

    Slightly simplify utf8_to_uvuni_buf()

    Use a function that does the same thing. This also clarifies a
    related comment

M utf8.c

commit 4be9d407bb57c3d17b4897a2b3a942df3822b5eb
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 14:07:50 2016 -0600

    Inline is_utf8_invariant_string()

M embed.fnc
M embed.h
M inline.h
M proto.h
M utf8.c

commit cb1cd48a191bfc7229fa109b53f731035a714c09
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:54:51 2016 -0600

    is_utf8_invariant_string is pure

    As are its synonyms

M embed.fnc
M proto.h

commit da1ae61258a0dc53b4004a675d902799ad86ebe0
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:52:52 2016 -0600

    Simplify slightly is_utf8_invariant_string

    This eliminates an unnecessary branch test in unoptimized code.

M utf8.c

commit b3cbb73593586a3e4ff1baa398af5e7a3f3b1f0f
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:42:53 2016 -0600

    Use new name 'is_utf8_invariant_string' in core

    This changes the places in the core to use the clearer synonym added
    by the previous commit. It also changes one place that hand-rolled
    its own code to use this function instead.
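[Editorial illustration, not part of the commit.] For readers unfamiliar with the term, a string is "UTF-8 invariant" when its bytes mean the same thing whether or not it is treated as UTF-8. On ASCII platforms that amounts to every byte being below 0x80. The sketch below assumes an ASCII platform and uses a hypothetical function name, sketch_is_invariant(); it is not the core's implementation, and the EBCDIC definition of invariance differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch for ASCII platforms: true if none of the len
 * bytes at s has its high bit set.  An empty string counts as
 * invariant. */
static bool sketch_is_invariant(const unsigned char *s, size_t len)
{
    const unsigned char *end;

    assert(s != NULL);
    end = s + len;
    while (s < end) {
        if (*s++ & 0x80)
            return false;
    }
    return true;
}
```

Hand-rolled loops of this shape are what the commit above replaces with a call to the shared function.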
M ext/POSIX/POSIX.xs
M ext/POSIX/lib/POSIX.pm
M locale.c
M mg.c
M pp_sys.c
M sv.c
M toke.c

commit 37a6dc0719a5d27e3751b90faaebba494c081696
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:35:28 2016 -0600

    Add new synonym 'is_utf8_invariant_string'

    This is clearer as to its meaning than the existing
    'is_ascii_string' and 'is_invariant_string', which are retained for
    back compat.

M embed.fnc
M embed.h
M proto.h
M utf8.c
M utf8.h

commit 63e9be9ce8ec0e8644fa5145e4e6887fbb7b1270
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Aug 23 13:37:10 2016 -0600

    embed.fnc: Replace blanks by tabs

    In this file, tabs are the more accepted field delimiter, and having
    them makes it easier to search for particular patterns in it.

M embed.fnc

commit a195aa6ac0d9a343e9b69504e11e37ef7c8bbf48
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:28:21 2016 -0600

    utf8.c: Use 'break' instead of 'goto'

    The goto is a relic of a previous implementation; 'break' is
    preferred if there isn't a reason to use goto.

M utf8.c

commit 4d5766d6736dd54a3e3d4493a1f507808e4130c7
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:25:00 2016 -0600

    is_utf8_string_loc() param should not be NULL

    It makes no sense to call this function with a NULL parameter, as
    the whole point of using this function is to set what that param
    points to. If you don't want this, you should be using the similar
    function that doesn't have this parameter.

M embed.fnc
M proto.h

commit 371633dbdff825b4c8b35a20fc414481b076d942
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:21:06 2016 -0600

    Document valid_utf8_to_uvchr() and inline it

    This function has been in several releases without problem, and is
    short enough that some compilers can inline it. This commit also
    notes that the result should not be ignored, and removes the unused
    pTHX. The function has explicitly been marked as being changeable,
    and has not been part of the API until now.
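[Editorial illustration, not part of the commit.] Decoding input that is already known to be well-formed can skip every validity test, which is what makes such a function short enough to inline. The sketch below shows the idea with the per-length cases written out rather than looped. decode_valid_utf8() is a hypothetical name; the sketch assumes standard UTF-8 of at most four bytes on an ASCII platform, whereas the core function also has to cope with Perl's longer extended sequences and with EBCDIC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decode one character from s, which the caller
 * guarantees is well-formed standard UTF-8.  Stores the number of bytes
 * consumed in *retlen and returns the code point.  No validation is
 * done; malformed input gives a meaningless result. */
static uint32_t decode_valid_utf8(const unsigned char *s, size_t *retlen)
{
    unsigned char b0;

    assert(s != NULL && retlen != NULL);
    b0 = s[0];

    if (b0 < 0x80) {                         /* 1 byte: 0xxxxxxx */
        *retlen = 1;
        return b0;
    }
    if (b0 < 0xE0) {                         /* 2 bytes: 110xxxxx */
        *retlen = 2;
        return ((uint32_t)(b0 & 0x1F) << 6)
             |  (uint32_t)(s[1] & 0x3F);
    }
    if (b0 < 0xF0) {                         /* 3 bytes: 1110xxxx */
        *retlen = 3;
        return ((uint32_t)(b0 & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             |  (uint32_t)(s[2] & 0x3F);
    }
    *retlen = 4;                             /* 4 bytes: 11110xxx */
    return ((uint32_t)(b0 & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         |  (uint32_t)(s[3] & 0x3F);
}
```

The return value is the only output besides the length, which is why ignoring it (as the commit message warns against) would make the call pointless.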
M embed.fnc
M embed.h
M inline.h
M proto.h
M utf8.c

commit a448d6b39a53bdcd78b069176a2adc3c1d2c10f5
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 10:48:55 2016 -0600

    utf8.c: Clarify comments for valid_utf8_to_uvchr()

M utf8.c

commit 217c687a55991a8de07803144a7f210dd90f73e4
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 10:59:48 2016 -0600

    utf8.c: Join EBCDIC/non-EBCDIC code

    This was missed in 534752c1d25d7c52c702337927c37e40c4df103d

M utf8.c

commit 10e7e97296063e39ab24a5e9affb168b5e9d3317
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 15:25:20 2016 -0600

    regen/embed.pl: Allow inline funcs to be named Perl_foo

    When inlining an existing public function whose name begins with
    Perl_, it's best to keep that name, in case someone is calling it
    that way. Prior to this commit, the name had to be changed to S_foo.

M embed.fnc
M regen/embed.pl

-----------------------------------------------------------------------

--
Perl5 Master Repository