In perl.git, the branch smoke-me/khw-encode has been created <http://perl5.git.perl.org/perl.git/commitdiff/a8eb2e025035173ac08bf1371188a189466d64d2?hp=0000000000000000000000000000000000000000>
        at  a8eb2e025035173ac08bf1371188a189466d64d2 (commit)

- Log -----------------------------------------------------------------
commit a8eb2e025035173ac08bf1371188a189466d64d2
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Aug 20 15:16:06 2016 -0600

    Speed up Encode UTF-8 validation checking

    This replaces the current scheme for checking UTF-8 validity by one in
    which normal processing doesn't require having to decode the UTF-8 into
    code points. The copying of characters individually from the input to
    the output is changed to be a single operation for each entire span of
    valid input at once.

    Thus in the normal case, what ends up happening is a tight loop to
    check the validity, and then a memmove of the entire input to the
    output, then return. If an error is found, it copies all the valid
    input before the error, then handles the character in error, then
    positions to the next input position and repeats.

    It uses the functionality available from the Perl 5 core to look at
    just the bytes that comprise the UTF-8 to make the determination,
    converting to code points only those that are defective somehow, in
    order to display them in warnings and error messages. (The core macro
    it calls, isUTF8_CHAR(), currently does convert extremely large code
    points as well, but only those well above any legal Unicode ones, and
    hence extremely unlikely to be encountered in practice.)

    Thus, this does not need to know about the intricacies of UTF-8
    malformations, relying on the core to handle this.

    Not all the core facilities used are in the public API. That was true
    of the implementation this replaces as well. I'm confident enough in
    all the ones it does use to put them in the API.

    I have not looked at previous Perl versions to see how this would work
    on them. That will have to be tested and ppport used to overcome this.
    That should be done anyway to make sure we've got less buggy Unicode
    handling code available to older modules.
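The span-copying scheme described in the message above can be sketched roughly as follows. This is an illustration only, not the code in Encode.xs: utf8_char_len() is a hypothetical stand-in for the core's isUTF8_CHAR() that applies strict Unicode rules (the core also accepts Perl's extended UTF-8), copy_valid_spans() is an invented name, and replacing each malformed byte with '?' is an assumed simplification of Encode's real error handling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for isUTF8_CHAR(): returns the byte length of the
 * well-formed UTF-8 character starting at s, or 0 if it is malformed.
 * It inspects bytes only and never builds a code point. */
static size_t utf8_char_len(const unsigned char *s, const unsigned char *end)
{
    size_t avail = (size_t)(end - s);
    unsigned char b0;

    if (avail == 0) return 0;
    b0 = s[0];
    if (b0 < 0x80) return 1;                        /* ASCII */
    if (b0 >= 0xC2 && b0 <= 0xDF)                   /* 2-byte form */
        return (avail >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {                 /* 3-byte form */
        /* Narrowed ranges reject overlongs (E0) and surrogates (ED). */
        unsigned char lo = (b0 == 0xE0) ? 0xA0 : 0x80;
        unsigned char hi = (b0 == 0xED) ? 0x9F : 0xBF;
        if (avail < 3 || s[1] < lo || s[1] > hi) return 0;
        return ((s[2] & 0xC0) == 0x80) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {                 /* 4-byte form */
        /* Narrowed ranges reject overlongs (F0) and > U+10FFFF (F4). */
        unsigned char lo = (b0 == 0xF0) ? 0x90 : 0x80;
        unsigned char hi = (b0 == 0xF4) ? 0x8F : 0xBF;
        if (avail < 4 || s[1] < lo || s[1] > hi) return 0;
        return ((s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) ? 4 : 0;
    }
    return 0;                                       /* invalid lead byte */
}

/* Copy src to dst one valid span at a time. Each malformed byte is replaced
 * by '?' (an assumption made for this sketch). dst must hold at least len
 * bytes. Returns the number of bytes written. */
static size_t copy_valid_spans(unsigned char *dst,
                               const unsigned char *src, size_t len)
{
    const unsigned char *s = src;
    const unsigned char *end = src + len;
    const unsigned char *span = src;   /* start of the current valid span */
    unsigned char *d = dst;

    while (s < end) {
        size_t n = utf8_char_len(s, end);
        if (n) {                       /* tight loop: validate and advance */
            s += n;
            continue;
        }
        /* Error: flush the valid span before it, handle the bad byte,
         * then resume at the next input position. */
        memmove(d, span, (size_t)(s - span));
        d += s - span;
        *d++ = '?';
        span = ++s;
    }
    /* Normal case: a single memmove of all remaining valid input. */
    memmove(d, span, (size_t)(end - span));
    d += end - span;
    return (size_t)(d - dst);
}
```

For entirely valid input the loop never takes the error branch, so the work reduces to one validation pass followed by one memmove, which is the fast path the message describes.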

M	cpan/Encode/Encode.pm
M	cpan/Encode/Encode.xs
M	t/porting/customized.dat

commit 40646e1822ebcb15f1f70d9153bd3714b2013372
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:28:21 2016 -0600

    utf8.c: Use 'break' instead of 'goto'

    The goto is a relic of a previous implementation; 'break' is preferred
    if there isn't a reason to use goto.

M	utf8.c

commit b3d5da70d866b2de261e195f4d0b68fb34991e39
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:25:00 2016 -0600

    is_utf8_string_loc() param should not be NULL

    It makes no sense to call this function with a NULL parameter, as the
    whole point of using this function is to set what that param points to.
    If you don't want this, you should be using the similar function that
    doesn't have this parameter.

M	embed.fnc
M	proto.h

commit 8ecdd3b937d2529c7df2eb884d6617ba7b62152f
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:21:06 2016 -0600

    Document valid_utf8_to_uvchr() and inline it

    This function has been in several releases without problem, and is
    short enough that some compilers can inline it. This commit also notes
    that it is a pure function to the compiler, and that the result should
    not be ignored.

M	embed.fnc
M	embed.h
M	inline.h
M	proto.h
M	utf8.c

commit e157288025538029bb90e34f66c737d7f1b03007
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 10:48:55 2016 -0600

    utf8.c: Clarify comments for valid_utf8_to_uvchr()

M	utf8.c

commit 78e0d689d5a60cb47b08cd58d00975b3162c9059
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 10:59:48 2016 -0600

    utf8.c: Join EBCDIC/non-EBCDIC code

    This was missed in 534752c1d25d7c52c702337927c37e40c4df103d

M	utf8.c
-----------------------------------------------------------------------

--
Perl5 Master Repository