In perl.git, the branch smoke-me/khw-encode has been created

<http://perl5.git.perl.org/perl.git/commitdiff/655eab69b102d09f7e6643cbb78877097b23d994?hp=0000000000000000000000000000000000000000>

        at  655eab69b102d09f7e6643cbb78877097b23d994 (commit)

- Log -----------------------------------------------------------------
commit 655eab69b102d09f7e6643cbb78877097b23d994
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Aug 20 15:16:06 2016 -0600

    Speed up Encode UTF-8 validation checking
    
    This replaces the current scheme for checking UTF-8 validity by one
    in which normal processing doesn't require having to decode the UTF-8
    into code points.  The copying of characters individually from the input
    to the output is changed to be a single operation for each entire span
    of valid input at once.
    
    Thus in the normal case, what ends up happening is a tight loop to
    check the validity, and then a memmove of the entire input to the
    output, then return.
    
    If an error is found, it copies all the valid input before the error,
    then handles the character in error, then positions to the next input
    position, and repeats the whole process starting from there.
    
    It uses the functionality available from the Perl 5 core to look at
    just the bytes that comprise the UTF-8 to make the determination,
    converting to code points only those that are defective somehow, in
    order to display them in warnings and error messages.
    
    Thus, this does not need to know about the intricacies of UTF-8
    malformations, relying on the core to handle this.
    
    This cannot be pushed to CPAN until Devel::PPPort has been updated to
    implement all the functions now needed.

M       cpan/Encode/Encode.pm
M       cpan/Encode/Encode.xs
M       inline.h
M       t/porting/customized.dat
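
The scheme described in the commit message above (scan a span for validity, copy the whole span in one operation, handle the bad character, resume) can be sketched roughly as follows. This is only an illustration, not the code in Encode.xs: the names char_len() and copy_valid_spans() are hypothetical, the validity rules are those of strict Unicode UTF-8 on an ASCII platform, and substituting U+FFFD while skipping a single byte is an assumed error policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the well-formed UTF-8 character at s
 * (never reading at or past e), or 0 if the bytes are malformed.  Only
 * the bytes are examined; no code point is ever computed. */
static size_t char_len(const unsigned char *s, const unsigned char *e)
{
    size_t len, i;
    unsigned char lo = 0x80, hi = 0xBF;  /* legal range of the 2nd byte */

    if (s >= e) return 0;
    if (s[0] < 0x80) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* anything lower is overlong */
        if (s[0] == 0xED) hi = 0x9F;     /* anything higher is a surrogate */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* anything lower is overlong */
        if (s[0] == 0xF4) hi = 0x8F;     /* anything higher is above U+10FFFF */
    } else {
        return 0;                        /* stray continuation or bad start */
    }
    if ((size_t)(e - s) < len) return 0; /* truncated */
    if (s[1] < lo || s[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Copy src to dst one valid span at a time.  Each offending byte is
 * replaced by U+FFFD (an assumed policy).  dst needs room for 3 * len
 * bytes in the worst case.  Returns the number of bytes written. */
static size_t copy_valid_spans(unsigned char *dst,
                               const unsigned char *src, size_t len)
{
    static const unsigned char repl[] = { 0xEF, 0xBF, 0xBD };
    const unsigned char *s = src, *e = src + len;
    unsigned char *d = dst;

    assert(dst != NULL && src != NULL);
    while (s < e) {
        const unsigned char *span = s;
        size_t n;

        while ((n = char_len(s, e)) > 0) /* tight validity loop */
            s += n;
        memcpy(d, span, (size_t)(s - span)); /* one copy per valid span */
        d += s - span;
        if (s < e) {                     /* stopped on an error */
            memcpy(d, repl, sizeof repl);
            d += sizeof repl;
            s++;                         /* resume after the bad byte */
        }
    }
    return (size_t)(d - dst);
}
```

With entirely valid input the outer loop runs once: one validity scan, one memcpy, then return.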

commit 9ccc3ecd1119ccdb64e91b1f03376916aa8cc6f7
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 22:13:38 2016 -0600

    XXX Experimental: Unroll loop in valid_utf8_to_uvchr
    
    Doing something like this didn't speed things up before, but now that
    the function is inline, it could.  Needs to be tested.

M       inline.h

commit b65e9a52d8b428146ee554d724b9274f8e77286c
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 22:11:49 2016 -0600

    XXX Experimental: Check validity and bypass in utf8n_to_uvchr
    
    This may speed this up, but performance needs to be checked.  A flag can
    be created for when the input is known to be malformed, so that the
    validity check can be skipped.

M       utf8.c

commit b19206d64b88d47c6e4dd294a15063d28bf8e7bf
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 22:04:16 2016 -0600

    Use new is_utf8_valid_partial_char()
    
    This new function can be used in the implementation of the file test
    operators, -B and -T, to see if the whole fixed length buffer is valid
    UTF-8.  Previously, if all bytes were UTF-8 except the bytes at the end
    that could have been a partial character, it assumed the whole thing was
    UTF-8.  This improves the prediction slightly.

M       pp_sys.c

commit 152374e094dd110f6d04e9a9214b406036ce249b
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Aug 28 10:54:13 2016 -0600

    Add is_utf8_valid_partial_char()
    
    This new function can test some purported UTF-8 to see if it is
    well-formed as far as it goes.  That is, there aren't enough bytes to
    complete the character they start, but what is there is legal so far.  This can
    be useful in a fixed width buffer, where the final character is split in
    the middle, and we want to test without waiting for the next read that
    the entire buffer is valid.

M       embed.fnc
M       embed.h
M       inline.h
M       proto.h
M       utf8.c
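
To illustrate what "well-formed as far as it goes" means in the commit above, here is a rough sketch of such a test. It is not the core's implementation: partial_char_ok() is a hypothetical name, and the rules are those of strict Unicode UTF-8 on an ASCII platform, whereas the core also has to cope with Perl's extended UTF-8 and EBCDIC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if [s, e) is a proper prefix of some well-formed UTF-8 character:
 * at least one byte is present, fewer bytes than the start byte calls
 * for, and every byte that is present is legal in its position. */
static bool partial_char_ok(const unsigned char *s, const unsigned char *e)
{
    size_t need, have, i;
    unsigned char lo = 0x80, hi = 0xBF;  /* legal range of the 2nd byte */

    assert(s <= e);
    have = (size_t)(e - s);
    if (have == 0) return false;

    if (s[0] >= 0xC2 && s[0] <= 0xDF)      need = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) need = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) need = 4;
    else return false;  /* ASCII is already complete; the rest can't start */

    if (have >= need) return false;      /* complete, so not partial */

    if (s[0] == 0xE0) lo = 0xA0;         /* exclude overlongs */
    if (s[0] == 0xED) hi = 0x9F;         /* exclude surrogates */
    if (s[0] == 0xF0) lo = 0x90;         /* exclude overlongs */
    if (s[0] == 0xF4) hi = 0x8F;         /* exclude above U+10FFFF */

    if (have >= 2 && (s[1] < lo || s[1] > hi)) return false;
    for (i = 2; i < have; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return false;
    return true;
}
```

A caller reading fixed-size blocks could validate everything up to the last start byte as complete characters, then apply a test like this to the remaining tail.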

commit 08beb688995f6bdc35c45ddfb4782f13abbbf8f4
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Aug 27 21:17:49 2016 -0600

    XXX (): Add C macros for UTF-8 for BOM and REPLACEMENT CHARACTER
    
    This makes it easy for module authors to write XS code that can use
    these characters, and be automatically portable to EBCDIC systems.

M       regen/unicode_constants.pl
M       unicode_constants.h

commit 2065eaf909fa52c31e7e07a18dda5310abfb5ea1
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Aug 27 20:08:52 2016 -0600

    Make 3 UTF-8 macros API
    
    These may be useful to various module writers.  They certainly are
    useful for Encode.  This makes public API macros to determine if the
    input UTF-8 represents (one macro for each category)
        a) a surrogate code point
        b) a non-character code point
        c) a code point that is above Unicode's legal maximum.
    
    The macros are machine generated.  In making them public, I am now using
    the string end location parameter to guard against running off the end
    of the input.  Previously this parameter was ignored, as their use in
    the core could be tightly controlled so that we already knew that the
    string was long enough when calling these macros.  But this can't be
    guaranteed in the public API.  An optimizing compiler should be able to
    remove redundant length checks.

M       regcharclass.h
M       regen/regcharclass.pl
M       utf8.h
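
The three categories in the commit above can each be recognized from the bytes alone, with the end pointer guarding every read. The sketch below is hand-written for ASCII platforms and uses hypothetical function names; the real macros are machine generated and also cover EBCDIC, so treat this only as an illustration of the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Surrogates U+D800..U+DFFF are encoded as ED A0..BF 80..BF. */
static bool utf8_is_surrogate(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);
    return e - s >= 3 && s[0] == 0xED
        && s[1] >= 0xA0 && s[1] <= 0xBF
        && s[2] >= 0x80 && s[2] <= 0xBF;
}

/* Above U+10FFFF: start byte F4 followed by 90..BF, or any higher start
 * byte.  Assumes the input is otherwise syntactically well-formed. */
static bool utf8_is_above_unicode(const unsigned char *s,
                                  const unsigned char *e)
{
    assert(s <= e);
    if (e - s < 1) return false;
    if (s[0] >= 0xF5) return true;
    return e - s >= 2 && s[0] == 0xF4 && s[1] >= 0x90 && s[1] <= 0xBF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of each
 * of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF). */
static bool utf8_is_nonchar(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);
    if (e - s >= 3 && s[0] == 0xEF) {
        if (s[1] == 0xB7) return s[2] >= 0x90 && s[2] <= 0xAF;
        if (s[1] == 0xBF) return s[2] == 0xBE || s[2] == 0xBF;
        return false;
    }
    if (e - s >= 4 && s[0] >= 0xF0 && s[0] <= 0xF4) {
        /* Restrict to planes 1..16 (no overlongs, nothing above Unicode) */
        bool plane_ok = s[0] == 0xF0 ? s[1] >= 0x90
                      : s[0] == 0xF4 ? s[1] <= 0x8F
                      : true;
        /* Second byte is a continuation whose low four bits are all set */
        return plane_ok && (s[1] & 0xCF) == 0x8F
            && s[2] == 0xBF && (s[3] == 0xBE || s[3] == 0xBF);
    }
    return false;
}
```

Each function checks the available length before it touches a byte, which is the guard the commit message says became necessary once the callers could no longer be controlled.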

commit cd9e51ea74aa33f0916bbf283a3d2bcfff40a7b6
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:47:32 2016 -0600

    utf8.c: Add comments

M       utf8.c

commit c93869382792418a240c00b2dca8e46b374f9b00
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:53:00 2016 -0600

    is_utf8_string() is now a pure function
    
    as of the previous commit

M       embed.fnc
M       inline.h
M       proto.h

commit bdcb4496f8693b74dbca6984049997190125eb1f
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:29:54 2016 -0600

    Move isUTF8_CHAR helper function, and reimplement it
    
    The macro isUTF8_CHAR calls a helper function for code points higher
    than it can handle.  That function had been an inlined wrapper around
    utf8n_to_uvchr().
    
    The function has been rewritten to not call utf8n_to_uvchr(), so it is
    now too big to be effectively inlined.  Instead, it implements a faster
    method of checking the validity of the UTF-8 without having to decode
    it.  It just checks for valid syntax and now knows where the
    few discontinuities are in UTF-8 where overlongs can occur, and uses a
    string compare to verify that overflow won't occur.
    
    As a result this is now a pure function.
    
    This also means that a deprecation warning that was previously
    generated no longer is, because printing UTF-8 no longer requires
    converting it to internal form.  I could add a check for that, but I
    think it's best not to.  If you manipulated what is getting printed in
    any way, the deprecation message will already have been raised.
    
    This commit also fleshes out the documentation of isUTF8_CHAR.

M       embed.fnc
M       embed.h
M       inline.h
M       proto.h
M       t/lib/warnings/utf8
M       utf8.c
M       utf8.h
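
The string-compare idea mentioned in the commit above rests on a property of UTF-8: among well-formed sequences, byte-wise order matches code point order. A limit check can therefore be done with memcmp() against the UTF-8 of the largest acceptable code point, without decoding anything. The sketch below uses U+10FFFF as a stand-in limit. The core's actual limit is the highest code point the platform can represent, and the names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 of the stand-in limit U+10FFFF (an assumption for this sketch) */
static const unsigned char max_cp_utf8[] = { 0xF4, 0x8F, 0xBF, 0xBF };

/* True if the sequence [s, s + len) encodes a code point above the limit.
 * Assumes the sequence is syntactically well-formed and not overlong, so
 * that a longer sequence always means a larger code point. */
static bool exceeds_max(const unsigned char *s, size_t len)
{
    assert(s != NULL && len > 0);
    if (len != sizeof max_cp_utf8)
        return len > sizeof max_cp_utf8;
    return memcmp(s, max_cp_utf8, len) > 0;  /* byte order == cp order */
}
```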

commit 33de6cc8a77534112d767db5cba089e95bd2cd33
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:23:24 2016 -0600

    Add #defines for UTF-8 of highest representable code point
    
    This will allow the next commit to not have to actually try to decode
    the UTF-8 string in order to see if it overflows the platform.

M       utf8.h
M       utfebcdic.h

commit fbfd68864e6774db2839402550b69fb3045d981c
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:21:25 2016 -0600

    utf8.h: Add some LIKELY() to help branch prediction
    
    This macro gives the legal UTF-8 byte sequences.  Almost always, the
    input will be legal, so help compiler branch prediction for that.

M       utf8.h
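
For readers unfamiliar with this kind of hint, here is a small sketch of how a likely-branch macro is commonly built and used. SKETCH_LIKELY and all_continuations() are hypothetical names, and the definition shown (GCC/Clang __builtin_expect with a plain fallback) is a common pattern rather than a copy of the core's definition.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_LIKELY(cond) __builtin_expect(!!(cond), 1)
#else
#  define SKETCH_LIKELY(cond) (cond)      /* no hint available */
#endif

/* Return 1 if each of the n bytes at s is a UTF-8 continuation byte
 * (10xxxxxx), else 0.  The legal case is marked as the expected one. */
static int all_continuations(const unsigned char *s, size_t n)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < n; i++) {
        if (SKETCH_LIKELY((s[i] & 0xC0) == 0x80))
            continue;
        return 0;
    }
    return 1;
}
```

The hint changes only how the compiler lays out the branches; the result is the same with or without it.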

commit 14adab0723d1d634acbe8f1b1b829aafba429d41
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 16:07:22 2016 -0600

    utf8.h, utfebcdic.h: Add comments, align white space

M       utf8.h
M       utfebcdic.h

commit 4b130cec628f36cbaeafa50e069b2c5383b51dff
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 15:53:36 2016 -0600

    Inline is_utf8_string() and is_utf8_stringloclen()

M       embed.fnc
M       inline.h
M       proto.h
M       utf8.c

commit d9eaab0c5967e85253b7a51bba6692c238d673ec
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 15:03:52 2016 -0600

    Inline utf8_distance(), utf8_hop()

M       embed.fnc
M       inline.h
M       proto.h
M       utf8.c

commit a134bfd292800799138c8ddaaa12b164afcd0ae7
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 14:47:17 2016 -0600

    Slightly simplify utf8_to_uvuni_buf()
    
    Use a function that does the same thing.  This also clarifies a related
    comment.

M       utf8.c

commit 4be9d407bb57c3d17b4897a2b3a942df3822b5eb
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 14:07:50 2016 -0600

    Inline is_utf8_invariant_string()

M       embed.fnc
M       embed.h
M       inline.h
M       proto.h
M       utf8.c

commit cb1cd48a191bfc7229fa109b53f731035a714c09
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:54:51 2016 -0600

    is_utf8_invariant_string is pure
    
    As are its synonyms

M       embed.fnc
M       proto.h

commit da1ae61258a0dc53b4004a675d902799ad86ebe0
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:52:52 2016 -0600

    Simplify slightly is_utf8_invariant_string
    
    This eliminates an unnecessary branch test in unoptimized code.

M       utf8.c

commit b3cbb73593586a3e4ff1baa398af5e7a3f3b1f0f
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:42:53 2016 -0600

    Use new name 'is_utf8_invariant_string' in core
    
    This changes the places in the core to use the clearer synonym added by
    the previous commit.  It also changes one place that hand-rolled its own
    code to use this function instead.

M       ext/POSIX/POSIX.xs
M       ext/POSIX/lib/POSIX.pm
M       locale.c
M       mg.c
M       pp_sys.c
M       sv.c
M       toke.c

commit 37a6dc0719a5d27e3751b90faaebba494c081696
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 13:35:28 2016 -0600

    Add new synonym 'is_utf8_invariant_string'
    
    This is clearer as to its meaning than the existing 'is_ascii_string'
    and 'is_invariant_string', which are retained for back compat.

M       embed.fnc
M       embed.h
M       proto.h
M       utf8.c
M       utf8.h
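
As a reminder of what "invariant" means here: a string is UTF-8 invariant when its bytes are the same whether or not it is encoded as UTF-8. On ASCII platforms that amounts to every byte being below 0x80, which the sketch below tests. The name invariant_string() is hypothetical, and the check shown does not apply to EBCDIC platforms, where the invariant set is different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if none of the len bytes at s has its high bit set
 * (ASCII platforms only). */
static bool invariant_string(const unsigned char *s, size_t len)
{
    const unsigned char *e;

    assert(s != NULL || len == 0);
    if (len == 0) return true;           /* empty string is invariant */
    e = s + len;
    while (s < e)
        if (*s++ & 0x80) return false;
    return true;
}
```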

commit 63e9be9ce8ec0e8644fa5145e4e6887fbb7b1270
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Aug 23 13:37:10 2016 -0600

    embed.fnc: Replace blanks by tabs
    
    In this file, tabs are the more accepted field delimiter, and having
    them makes it easier to search for particular patterns in it.

M       embed.fnc

commit a195aa6ac0d9a343e9b69504e11e37ef7c8bbf48
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:28:21 2016 -0600

    utf8.c: Use 'break' instead of 'goto'
    
    The goto is a relic of a previous implementation; 'break' is preferred
    if there isn't a reason to use goto.

M       utf8.c

commit 4d5766d6736dd54a3e3d4493a1f507808e4130c7
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:25:00 2016 -0600

    is_utf8_string_loc() param should not be NULL
    
    It makes no sense to call this function with a NULL parameter, as the
    whole point of using this function is to set what that param points to.
    If you don't want this, you should be using the similar function that
    doesn't have this parameter.

M       embed.fnc
M       proto.h

commit 371633dbdff825b4c8b35a20fc414481b076d942
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 12:21:06 2016 -0600

    Document valid_utf8_to_uvchr() and inline it
    
    This function has been in several releases without problem, and is short
    enough that some compilers can inline it.  This commit also notes that
    the result should not be ignored, and removes the unused pTHX.  The
    function has explicitly been marked as being changeable, and has not been
    part of the API until now.

M       embed.fnc
M       embed.h
M       inline.h
M       proto.h
M       utf8.c

commit a448d6b39a53bdcd78b069176a2adc3c1d2c10f5
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 10:48:55 2016 -0600

    utf8.c: Clarify comments for valid_utf8_to_uvchr()

M       utf8.c

commit 217c687a55991a8de07803144a7f210dd90f73e4
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Aug 22 10:59:48 2016 -0600

    utf8.c: Join EBCDIC/non-EBCDIC code
    
    This was missed in 534752c1d25d7c52c702337927c37e40c4df103d

M       utf8.c

commit 10e7e97296063e39ab24a5e9affb168b5e9d3317
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Aug 26 15:25:20 2016 -0600

    regen/embed.pl: Allow inline funcs to be named Perl_foo
    
    When inlining an existing public function whose name begins with Perl_,
    it's best to keep that name, in case someone is calling it that way.
    Prior to this commit, the name had to be changed to S_foo.

M       embed.fnc
M       regen/embed.pl
-----------------------------------------------------------------------

--
Perl5 Master Repository
