In perl.git, the branch smoke-me/khw-new_locale has been created

<http://perl5.git.perl.org/perl.git/commitdiff/afbd0374e5467934a57030136eb9a3132798cb54?hp=0000000000000000000000000000000000000000>

        at  afbd0374e5467934a57030136eb9a3132798cb54 (commit)

- Log -----------------------------------------------------------------
commit afbd0374e5467934a57030136eb9a3132798cb54
Author: Karl Williamson <k...@cpan.org>
Date:   Thu May 19 20:03:06 2016 -0600

    temp debug

M       locale.c

commit 13cd9f3403c64ca05427bff7e0bf643b0c014009
Author: Karl Williamson <k...@cpan.org>
Date:   Fri May 13 11:32:44 2016 -0600

    locale.c: Make locale collation predictions adaptive
    
    We try to avoid calling strxfrm() more than needed by predicting its
    needed buffer size.  This generally works because the size of the
    transformed string is roughly linear with the size of the input string.
    But the key word here is "roughly".  This commit changes things, so that
    when we guess low, we change the coefficients in the equation to guess
    higher the next time.

M       locale.c
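
The adaptive behaviour described in the commit above might look roughly like the following sketch. The struct and function names are hypothetical and are not taken from locale.c; the sketch only assumes that the prediction is a linear function of the input length and that a low guess raises the slope.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the linear predictor; not the names in locale.c. */
typedef struct {
    double m;   /* predicted output bytes per input byte */
    size_t b;   /* constant overhead */
} xfrm_predictor;

/* Predicted buffer size for an input of input_len bytes. */
static size_t predict_size(const xfrm_predictor *p, size_t input_len)
{
    /* +1 leaves room for the terminating NUL and rounding down. */
    return (size_t)(p->m * (double)input_len) + p->b + 1;
}

/* Called after a guess turned out too low; 'needed' is the size that
 * actually worked.  Raise the coefficients so the same input would fit. */
static void adapt_after_low_guess(xfrm_predictor *p, size_t input_len,
                                  size_t needed)
{
    assert(p != NULL);
    if (input_len == 0) {
        if (needed > p->b)
            p->b = needed;      /* only the intercept is involved */
        return;
    }
    if (needed > p->b) {
        double required_m = (double)(needed - p->b) / (double)input_len;
        if (required_m > p->m)
            p->m = required_m;  /* never lower the slope, only raise it */
    }
}
```

The coefficients only ever grow in this sketch, so a single unusually expensive string makes later guesses more generous rather than oscillating.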

commit 3778a9185f5bde6077feab14f748a8b1323cece0
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 14:28:57 2016 -0600

    locale.c: Not so aggressive collation memory use guess
    
    On platforms where strxfrm() is not well-behaved, and it fails because
    it needs a larger buffer, prior to this commit, the size was doubled
    before trying again.  This could require a lot of memory on large
    inputs.  I'm uncomfortable with such a big delta on very large strings.
    This commit changes it so it is not so aggressive.  Note that this now
    only gets called on platforms whose strxfrm() is not well behaved, and I
    think the size prediction is better due to a recent commit, and there
    isn't really much of a downside in not gobbling up memory so fast.

M       locale.c

commit f7f17744dba2e167364f09739d164c7f62a75b56
Author: Karl Williamson <k...@cpan.org>
Date:   Wed May 18 13:18:01 2016 -0600

    locale.c: Add some debugging statements

M       locale.c

commit e0c750a24be340515c407dd1ffb90adc8e16affa
Author: Karl Williamson <k...@cpan.org>
Date:   Wed May 18 13:17:25 2016 -0600

    locale.c: Minor cleanup
    
    This replaces an expression with what I think is an easier to understand
    macro, and eliminates a couple of temporary variables that just
    cluttered things up.

M       locale.c

commit b5e170845830f04f818a72e04a6ba2e8ac8fd290
Author: Karl Williamson <k...@cpan.org>
Date:   Sat May 14 18:23:02 2016 -0600

    locale.c: Fix some debugging so will output during init
    
    Because the command line options are currently parsed after the locale
    initialization is done, an environment variable is read to allow
    debugging of the function that is called to do the initialization.
    However, any functions that it calls, prior to this commit, were unaware
    of this and so did not output debugging.  This commit fixes most of
    them.

M       locale.c

commit ea4b52b79529272e1817230f1c46046680cad665
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 12:49:36 2016 -0600

    mv function from locale.c to mathoms.c
    
    The previous commit turned the function being moved here into just a
    wrapper that is not called in core.  In case someone is calling it, it
    is retained, but moved to mathoms.c.

M       embed.fnc
M       embed.h
M       locale.c
M       mathoms.c
M       proto.h

commit 0f664b45741ec7ece2f83a3fb9e7e6666d9ba446
Author: Karl Williamson <k...@cpan.org>
Date:   Tue May 17 20:50:55 2016 -0600

    Do better locale collation in UTF-8 locales
    
    strxfrm() works reasonably well on some platforms under UTF-8 locales.
    It will assume that every string passed to it is in UTF-8.  This commit
    changes perl to make sure that strxfrm's expectations are met.
    
    Likewise, under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
    string, and this commit makes sure of that.  If the passed string
    contains code points representable only in UTF-8, they are changed into
    the highest collating code point that doesn't require UTF-8.  This
    provides seamless operation, as they end up collating after every
    non-UTF-8 code point.  If two transformed strings compare equal, perl
    already uses the un-transformed versions to break ties, and there, these
    faked-up strings will collate after everything else, and in code point
    order amongst themselves.

M       embed.fnc
M       embed.h
M       embedvar.h
M       intrpvar.h
M       lib/locale.t
M       locale.c
M       pod/perldelta.pod
M       pod/perllocale.pod
M       proto.h
M       sv.c
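
As an illustration of the non-UTF-8 case described in the commit above, here is a minimal sketch. It assumes the input is available as an array of code points and that the caller has already determined the locale's highest-collating single byte; the function name and the `highest` parameter are hypothetical, not what locale.c uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write n code points as single bytes into dst (which must hold n + 1
 * bytes).  Any code point above 0xFF cannot be represented without UTF-8,
 * so it is replaced by 'highest', assumed to be the highest-collating
 * single-byte character in the current locale. */
static void squash_above_latin1(char *dst, const uint32_t *cps, size_t n,
                                unsigned char highest)
{
    assert(dst != NULL && cps != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (char)(cps[i] > 0xFF ? highest : (unsigned char)cps[i]);
    dst[n] = '\0';
}
```

The squashed string is only what gets handed to strxfrm(); the original string stays available for breaking ties, as the commit message notes.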

commit 98c0717af374827c0c1f6b48a9345af3a1388945
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 13:51:48 2016 -0600

    perllocale: Change headings so two aren't identical
    
    Two HTML anchors in this pod were identical, which isn't a problem
    unless you try to link to one of them, as the next commit does.

M       pod/perllocale.pod

commit 865ce47b5b3f15d73f16f00e8cadb72e2a41fc37
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 11:21:40 2016 -0600

    Change calculation of locale collation coefficients
    
    Every time a new collation locale is set, two coefficients are calculated
    that are used in predicting how much space is needed in the
    transformation of a string by strxfrm().  The transformed string is
    roughly linear with the length of the input string, so we are
    calculating 'm' and 'b' such that
    
        transformed_length = m * input_length + b
    
    Space is allocated based on this prediction.  If it is too small, the
    strxfrm() will fail, and we will have to increase the allotted amount
    and try again.  It's better to get the prediction right to avoid
    multiple, expensive strxfrm() calls.
    
    Prior to this commit, the calculation was not rigorous, and failed on
    some platforms that don't have a fully conforming strxfrm().
    
    This commit changes perl to not panic if a locale has apparently
    defective collation, but instead to silently switch to C-locale
    collation.  It could be argued that a warning should additionally be
    raised.
    
    This commit fixes [perl #121734].

M       locale.c
M       pod/perldelta.pod
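
One way to obtain 'm' and 'b' for the equation in the commit above is to measure two sample strings and fit a line through the results. The sketch below is not the calculation locale.c performs; the function name and the choice of samples are assumptions, and it relies on a conforming strxfrm() that reports the needed length when given a zero-sized buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fit transformed_length = m * input_length + b through two measurements.
 * Returns 0 on success, -1 if the results look unusable, in which case a
 * caller could fall back to C-locale collation. */
static int fit_xfrm_line(const char *shorter, const char *longer,
                         double *m, double *b)
{
    assert(shorter && longer && m && b);
    size_t x1 = strlen(shorter);
    size_t x2 = strlen(longer);
    if (x2 <= x1)
        return -1;

    /* With a size of zero, strxfrm() writes nothing and only returns the
     * length the transformed string would have. */
    size_t y1 = strxfrm(NULL, shorter, 0);
    size_t y2 = strxfrm(NULL, longer, 0);
    if (y2 < y1)
        return -1;  /* a longer input gave a shorter output: give up */

    *m = (double)(y2 - y1) / (double)(x2 - x1);
    *b = (double)y1 - *m * (double)x1;
    return 0;
}
```

For example, calling `fit_xfrm_line("a", "aaaaa", &m, &b)` measures a one-character and a five-character sample under whatever LC_COLLATE locale is currently in effect.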

commit 3dc247eea6bdd7faf3324dac8fdc3a153ccaef83
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Apr 11 19:11:07 2016 -0600

    locale.c: Change algorithm for strxfrm() trials
    
    It's kind of guesswork deciding how big a buffer to give to strxfrm().
    If you give it too small a one, it will fail.  Prior to this commit, the
    buffer size was doubled and then strxfrm() was called again, looping
    until it worked, or we used too much memory.
    
    Each time a new locale is made, we try to minimize the necessity of
    doing this by calculating numbers 'm' and 'b' that can be plugged into
    the equation
    
        mx + b
    
    where 'x' is the size of the string passed to strxfrm().  strxfrm() is
    roughly linear with respect to its input's length, so this generally
    works without us having to do many loops to get a large enough size.
    
    But on many systems, strxfrm(), in failing, returns how much space you
    should have given it.  On such systems, we can just use that number on
    the 2nd try and not have to keep guessing.  This commit changes to do
    that.
    
    But on other systems this doesn't work.  So the original method is
    retained if we determine that there are problems with strxfrm(), either
    from previous experience, or because using the size returned from the
    first trial didn't work.

M       embedvar.h
M       intrpvar.h
M       locale.c
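
The retry strategy described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the code in locale.c: the function name, the `cap` parameter, and the single `trusted` flag are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Transform src into a freshly allocated buffer, starting from a guessed
 * size.  Returns NULL on allocation failure or if 'cap' bytes would be
 * exceeded.  The caller frees the result. */
static char *xfrm_with_retry(const char *src, size_t guess, size_t cap)
{
    assert(src != NULL);
    size_t size = guess ? guess : 1;
    int trusted = 1;    /* does strxfrm()'s return value look reliable? */
    char *buf = NULL;

    while (size <= cap) {
        char *tmp = realloc(buf, size);
        if (tmp == NULL)
            break;
        buf = tmp;

        size_t needed = strxfrm(buf, src, size);
        if (needed < size)
            return buf;         /* it fit, terminating NUL included */

        if (trusted) {
            size = needed + 1;  /* 2nd try: use the size it asked for */
            trusted = 0;        /* if that fails too, stop believing it */
        } else {
            size *= 2;          /* original method: keep doubling */
        }
    }
    free(buf);
    return NULL;
}
```

On a well-behaved strxfrm() the loop runs at most twice; the doubling branch is only reached when the reported size turns out to be wrong.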

commit 811bfb90f56d922c9378389ec40c45d7fb1e8e02
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Apr 9 20:40:48 2016 -0600

    locale.c: Free over-allocated space early
    
    We may over malloc some space in buffers to strxfrm().  This frees it
    now instead of waiting for the whole block to be freed sometime later.
    This can be a significant amount of memory if the input string to
    strxfrm() is long.

M       locale.c
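
Giving back the unused tail of an over-allocated buffer, as the commit above describes, could be done along these lines. The helper name is hypothetical, and the sketch assumes the transformed result is a plain NUL-terminated string at the start of the buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Shrink an over-allocated buffer to just what its NUL-terminated
 * contents use.  Returns the (possibly moved) buffer. */
static char *trim_to_fit(char *buf)
{
    assert(buf != NULL);
    size_t used = strlen(buf) + 1;
    char *trimmed = realloc(buf, used);
    /* If realloc() fails, the original block is untouched and still valid. */
    return trimmed ? trimmed : buf;
}
```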

commit 95e2db4d7bdb04167c0b816519b0e716c8c6c43b
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Apr 9 20:36:01 2016 -0600

    locale.c:  White-space only
    
    Outdent and reflow because the previous commit removed an enclosing
    block.

M       locale.c

commit e5bd1c7aadf571974fd3922b79568f73f622f106
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Apr 9 15:52:05 2016 -0600

    XXX pod, left in generality Change mem_collxfrm() algorithm for embedded NULs
    
    One of the problems in implementing Perl is that the C library routines
    forbid embedded NUL characters, which Perl accepts.  This is true for
    the case of strxfrm() which handles collation under locale.
    
    The best solution as far as functionality goes, would be for Perl to
    write its own strxfrm replacement which would handle the specific needs
    of Perl.  But that is not going to happen because of the huge complexity
    in handling it across many platforms.  We would have to know the
    location and format of the locale definition files for every such
    platform.  Some might follow POSIX guidelines, some might not.
    
    strxfrm creates a transformation of its input into a new string
    consisting of weight bytes.  In the typical but general case, a 3
    character NUL-terminated input string 'A B C 00' (spaces added for
    readability) gets transformed into something like:
        A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
    where the superscripted characters are weights for the corresponding
    input characters.  Superscript 1 represents the primary sorting key; 2,
    the secondary, etc, for as many levels as the locale definition gives.
    The 01 byte is likely to be the separator between levels, but not
    necessarily, and there could be some other mechanisms used on various
    platforms.
    
    To handle embedded NULs, the simplest thing would be to just remove them
    before passing in to strxfrm().  Then they would be entirely ignored,
    which might not be what you want.  You might want them to have some
    weight at the tertiary level, for example.  It also causes problems
    because strxfrm is very context sensitive.  The locale definition can
    define weights for specific sequences of any length (and the weights can
    be multi-byte), and by removing a NUL, two characters now become
    adjacent that weren't in the input, and they could now form one of those
    special sequences and thus throw things off.
    
    Another way to handle NULs, that seemingly ignores them, but actually
    doesn't, is the mechanism in use prior to this commit.  The input string
    is split at the NULs, and the substrings are independently passed to
    strxfrm, and the results concatenated together.  This doesn't work
    either.  In our example 'A B C 00', suppose B is a NUL, and should have
    some weight at the tertiary level.  What we want is:
        A¹ C¹ 01 A² C² 01 A³ B³ C³ 00
    
    But that's not at all what you get.  Instead it is:
        A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
    The primary weight of C comes immediately after the tertiary weight of A,
    but more importantly, a NUL, instead of being ignored at the primary
    levels, is significant at all levels, so that "a\0c" would sort before
    "ab".
    
    Still another possibility is to replace the NUL with some other
    character before passing it to strxfrm.  That was my original plan, to
    replace each NUL with the character that this code determines has the
    lowest collation order for the current locale.  On strings that don't
    contain that character, the results would be as good as it gets for that
    locale.  That character is likely to be ignored at higher weight levels,
    but have some small non-ignored weight at the lowest ones.  And
    hopefully the character would rarely be encountered in practice.  When
    it does happen, it and NUL would sort identically; hardly the end of the
    world.  If the entire strings sorted identically, the NUL-containing one
    would come out before the other one, since the original Perl strings are
    used as a tie breaker.  However, testing showed a problem with this.  If
    that other character is part of a sequence that has special weighting,
    the results won't be correct.  With gcc, U+00B4 ACUTE ACCENT is the
    lowest collating character in many UTF-8 locales.  It combines in
    Romanian and Vietnamese with some other characters to change weights,
    and hence changing NULs into it screws things up.
    
    What I have finally settled on is a modification of this final
    approach, where the possible NUL replacements are limited to just
    characters that are controls in the locale.  NULs are replaced by the
    lowest collating control.  It would really be a defective locale if this
    control combined with some other character to form a special sequence.
    Often the character will be a 01, START OF HEADING.  In the very
    unlikely case that there are absolutely no controls in the locale, 01 is
    used, because SOMETHING has to be.
    
    The code added by this commit is mostly utf8-ready.  A few commits from
    now will make Perl properly work with UTF-8 (if the platform supports
    it).  But until that time, this isn't a full implementation; it only
    looks for the lowest-sorting control that is invariant, where the
    UTF8ness doesn't matter.

M       embed.fnc
M       embedvar.h
M       intrpvar.h
M       lib/locale.t
M       locale.c
M       pod/perldelta.pod
M       pod/perllocale.pod
M       proto.h
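
The final approach in the commit above, replacing each NUL with the lowest-collating control character, might be sketched like this for single-byte locales. The function names are hypothetical and the search is a simplification; it is not the code added to locale.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Find the control character in 1..255 that collates lowest in the
 * current locale.  Falls back to 0x01 if the locale has no controls. */
static unsigned char lowest_collating_control(void)
{
    unsigned char best = 0;
    char best_s[2] = {0, 0};
    char cur_s[2] = {0, 0};

    for (int c = 1; c < 256; c++) {
        if (!iscntrl(c))
            continue;
        cur_s[0] = (char)c;
        if (best == 0 || strcoll(cur_s, best_s) < 0) {
            best = (unsigned char)c;
            best_s[0] = (char)c;
        }
    }
    return best ? best : 0x01;  /* SOMETHING has to be used */
}

/* Copy len bytes of src into dst (which must hold len + 1 bytes),
 * replacing every embedded NUL with 'repl'. */
static void replace_nuls(char *dst, const char *src, size_t len,
                         unsigned char repl)
{
    assert(dst != NULL && src != NULL && repl != 0);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ? src[i] : (char)repl;
    dst[len] = '\0';
}
```

The copy produced by `replace_nuls()` contains no embedded NULs, so it can be passed to strxfrm() as an ordinary C string.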

commit f43d6f7b36558c0818c7bcc3ef00b21704bc9363
Author: Karl Williamson <k...@cpan.org>
Date:   Tue May 17 21:53:53 2016 -0600

    locale.c: Add, move, clarify comments
    
    This moves a large block of comments to before a block, outdents it,
    adds to it, and adds another comment.

M       locale.c

commit 66b48b5d307f53df9ca8e631c6146864e12797c1
Author: Karl Williamson <k...@cpan.org>
Date:   Mon May 16 15:19:14 2016 -0600

    Keep track of if collation locale is UTF-8 or not
    
    This will be used in future commits

M       embedvar.h
M       intrpvar.h
M       locale.c
M       sv.c

commit 022c3cea46bc737580fc97e28ff875dc7628afc6
Author: Karl Williamson <k...@cpan.org>
Date:   Mon May 16 15:15:26 2016 -0600

    locale.c: Don't use special locale collation for C locale
    
    We can skip all the locale collation calculations if the locale we are
    in is C or POSIX.

M       locale.c
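
A check of the kind the commit above describes could be written as below. The helper names are hypothetical; the sketch assumes that the name reported by setlocale() for LC_COLLATE is enough to recognize the C and POSIX locales.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* True if the given locale name denotes the C (or POSIX) locale. */
static int is_c_or_posix(const char *name)
{
    assert(name != NULL);
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}

/* True if the current collation locale needs the strxfrm() machinery. */
static int collation_needs_strxfrm(void)
{
    /* Passing NULL queries the current setting without changing it. */
    const char *name = setlocale(LC_COLLATE, NULL);
    return name != NULL && !is_c_or_posix(name);
}
```

In the C locale, comparing bytes directly already gives the collation order, which is why the extra calculations can be skipped.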

commit 968e5483b78567e1dc71642588edad96c8614efe
Author: Karl Williamson <k...@cpan.org>
Date:   Fri May 13 11:51:55 2016 -0600

    lib/locale.t: Don't calculate value unless needed

M       lib/locale.t
-----------------------------------------------------------------------

--
Perl5 Master Repository
