In perl.git, the branch smoke-me/khw-new_locale has been created

<http://perl5.git.perl.org/perl.git/commitdiff/c810d3974b6a6214b2960c8f740a6cf5551a8ccb?hp=0000000000000000000000000000000000000000>

        at  c810d3974b6a6214b2960c8f740a6cf5551a8ccb (commit)

- Log -----------------------------------------------------------------
commit c810d3974b6a6214b2960c8f740a6cf5551a8ccb
Author: Karl Williamson <[email protected]>
Date:   Fri May 13 11:32:44 2016 -0600

    locale.c: Make locale collation predictions adaptive
    
    We try to avoid calling strxfrm() more than needed by predicting its
    needed buffer size.  This generally works because the size of the
    transformed string is roughly linear with the size of the input string.
    But the key word here is "roughly".  This commit changes things, so that
    when we guess low, we change the coefficients in the equation to guess
    higher the next time.
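
The adjustment described above might look roughly like the following C sketch. The struct and function names are hypothetical and are not taken from locale.c, and the policy shown (raise one coefficient just enough to cover the observed case) is an assumption; the commit message does not say how the coefficients are changed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two prediction coefficients. */
struct xfrm_coeffs {
    size_t m;   /* growth per input byte */
    size_t b;   /* fixed overhead, including the trailing NUL */
};

/* Called after a transformation whose true length (excluding the NUL) is
 * known.  If the prediction m * input_len + b would have been too small,
 * raise a coefficient just enough that the same input would now fit. */
static void
adapt_coeffs(struct xfrm_coeffs *k, size_t input_len, size_t actual_len)
{
    assert(k != NULL);

    size_t needed = actual_len + 1;               /* room for the NUL */
    size_t predicted = k->m * input_len + k->b;

    if (predicted >= needed)
        return;                                   /* the guess was fine */

    if (input_len == 0) {
        k->b = needed;                            /* only the offset can help */
    }
    else {
        /* predicted < needed implies k->b < needed, so no underflow */
        k->m = (needed - k->b + input_len - 1) / input_len;   /* ceiling */
    }
}
```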

M       locale.c

commit 43173b444135bb8ddb9f2ba2f9595e3c584c3234
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 14:28:57 2016 -0600

    locale.c: Not so aggressive collation memory use guess
    
    On platforms where strxfrm() is not well-behaved, and it fails because
    it needs a larger buffer, prior to this commit, the size was doubled
    before trying again.  This could require a lot of memory on large
    inputs.  I'm uncomfortable with such a big delta on very large strings.
    This commit changes it so it is not so aggressive.  Note that this now
    only gets called on platforms whose strxfrm() is not well behaved, and I
    think the size prediction is better due to a recent commit, and there
    isn't really much of a downside in not gobbling up memory so fast.

M       locale.c

commit 6fe21238d132628247ce7d434f846361201fe883
Author: Karl Williamson <[email protected]>
Date:   Wed May 18 13:18:01 2016 -0600

    locale.c: Add some debugging statements

M       locale.c

commit 687c16ecaee41111ce89e0e337818fe7c76b8160
Author: Karl Williamson <[email protected]>
Date:   Wed May 18 13:17:25 2016 -0600

    locale.c: Minor cleanup
    
    This replaces an expression with what I think is an easier to understand
    macro, and eliminates a couple of temporary variables that just
    cluttered things up.

M       locale.c

commit dc743c79a68324dde33e8c15fbcb8e8a610ebf34
Author: Karl Williamson <[email protected]>
Date:   Sat May 14 18:23:02 2016 -0600

    locale.c: Fix some debugging so will output during init
    
    Because the command line options are currently parsed after the locale
    initialization is done, an environment variable is read to allow
    debugging of the function that is called to do the initialization.
    However, any functions that it calls, prior to this commit, were unaware
    of this and so did not output debugging.  This commit fixes most of
    them.

M       locale.c

commit 103486e659d33e140d2f56699d6dfeb950ffba72
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 12:49:36 2016 -0600

    mv function from locale.c to mathoms.c
    
    The previous commit turned the function being moved here into just a
    wrapper that is not called in core.  In case someone is calling it, it
    is retained, but moved to mathoms.c.

M       embed.fnc
M       embed.h
M       locale.c
M       mathoms.c
M       proto.h

commit 42abc6a8811fd6ae19908f55f757bb2d645ef2b8
Author: Karl Williamson <[email protected]>
Date:   Tue May 17 20:50:55 2016 -0600

    Do better locale collation in UTF-8 locales
    
    strxfrm() works reasonably well on some platforms under UTF-8 locales.
    It will assume that every string passed to it is in UTF-8.  This commit
    changes perl to make sure that strxfrm's expectations are met.
    
    Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
    string.   And this commit makes sure of that.  If the passed string
    contains code points representable only in UTF-8, they are changed into
    the highest collating code point that doesn't require UTF-8.  This
    provides seamless operation, as they end up collating after every
    non-UTF-8 code point.  If two transformed strings compare equal, perl
    already uses the un-transformed versions to break ties, and there, these
    faked-up strings will collate after everything else, and in code point
    order amongst themselves.
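
As an illustration of the non-UTF-8 case, here is a minimal C sketch. It assumes the input is already available as an array of code points with no embedded NULs, and it uses strcoll() to find the highest-collating byte; both are simplifications, and the function names are hypothetical rather than those used in locale.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return the byte in 1..255 that collates highest in the current
 * LC_COLLATE locale, judged by strcoll() on one-byte strings. */
static unsigned char
highest_collating_byte(void)
{
    char best[2] = { 1, '\0' };

    for (int c = 2; c < 256; c++) {
        char cur[2] = { (char) c, '\0' };
        if (strcoll(cur, best) > 0)
            best[0] = (char) c;
    }
    return (unsigned char) best[0];
}

/* Build a single-byte string from 'count' code points, replacing every
 * code point above 0xFF with 'stand_in'.  Returns a malloc'd,
 * NUL-terminated buffer that the caller frees, or NULL on failure. */
static char *
narrow_for_collation(const uint32_t *cps, size_t count,
                     unsigned char stand_in)
{
    assert(cps != NULL && stand_in != 0);

    char *out = malloc(count + 1);
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++)
        out[i] = (char) (cps[i] > 0xFF ? stand_in : cps[i]);
    out[count] = '\0';
    return out;
}
```

The narrowed string can then be handed to strxfrm(); ties between transformed strings would still be broken on the original, un-narrowed strings, as the message describes.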

M       embed.fnc
M       embed.h
M       embedvar.h
M       intrpvar.h
M       lib/locale.t
M       locale.c
M       pod/perldelta.pod
M       pod/perllocale.pod
M       proto.h
M       sv.c

commit 96da17d1204b9a64d4dcec6b3d0c28e2a11152cc
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 13:51:48 2016 -0600

    perllocale: Change headings so two aren't identical
    
    Two HTML anchors in this pod were identical, which isn't a problem
    unless you try to link to one of them, as the next commit does.

M       pod/perllocale.pod

commit 46c91e86354b37faa49ed139b623c19ae188db6c
Author: Karl Williamson <[email protected]>
Date:   Tue Apr 12 11:21:40 2016 -0600

    Change calculation of locale collation coefficients
    
    Every time a new collation locale is set, two coefficients are calculated
    that are used in predicting how much space is needed in the
    transformation of a string by strxfrm().  The length of the transformed
    string is roughly linear with the length of the input string, so we are
    calculating 'm' and 'b' such that
    
        transformed_length = m * input_length + b
    
    Space is allocated based on this prediction.  If it is too small, the
    strxfrm() will fail, and we will have to increase the allotted amount
    and try again.  It's better to get the prediction right to avoid
    multiple, expensive strxfrm() calls.
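
One way to derive such coefficients is to measure the transformed length of two sample strings of different lengths and fit a line through them. The C sketch below does that; the sample lengths, the rounding, and all names are assumptions made for illustration, not the calculation used in locale.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct xfrm_coeffs { size_t m; size_t b; };   /* hypothetical */

/* Length (excluding the NUL) that strxfrm() reports for n copies of c.
 * ISO C allows a NULL buffer when the size argument is 0.  Returns 0 if
 * memory cannot be allocated. */
static size_t
xfrm_len_of_run(char c, size_t n)
{
    assert(n > 0);

    char *s = malloc(n + 1);
    if (s == NULL)
        return 0;
    memset(s, c, n);
    s[n] = '\0';

    size_t len = strxfrm(NULL, s, 0);
    free(s);
    return len;
}

/* Fit transformed_length = m * input_length + b through two samples. */
static struct xfrm_coeffs
estimate_coeffs(char sample)
{
    const size_t short_n = 5, long_n = 10;
    size_t short_len = xfrm_len_of_run(sample, short_n);
    size_t long_len  = xfrm_len_of_run(sample, long_n);
    size_t rise = long_len > short_len ? long_len - short_len : 0;
    size_t run  = long_n - short_n;
    struct xfrm_coeffs k;

    k.m = (rise + run - 1) / run;                 /* slope, rounded up */
    k.b = short_len > k.m * short_n ? short_len - k.m * short_n : 0;
    k.b += 1;                                     /* trailing NUL */
    return k;
}

/* Buffer size to allocate for an input of input_len bytes. */
static size_t
predict_size(struct xfrm_coeffs k, size_t input_len)
{
    return k.m * input_len + k.b;
}
```

Rounding the slope up biases the prediction toward over-allocating slightly, which is cheaper than a second strxfrm() call.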
    
    Prior to this commit, the calculation was not rigorous, and failed on
    some platforms that don't have a fully conforming strxfrm().
    
    This commit changes perl so that it no longer panics if a locale has an
    apparently defective collation, but instead silently switches to
    C-locale collation.  It could be argued that a warning should
    additionally be raised.
    
    This commit fixes [perl #121734].

M       locale.c
M       pod/perldelta.pod

commit cf6c3a0bb762b84b9dd0ca6a122583b63c1d52fd
Author: Karl Williamson <[email protected]>
Date:   Mon Apr 11 19:11:07 2016 -0600

    locale.c: Change algorithm for strxfrm() trials
    
    It's kind of guesswork deciding how big a buffer to give to strxfrm().
    If you give it too small a one, it will fail.  Prior to this commit, the
    buffer size was doubled and then strxfrm() was called again, looping
    until it worked, or we used too much memory.
    
    Each time a new locale is made, we try to minimize the necessity of
    doing this by calculating numbers 'm' and 'b' that can be plugged into
    the equation
    
        mx + b
    
    where 'x' is the size of the string passed to strxfrm().  strxfrm() is
    roughly linear with respect to its input's length, so this generally
    works without us having to do many loops to get a large enough size.
    
    But on many systems, strxfrm(), in failing, returns how much space you
    should have given it.  On such systems, we can just use that number on
    the 2nd try and not have to keep guessing.  This commit changes to do
    that.
    
    But on other systems this doesn't work.  So the original method is
    retained if we determine that there are problems with strxfrm(), either
    from previous experience, or because using the size returned from the
    first trial didn't work.
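
The retry strategy above can be sketched in C as follows. This is an illustration under stated assumptions, not the code in locale.c: the function name and the XFRM_SIZE_CAP limit are hypothetical, and the fallback simply doubles the buffer as the original method did.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define XFRM_SIZE_CAP ((size_t) 1 << 30)   /* hypothetical memory limit */

/* Transform src for collation.  'guess' is the initial buffer size, for
 * example from the m * x + b prediction.  Returns a malloc'd buffer that
 * the caller frees and stores its length in *out_len, or returns NULL if
 * memory runs out or the cap is reached. */
static char *
xfrm_with_retry(const char *src, size_t guess, size_t *out_len)
{
    assert(src != NULL && out_len != NULL);

    size_t size = guess > 0 ? guess : 1;
    bool trust_return = true;   /* believe strxfrm()'s return value once */
    char *buf = NULL;

    for (;;) {
        char *bigger = realloc(buf, size);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;

        size_t needed = strxfrm(buf, src, size);
        if (needed < size) {                 /* it fit, NUL included */
            *out_len = needed;
            return buf;
        }

        if (trust_return && needed < XFRM_SIZE_CAP) {
            size = needed + 1;               /* well-behaved: exact size */
            trust_return = false;
        }
        else if (size <= XFRM_SIZE_CAP / 2) {
            size *= 2;                       /* fall back to guessing */
        }
        else {
            free(buf);
            return NULL;
        }
    }
}
```

On a conforming strxfrm() this loop runs at most twice; the doubling branch is only reached when the returned size turns out to be wrong.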

M       embedvar.h
M       intrpvar.h
M       locale.c

commit 994f6b98f8681b73235cea0ebe3fd4bae74d2f1e
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 20:40:48 2016 -0600

    locale.c: Free over-allocated space early
    
    We may over-allocate space in the buffers passed to strxfrm().  This
    frees the excess now instead of waiting for the whole block to be freed
    sometime later.
    This can be a significant amount of memory if the input string to
    strxfrm() is long.
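
In plain C the idea amounts to shrinking the buffer once the real transformed length is known. A minimal sketch, with a hypothetical function name and using realloc() directly rather than perl's own memory macros:

```c
#include <assert.h>
#include <stdlib.h>

/* Give back the unused tail of a transformation buffer.  'used_len' is
 * the transformed length excluding the NUL.  If realloc() cannot shrink
 * the block, the original (still valid) buffer is returned. */
static char *
trim_xfrm_buffer(char *buf, size_t used_len)
{
    assert(buf != NULL);

    char *trimmed = realloc(buf, used_len + 1);
    return trimmed != NULL ? trimmed : buf;
}
```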

M       locale.c

commit 50a6130b122ba08db3e66804e3e01f08aff2fe01
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 20:36:01 2016 -0600

    locale.c:  White-space only
    
    Outdent and reflow because the previous commit removed an enclosing
    block.

M       locale.c

commit ebc5a17177abaae46ab89ea299408a4d7a70d0be
Author: Karl Williamson <[email protected]>
Date:   Sat Apr 9 15:52:05 2016 -0600

    Change mem_collxfrm() algorithm for embedded NULs
    
    Perl uses strxfrm() to handle collation under locale.  This C library
    function expects a NUL-terminated input string.  But Perl accepts
    interior NUL characters, so something has to happen so strxfrm() can
    handle any Perl string.
    
    Until this commit, what happened was that each NUL-terminated
    sub-segment would be individually passed to strxfrm(), with all the
    sub-results concatenated together to form the transformation of the
    whole string with NULs ignored.  This doesn't give the best results.  In
    the typical simple, but general case, a three character string 'ABC' is
    transformed into the following, assuming three weight levels:
        A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
    where each superscripted letter indicates the weight at the given level
    for the corresponding input character (spaces are added here for
    clarity).  Each level is separated from the next by a 0x01 byte, and the
    whole thing has a trailing NUL.
    
    If 'B' is actually a NUL, what is the most desirable result is for the
    NUL to be completely ignored like this:
        A¹ C¹ 01 A² C² 01 A³ C³ 00
    It would also be ok if it were not ignored at the lowest-priority weight
    level.
    
    But the algorithm until this commit would effectively do (in pseudo-code)
        strxfrm("A") . strxfrm("C")
    generating
        A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
    
    The primary weight of C comes immediately after the tertiary weight of A,
    but more importantly, a NUL, instead of being ignored, is significant
    at the primary weight level, so that "a\0c" would sort before "ab".
    
    Another possible implementation would be to just remove the NULs before
    transforming the string.  The problem with this method is that it screws
    up the context.  In some locales, two adjacent characters can behave
    differently than if they were separated.  For example, a combining mark
    following just about anything else.
    
    Unfortunately there is no completely satisfactory implementation without
    either implementing our own strxfrm(), or reverse engineering and
    parsing the strxfrm() output on the fly.  This would be hard because of
    the multitude of possible implementations across the platforms Perl runs
    on.
    
    This commit changes to do a better job than currently of ignoring NUL at
    higher priority weight levels.  When it encounters its first embedded
    NUL for a given locale, it computes which character in the 0-255 range
    sorts earliest.  Then, whenever it finds a NUL in the input, it
    substitutes this character for the NUL.  That means the NUL will sort
    earlier than any other character below 256 (and probably any other
    character at all).  For strings that don't contain this character, this
    implementation works perfectly.  But there may be mis-sorting for
    strings that do contain it.  Usually this character will be rare.
    
    I expected that \001 would generally be the lowest sorting code point,
    but the gcc versions I tested with UTF-8 locales make it U+00B4, ACUTE
    ACCENT.  (I made sure that there was no higher code point, beyond the
    255, that sorted earlier.)
    
    This code is mostly utf8-ready.  (A few commits from now will make Perl
    work properly with UTF-8, if the platform supports it.)  But until that
    time, this isn't a full implementation; it only looks for the
    lowest-sorting UTF-8 invariant code point, where the UTF8ness
    doesn't matter.
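
A minimal C sketch of the substitution approach follows. It scans bytes 1..255 with strcoll() rather than strxfrm(), which is an assumption made for brevity, and the function names are hypothetical rather than those in locale.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the byte in 1..255 that collates lowest in the current
 * LC_COLLATE locale, judged by strcoll() on one-byte strings. */
static unsigned char
lowest_collating_byte(void)
{
    char best[2] = { 1, '\0' };

    for (int c = 2; c < 256; c++) {
        char cur[2] = { (char) c, '\0' };
        if (strcoll(cur, best) < 0)
            best[0] = (char) c;
    }
    return (unsigned char) best[0];
}

/* Copy 'len' bytes of src, replacing each embedded NUL with
 * 'substitute', so that the result can be passed to strxfrm() in one
 * piece.  Returns a malloc'd, NUL-terminated buffer that the caller
 * frees, or NULL on allocation failure. */
static char *
replace_embedded_nuls(const char *src, size_t len, unsigned char substitute)
{
    assert(src != NULL && substitute != 0);

    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++)
        copy[i] = src[i] != '\0' ? src[i] : (char) substitute;
    copy[len] = '\0';
    return copy;
}
```

Since the scan depends only on the locale, its result would be computed once per collation locale and cached, as the message describes.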

M       embed.fnc
M       embedvar.h
M       intrpvar.h
M       lib/locale.t
M       locale.c
M       pod/perldelta.pod
M       pod/perllocale.pod
M       proto.h

commit 8b972ec0ad4883ff8da759f920e643b64f8a9ff6
Author: Karl Williamson <[email protected]>
Date:   Tue May 17 21:53:53 2016 -0600

    locale.c: Add, clarify comments

M       locale.c

commit 65b7f845ff1dbd28b3708cf71445113b2f25f567
Author: Karl Williamson <[email protected]>
Date:   Mon May 16 15:19:14 2016 -0600

    Keep track of if collation locale is UTF-8 or not
    
    This will be used in future commits

M       embedvar.h
M       intrpvar.h
M       locale.c
M       sv.c

commit 7032ec763e3c3f1de13dbdb190bcfc8cced41588
Author: Karl Williamson <[email protected]>
Date:   Mon May 16 15:15:26 2016 -0600

    locale.c: Don't use special locale collation for C locale
    
    We can skip all the locale collation calculations if the locale we are
    in is C or POSIX.
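
The check itself can be as simple as a name comparison. The C sketch below queries the current collation locale with setlocale(); the function names are hypothetical, and locale.c may track the locale name differently.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* True if the locale name denotes the C (also known as POSIX) locale. */
static bool
is_c_or_posix(const char *locale_name)
{
    assert(locale_name != NULL);
    return strcmp(locale_name, "C") == 0
        || strcmp(locale_name, "POSIX") == 0;
}

/* True if the current LC_COLLATE locale needs the strxfrm() machinery;
 * in the C locale a plain byte comparison gives the same order. */
static bool
needs_locale_collation(void)
{
    const char *current = setlocale(LC_COLLATE, NULL);
    return current != NULL && !is_c_or_posix(current);
}
```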

M       locale.c

commit c091e59be6a39f9530c850d509ebdcf20bf59fec
Author: Karl Williamson <[email protected]>
Date:   Fri May 13 11:51:55 2016 -0600

    lib/locale.t: Don't calculate value unless needed

M       lib/locale.t
-----------------------------------------------------------------------

--
Perl5 Master Repository
