In perl.git, the branch smoke-me/khw-new_locale has been created
<http://perl5.git.perl.org/perl.git/commitdiff/c810d3974b6a6214b2960c8f740a6cf5551a8ccb?hp=0000000000000000000000000000000000000000>
at c810d3974b6a6214b2960c8f740a6cf5551a8ccb (commit)
- Log -----------------------------------------------------------------
commit c810d3974b6a6214b2960c8f740a6cf5551a8ccb
Author: Karl Williamson <[email protected]>
Date: Fri May 13 11:32:44 2016 -0600
locale.c: Make locale collation predictions adaptive
We try to avoid calling strxfrm() more than needed by predicting its
needed buffer size. This generally works because the size of the
transformed string is roughly linear with the size of the input string.
But the key word here is "roughly". This commit changes things, so that
when we guess low, we change the coefficients in the equation to guess
higher the next time.
M locale.c
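One way the adaptive adjustment described in the commit message above could work is sketched below. This is only an illustration: the struct, the function names, and the policy of raising the slope first are all assumptions, not the code in locale.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the linear predictor: size = m * len + b */
struct xfrm_coeffs {
    size_t m;   /* predicted output bytes per input byte */
    size_t b;   /* predicted fixed overhead */
};

/* Predicted buffer size for an input of 'len' bytes. */
static size_t predict_size(const struct xfrm_coeffs *c, size_t len)
{
    return c->m * len + c->b;
}

/* Called after a guess: 'needed' is the size strxfrm() really required
 * for an input of 'len' bytes.  If the prediction was too low, raise
 * the coefficients so the same input would now be predicted correctly.
 * (Assumed policy: raise the slope, or the intercept for empty input.) */
static void adapt_coeffs(struct xfrm_coeffs *c, size_t len, size_t needed)
{
    assert(c != NULL);
    if (needed <= predict_size(c, len))
        return;                         /* the guess was large enough */

    if (len > 0)
        /* smallest integer slope such that m * len + b >= needed */
        c->m = (needed - c->b + len - 1) / len;
    else
        c->b = needed;
}
```

For example, starting from m = 1, b = 5, an input of 10 bytes that turned out to need 32 bytes raises m to 3, so the next prediction for that length is 35.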
commit 43173b444135bb8ddb9f2ba2f9595e3c584c3234
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 14:28:57 2016 -0600
locale.c: Not so aggressive collation memory use guess
On platforms where strxfrm() is not well-behaved, and it fails because
it needs a larger buffer, prior to this commit, the size was doubled
before trying again. This could require a lot of memory on large
inputs. I'm uncomfortable with such a big delta on very large strings.
This commit changes it so it is not so aggressive. Note that this now
only gets called on platforms whose strxfrm() is not well behaved, and I
think the size prediction is better due to a recent commit, and there
isn't really much of a downside in not gobbling up memory so fast.
M locale.c
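The difference between the old doubling and a gentler growth step can be sketched as follows. The commit message above does not say what the new increment is, so the 25%-plus-one step here is purely an assumed example, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old behavior as described: double the buffer before retrying. */
static size_t grow_doubling(size_t size)
{
    assert(size <= SIZE_MAX / 2);            /* guard against overflow */
    return size * 2;
}

/* Assumed gentler policy: add a quarter of the current size, plus one
 * so that a zero-sized buffer still grows. */
static size_t grow_gently(size_t size)
{
    assert(size <= SIZE_MAX - size / 4 - 1); /* guard against overflow */
    return size + size / 4 + 1;
}
```

With these definitions a 1000-byte buffer grows to 2000 bytes under doubling but only to 1251 bytes under the gentler step, which matters most when the input is very large.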
commit 6fe21238d132628247ce7d434f846361201fe883
Author: Karl Williamson <[email protected]>
Date: Wed May 18 13:18:01 2016 -0600
locale.c: Add some debugging statements
M locale.c
commit 687c16ecaee41111ce89e0e337818fe7c76b8160
Author: Karl Williamson <[email protected]>
Date: Wed May 18 13:17:25 2016 -0600
locale.c: Minor cleanup
This replaces an expression with what I think is an easier to understand
macro, and eliminates a couple of temporary variables that just
cluttered things up.
M locale.c
commit dc743c79a68324dde33e8c15fbcb8e8a610ebf34
Author: Karl Williamson <[email protected]>
Date: Sat May 14 18:23:02 2016 -0600
locale.c: Fix some debugging so will output during init
Because the command line options are currently parsed after the locale
initialization is done, an environment variable is read to allow
debugging of the function that is called to do the initialization.
However, any functions that it calls, prior to this commit, were unaware
of this and so did not output debugging. This commit fixes most of
them.
M locale.c
commit 103486e659d33e140d2f56699d6dfeb950ffba72
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 12:49:36 2016 -0600
mv function from locale.c to mathoms.c
After the previous commit, the function being moved is just a
wrapper that is not called in core. In case someone is calling it, it is
retained, but moved to mathoms.c.
M embed.fnc
M embed.h
M locale.c
M mathoms.c
M proto.h
commit 42abc6a8811fd6ae19908f55f757bb2d645ef2b8
Author: Karl Williamson <[email protected]>
Date: Tue May 17 20:50:55 2016 -0600
Do better locale collation in UTF-8 locales
strxfrm() works reasonably well on some platforms under UTF-8 locales.
It will assume that every string passed to it is in UTF-8. This commit
changes perl to make sure that strxfrm's expectations are met.
Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
string. And this commit makes sure of that. If the passed string
contains code points representable only in UTF-8, they are changed into
the highest collating code point that doesn't require UTF-8. This
provides seamless operation, as they end up collating after every
non-UTF-8 code point. If two transformed strings compare equal, perl
already uses the un-transformed versions to break ties, and there, these
faked-up strings will collate after everything else, and in code point
order amongst themselves.
M embed.fnc
M embed.h
M embedvar.h
M intrpvar.h
M lib/locale.t
M locale.c
M pod/perldelta.pod
M pod/perllocale.pod
M proto.h
M sv.c
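The non-UTF-8 case described in the commit message above can be pictured with the sketch below. It assumes the input is already available as an array of code points and that the highest-collating single-byte character has been determined elsewhere; the function name and parameters are hypothetical, and embedded NULs are assumed to have been dealt with separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Build a NUL-terminated single-byte string suitable for strxfrm() in a
 * non-UTF-8 locale.  Any code point that does not fit in one byte is
 * replaced by 'highest', the byte assumed to collate last in the locale.
 * Returns a malloc'd string, or NULL on allocation failure. */
static char *downgrade_for_xfrm(const uint32_t *cps, size_t n,
                                unsigned char highest)
{
    assert(cps != NULL || n == 0);

    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++)
        out[i] = (char)(cps[i] > 0xFF ? highest : (unsigned char)cps[i]);
    out[n] = '\0';
    return out;
}
```

Because every replaced code point maps to the same byte, such strings can transform identically; the tie-break on the untransformed strings mentioned above is what then orders them by code point.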
commit 96da17d1204b9a64d4dcec6b3d0c28e2a11152cc
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 13:51:48 2016 -0600
perllocale: Change headings so two aren't identical
Two html anchors in this pod were identical, which isn't a problem
unless you try to link to one of them, as the next commit does.
M pod/perllocale.pod
commit 46c91e86354b37faa49ed139b623c19ae188db6c
Author: Karl Williamson <[email protected]>
Date: Tue Apr 12 11:21:40 2016 -0600
Change calculation of locale collation coefficients
Every time a new collation locale is set, two coefficients are calculated
that are used in predicting how much space is needed in the
transformation of a string by strxfrm(). The transformed string is
roughly linear with the length of the input string, so we are
calculating 'm' and 'b' such that
transformed_length = m * input_length + b
Space is allocated based on this prediction. If it is too small, the
strxfrm() will fail, and we will have to increase the allotted amount
and try again. It's better to get the prediction right to avoid
multiple, expensive strxfrm() calls.
Prior to this commit, the calculation was not rigorous, and failed on
some platforms that don't have a fully conforming strxfrm().
This commit changes perl to not panic if a locale has an apparently
defective collation, but instead to silently switch to C-locale collation.
It could be argued that a warning should additionally be raised.
This commit fixes [perl #121734].
M locale.c
M pod/perldelta.pod
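The linear relationship in the commit message above can be illustrated by fitting a line through two measurements. This is a sketch only: the helper names, the use of two sample points, and the use of floating point are assumptions, and it relies on strxfrm() reporting the required length when given a zero-sized buffer, which the message notes not every platform does correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: transformed_length = m * input_length + b */
struct line {
    double m;
    double b;
};

/* Fit a line through two (input_length, transformed_length) samples. */
static struct line fit_line(size_t x1, size_t y1, size_t x2, size_t y2)
{
    struct line l;
    assert(x1 != x2);
    l.m = ((double)y2 - (double)y1) / ((double)x2 - (double)x1);
    l.b = (double)y1 - l.m * (double)x1;
    return l;
}

/* Ask strxfrm() how long the transformation of 'n' copies of 'c' is in
 * the current locale.  A NULL buffer with size 0 is allowed by the C
 * standard and yields the needed length, excluding the trailing NUL. */
static size_t xfrm_len_of_repeat(char c, size_t n)
{
    char buf[64];
    assert(n < sizeof buf);
    memset(buf, c, n);
    buf[n] = '\0';
    return strxfrm(NULL, buf, 0);
}
```

For instance, samples of 5 bytes for a 1-character input and 20 bytes for a 6-character input give m = 3 and b = 2. In practice the two samples would come from calls such as xfrm_len_of_repeat('a', 1) and xfrm_len_of_repeat('a', 6).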
commit cf6c3a0bb762b84b9dd0ca6a122583b63c1d52fd
Author: Karl Williamson <[email protected]>
Date: Mon Apr 11 19:11:07 2016 -0600
locale.c: Change algorithm for strxfrm() trials
It's kind of guesswork deciding how big a buffer to give to strxfrm().
If you give it too small a one, it will fail. Prior to this commit, the
buffer size was doubled and then strxfrm() was called again, looping
until it worked, or we used too much memory.
Each time a new locale is made, we try to minimize the necessity of
doing this by calculating numbers 'm' and 'b' that can be plugged into
the equation
mx + b
where 'x' is the size of the string passed to strxfrm(). strxfrm() is
roughly linear with respect to its input's length, so this generally
works without us having to do many loops to get a large enough size.
But on many systems, strxfrm(), in failing, returns how much space you
should have given it. On such systems, we can just use that number on
the 2nd try and not have to keep guessing. This commit changes to do
that.
But on other systems this doesn't work. So the original method is
retained if we determine that there are problems with strxfrm(), either
from previous experience, or because using the size returned from the
first trial didn't work.
M embedvar.h
M intrpvar.h
M locale.c
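The retry strategy described in the commit message above can be sketched roughly as follows. The function name, the single "trust the return value once" flag, and the arbitrary size cap are assumptions made for this illustration; the real code also remembers problems from previous experience, which is not modeled here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Transform 's' for collation in the current locale, starting from a
 * guessed buffer size.  Returns a malloc'd, NUL-terminated key, or NULL
 * if memory runs out or the (arbitrary) size cap is exceeded. */
static char *xfrm_with_retry(const char *s, size_t guess)
{
    const size_t limit = (size_t)1 << 24;  /* cap chosen for this sketch */
    size_t size = guess > 0 ? guess : 1;
    int trust_return = 1;   /* assume strxfrm() is well behaved at first */
    char *buf = NULL;

    assert(s != NULL);
    while (size <= limit) {
        char *tmp = realloc(buf, size);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;

        size_t needed = strxfrm(buf, s, size);
        if (needed < size)
            return buf;             /* result and trailing NUL both fit */

        if (trust_return) {
            size = needed + 1;      /* use the size strxfrm() asked for */
            trust_return = 0;       /* but only believe it once */
        } else {
            size *= 2;              /* fall back to the original method */
        }
    }
    free(buf);
    return NULL;
}
```

On a well-behaved platform this needs at most two strxfrm() calls per string, regardless of how poor the initial guess was.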
commit 994f6b98f8681b73235cea0ebe3fd4bae74d2f1e
Author: Karl Williamson <[email protected]>
Date: Sat Apr 9 20:40:48 2016 -0600
locale.c: Free over-allocated space early
We may over malloc some space in buffers to strxfrm(). This frees it
now instead of waiting for the whole block to be freed sometime later.
This can be a significant amount of memory if the input string to
strxfrm() is long.
M locale.c
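Releasing the unused tail of an over-sized buffer, as described in the commit message above, can be sketched with realloc(). The function name is hypothetical, and the sketch treats the key as a plain NUL-terminated C string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Shrink 'buf' so that it holds exactly its NUL-terminated contents.
 * Returns the pointer to use from now on: the resized block, or the
 * original block if realloc() could not shrink it. */
static char *trim_allocation(char *buf)
{
    assert(buf != NULL);
    char *trimmed = realloc(buf, strlen(buf) + 1);
    return trimmed != NULL ? trimmed : buf;
}
```

If realloc() fails, the original block is still valid, so falling back to it loses nothing except the memory saving.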
commit 50a6130b122ba08db3e66804e3e01f08aff2fe01
Author: Karl Williamson <[email protected]>
Date: Sat Apr 9 20:36:01 2016 -0600
locale.c: White-space only
Outdent and reflow because the previous commit removed an enclosing
block.
M locale.c
commit ebc5a17177abaae46ab89ea299408a4d7a70d0be
Author: Karl Williamson <[email protected]>
Date: Sat Apr 9 15:52:05 2016 -0600
Change mem_collxfrm() algorithm for embedded NULs
Perl uses strxfrm() to handle collation under locale. This C library
function expects a NUL-terminated input string. But Perl accepts
interior NUL characters, so something has to happen so strxfrm() can
handle any Perl string.
Until this commit, what happened was that each NUL-terminated
sub-segment would be individually passed to strxfrm(), with all the
sub-results concatenated together to form the transformation of the
whole string with NULs ignored. This doesn't give the best results. In
the typical simple, but general case, a three character string 'ABC' is
transformed into the following, assuming three weight levels:
A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00
where each superscripted letter indicates the weight at the given level
for the corresponding input character (spaces are added here for
clarity). Each level is separated from the next by a 0x01 byte, and the
whole thing has a trailing NUL.
If 'B' is actually a NUL, what is the most desirable result is for the
NUL to be completely ignored like this:
A¹ C¹ 01 A² C² 01 A³ C³ 00
It would also be ok if it were not ignored at the lowest-priority weight
level.
But the algorithm until this commit would effectively do (in pseudo-code)
strxfrm("A") . strxfrm("C")
generating
A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00
The primary weight of C comes immediately after the tertiary weight of A,
but more importantly, a NUL, instead of being ignored, is significant
at the primary weight level, so that "a\0c" would sort before "ab".
Another possible implementation would be to just remove the NULs before
transforming the string. The problem with this method is that it screws
up the context. In some locales, two adjacent characters can behave
differently than if they were separated, for example a combining mark
following just about anything else.
Unfortunately there is no completely satisfactory implementation without
either implementing our own strxfrm(), or reverse engineering and
parsing the strxfrm() output on the fly. This would be hard because of
the multitude of possible implementations across the platforms Perl runs
on.
This commit changes to do a better job than currently of ignoring NUL at
higher priority weight levels. When it encounters its first embedded
NUL for a given locale, it computes which character in the 0-255 range
sorts earliest. Then, whenever it finds a NUL in the input, it
substitutes this character for the NUL. That means the NUL will sort
earlier than any other character below 256 (and probably any other
character at all). For strings that don't contain this character, this
implementation works perfectly. But there may be mis-sorting for
strings that do contain it. Usually this character will be rare.
I expected that \001 would generally be the lowest sorting code point,
but the gcc versions I tested with UTF-8 locales make it U+00B4, ACUTE
ACCENT. (I made sure that there was no higher code point, beyond the
255, that sorted earlier.)
This code is mostly utf8-ready. (A few commits from now will make Perl
properly work with UTF-8, if the platform supports it.) But until that
time, this isn't a full implementation; it only looks for the
lowest-sorting UTF-8 invariant code point, where the UTF8ness
doesn't matter.
M embed.fnc
M embedvar.h
M intrpvar.h
M lib/locale.t
M locale.c
M pod/perldelta.pod
M pod/perllocale.pod
M proto.h
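The NUL substitution described in the commit message above can be sketched in two steps: find the earliest-collating character, then swap it in for each embedded NUL. The function names are hypothetical, the search uses strcoll() on one-character strings as a stand-in for whatever comparison locale.c actually performs, and it is limited to 1..127 to mirror the note that only UTF-8 invariant code points are considered for now.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the byte in 1..127 whose one-character string collates
 * earliest in the current locale, according to strcoll(). */
static unsigned char lowest_collating_byte(void)
{
    char best[2] = { 1, '\0' };
    for (int c = 2; c < 128; c++) {
        char cand[2] = { (char)c, '\0' };
        if (strcoll(cand, best) < 0)
            best[0] = cand[0];
    }
    return (unsigned char)best[0];
}

/* Copy 'len' bytes of 'in', replacing each embedded NUL with 'repl', so
 * the result can be passed to strxfrm() as one NUL-terminated string.
 * Returns a malloc'd string, or NULL on allocation failure. */
static char *replace_nuls(const char *in, size_t len, unsigned char repl)
{
    assert(in != NULL || len == 0);

    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++)
        out[i] = (in[i] == '\0') ? (char)repl : in[i];
    out[len] = '\0';
    return out;
}
```

A caller would compute the replacement byte once per collation locale and reuse it for every string that contains a NUL.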
commit 8b972ec0ad4883ff8da759f920e643b64f8a9ff6
Author: Karl Williamson <[email protected]>
Date: Tue May 17 21:53:53 2016 -0600
locale.c: Add, clarify comments
M locale.c
commit 65b7f845ff1dbd28b3708cf71445113b2f25f567
Author: Karl Williamson <[email protected]>
Date: Mon May 16 15:19:14 2016 -0600
Keep track of if collation locale is UTF-8 or not
This will be used in future commits
M embedvar.h
M intrpvar.h
M locale.c
M sv.c
commit 7032ec763e3c3f1de13dbdb190bcfc8cced41588
Author: Karl Williamson <[email protected]>
Date: Mon May 16 15:15:26 2016 -0600
locale.c: Don't use special locale collation for C locale
We can skip all the locale collation calculations if the locale we are
in is C or POSIX.
M locale.c
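The shortcut in the commit message above amounts to a name check before any collation work is done. A minimal sketch, assuming the current collation locale's name is available as a string (both function names are hypothetical):

```c
#include <assert.h>
#include <string.h>

/* True if the named locale is the C or POSIX locale, where collation is
 * plain byte order and no strxfrm() bookkeeping is needed. */
static int is_c_or_posix(const char *locale_name)
{
    assert(locale_name != NULL);
    return strcmp(locale_name, "C") == 0
        || strcmp(locale_name, "POSIX") == 0;
}

/* Compare two strings, skipping the locale machinery when possible. */
static int compare_in_locale(const char *locale_name,
                             const char *a, const char *b)
{
    if (is_c_or_posix(locale_name))
        return strcmp(a, b);
    return strcoll(a, b);
}
```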
commit c091e59be6a39f9530c850d509ebdcf20bf59fec
Author: Karl Williamson <[email protected]>
Date: Fri May 13 11:51:55 2016 -0600
lib/locale.t: Don't calculate value unless needed
M lib/locale.t
-----------------------------------------------------------------------
--
Perl5 Master Repository