In perl.git, the branch smoke-me/khw-new_locale has been created <http://perl5.git.perl.org/perl.git/commitdiff/afbd0374e5467934a57030136eb9a3132798cb54?hp=0000000000000000000000000000000000000000>
at afbd0374e5467934a57030136eb9a3132798cb54 (commit)

- Log -----------------------------------------------------------------
commit afbd0374e5467934a57030136eb9a3132798cb54
Author: Karl Williamson <k...@cpan.org>
Date:   Thu May 19 20:03:06 2016 -0600

    temp debug

M locale.c

commit 13cd9f3403c64ca05427bff7e0bf643b0c014009
Author: Karl Williamson <k...@cpan.org>
Date:   Fri May 13 11:32:44 2016 -0600

    locale.c: Make locale collation predictions adaptive

    We try to avoid calling strxfrm() more than needed by predicting its
    needed buffer size. This generally works because the size of the
    transformed string is roughly linear with the size of the input
    string. But the key word here is "roughly". This commit changes
    things, so that when we guess low, we change the coefficients in the
    equation to guess higher the next time.

M locale.c

commit 3778a9185f5bde6077feab14f748a8b1323cece0
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 14:28:57 2016 -0600

    locale.c: Not so aggressive collation memory use guess

    On platforms where strxfrm() is not well-behaved, and it fails
    because it needs a larger buffer, prior to this commit, the size was
    doubled before trying again. This could require a lot of memory on
    large inputs. I'm uncomfortable with such a big delta on very large
    strings. This commit changes it so it is not so aggressive. Note
    that this now only gets called on platforms whose strxfrm() is not
    well behaved, and I think the size prediction is better due to a
    recent commit, and there isn't really much of a downside in not
    gobbling up memory so fast.
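
The adaptive prediction described in commit 13cd9f3 can be pictured with a small sketch. This is not the code in locale.c: the struct `xfrm_predictor`, the functions `predict_size()` and `adapt()`, the integer coefficients, and the rule of raising only the slope are all assumptions made for illustration. The log says only that the coefficients are changed to guess higher next time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical predictor state: predicted length = m * input_len + b */
typedef struct {
    size_t m;   /* transformed bytes per input byte */
    size_t b;   /* constant overhead in bytes */
} xfrm_predictor;

/* Buffer size to try first, including room for the trailing NUL. */
size_t predict_size(const xfrm_predictor *p, size_t input_len)
{
    assert(p != NULL);
    return p->m * input_len + p->b + 1;
}

/* Called after a guess turned out too low; actual_len is the transformed
 * length that was really needed (excluding the NUL).  Raises the
 * coefficients just enough that the same input would now fit. */
void adapt(xfrm_predictor *p, size_t input_len, size_t actual_len)
{
    assert(p != NULL);
    if (actual_len <= p->b)
        return;                     /* current coefficients already suffice */
    if (input_len == 0) {
        p->b = actual_len;          /* only the constant term can help */
        return;
    }
    /* Smallest slope with m * input_len + b >= actual_len (rounded up). */
    size_t needed_m = (actual_len - p->b + input_len - 1) / input_len;
    if (needed_m > p->m)
        p->m = needed_m;
}
```

Starting from m = 1 and b = 0, a 10-byte input that turned out to need 25 bytes would raise m to 3, so the next 10-byte input gets a 31-byte buffer on the first try.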
M locale.c

commit f7f17744dba2e167364f09739d164c7f62a75b56
Author: Karl Williamson <k...@cpan.org>
Date:   Wed May 18 13:18:01 2016 -0600

    locale.c: Add some debugging statements

M locale.c

commit e0c750a24be340515c407dd1ffb90adc8e16affa
Author: Karl Williamson <k...@cpan.org>
Date:   Wed May 18 13:17:25 2016 -0600

    locale.c: Minor cleanup

    This replaces an expression with what I think is an easier to
    understand macro, and eliminates a couple of temporary variables
    that just cluttered things up.

M locale.c

commit b5e170845830f04f818a72e04a6ba2e8ac8fd290
Author: Karl Williamson <k...@cpan.org>
Date:   Sat May 14 18:23:02 2016 -0600

    locale.c: Fix some debugging so will output during init

    Because the command line options are currently parsed after the
    locale initialization is done, an environment variable is read to
    allow debugging of the function that is called to do the
    initialization. However, any functions that it calls, prior to this
    commit, were unaware of this and so did not output debugging. This
    commit fixes most of them.

M locale.c

commit ea4b52b79529272e1817230f1c46046680cad665
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 12:49:36 2016 -0600

    mv function from locale.c to mathoms.c

    The previous commit causes this function being moved to be just a
    wrapper not called in core. Just in case someone is calling it, it
    is retained, but moved to mathoms.c

M embed.fnc
M embed.h
M locale.c
M mathoms.c
M proto.h

commit 0f664b45741ec7ece2f83a3fb9e7e6666d9ba446
Author: Karl Williamson <k...@cpan.org>
Date:   Tue May 17 20:50:55 2016 -0600

    Do better locale collation in UTF-8 locales

    strxfrm() works reasonably well on some platforms under UTF-8
    locales. It will assume that every string passed to it is in UTF-8.
    This commit changes perl to make sure that strxfrm's expectations
    are met.

    Likewise under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8
    string. And this commit makes sure of that.

    If the passed string contains code points representable only in
    UTF-8, they are changed into the highest collating code point that
    doesn't require UTF-8. This provides seamless operation, as they end
    up collating after every non-UTF-8 code point. If two transformed
    strings compare equal, perl already uses the un-transformed versions
    to break ties, and there, these faked-up strings will collate after
    everything else, and in code point order amongst themselves.

M embed.fnc
M embed.h
M embedvar.h
M intrpvar.h
M lib/locale.t
M locale.c
M pod/perldelta.pod
M pod/perllocale.pod
M proto.h
M sv.c

commit 98c0717af374827c0c1f6b48a9345af3a1388945
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 13:51:48 2016 -0600

    perllocale: Change headings so two aren't identical

    Two html anchors in this pod were identical, which isn't a problem
    unless you try to link to one of them, as the next commit does.

M pod/perllocale.pod

commit 865ce47b5b3f15d73f16f00e8cadb72e2a41fc37
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Apr 12 11:21:40 2016 -0600

    Change calculation of locale collation coefficients

    Every time a new collation locale is set, two coefficients are
    calculated that are used in predicting how much space is needed in
    the transformation of a string by strxfrm(). The length of the
    transformed string is roughly linear with the length of the input
    string, so we are calculating 'm' and 'b' such that

        transformed_length = m * input_length + b

    Space is allocated based on this prediction. If it is too small, the
    strxfrm() will fail, and we will have to increase the allotted
    amount and try again. It's better to get the prediction right to
    avoid multiple, expensive strxfrm() calls.

    Prior to this commit, the calculation was not rigorous, and failed
    on some platforms that don't have a fully conforming strxfrm().

    This commit changes to not panic if a locale has an apparent
    defective collation, but instead silently change to use C-locale
    collation.
    It could be argued that a warning should additionally be raised.

    This commit fixes [perl #121734].

M locale.c
M pod/perldelta.pod

commit 3dc247eea6bdd7faf3324dac8fdc3a153ccaef83
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Apr 11 19:11:07 2016 -0600

    locale.c: Change algorithm for strxfrm() trials

    It's kind of guesswork deciding how big a buffer to give to
    strxfrm(). If you give it too small a one, it will fail. Prior to
    this commit, the buffer size was doubled and then strxfrm() was
    called again, looping until it worked, or we used too much memory.

    Each time a new locale is made, we try to minimize the necessity of
    doing this by calculating numbers 'm' and 'b' that can be plugged
    into the equation

        mx + b

    where 'x' is the size of the string passed to strxfrm(). strxfrm()
    is roughly linear with respect to its input's length, so this
    generally works without us having to do many loops to get a large
    enough size.

    But on many systems, strxfrm(), in failing, returns how much space
    you should have given it. On such systems, we can just use that
    number on the 2nd try and not have to keep guessing. This commit
    changes to do that.

    But on other systems this doesn't work. So the original method is
    retained if we determine that there are problems with strxfrm(),
    either from previous experience, or because using the size returned
    from the first trial didn't work.

M embedvar.h
M intrpvar.h
M locale.c

commit 811bfb90f56d922c9378389ec40c45d7fb1e8e02
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Apr 9 20:40:48 2016 -0600

    locale.c: Free over-allocated space early

    We may over malloc some space in buffers to strxfrm(). This frees it
    now instead of waiting for the whole block to be freed sometime
    later. This can be a significant amount of memory if the input
    string to strxfrm() is long.
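
The two-trial approach from commit 3dc247e and the early freeing from commit 811bfb9 can be sketched together in standard C. This is an illustration under stated assumptions, not the code in locale.c: the function name `xfrm_alloc()` and its parameters are hypothetical, and it simply gives up where the real code is described as falling back to the older guessing loop. The only library behavior relied on is that strxfrm() returns the transformed length, excluding the NUL, even when the buffer was too small.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: transform src under the current LC_COLLATE locale.
 * 'guess' is the predicted buffer size (e.g. from m*x + b).  Returns a
 * malloc'd string and stores its length in *out_len, or NULL on failure. */
char *xfrm_alloc(const char *src, size_t guess, size_t *out_len)
{
    assert(src != NULL && out_len != NULL);

    size_t size = guess ? guess : 1;
    char *buf = malloc(size);
    if (buf == NULL)
        return NULL;

    /* First trial, using the predicted size. */
    size_t needed = strxfrm(buf, src, size);
    if (needed >= size) {
        /* Too small: trust the returned length for the second trial. */
        size = needed + 1;
        char *bigger = realloc(buf, size);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        needed = strxfrm(buf, src, size);
        if (needed >= size) {
            /* strxfrm() is not well behaved here; a caller would have to
             * fall back to repeatedly growing the buffer. */
            free(buf);
            return NULL;
        }
    } else if (needed + 1 < size) {
        /* Over-allocated: hand the unused tail back right away. */
        char *smaller = realloc(buf, needed + 1);
        if (smaller != NULL)
            buf = smaller;
    }

    *out_len = needed;
    return buf;
}
```

A caller would pass the predicted size as `guess`, compare the returned strings with strcmp(), and free() each one when done.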
M locale.c

commit 95e2db4d7bdb04167c0b816519b0e716c8c6c43b
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Apr 9 20:36:01 2016 -0600

    locale.c: White-space only

    Outdent and reflow because the previous commit removed an enclosing
    block.

M locale.c

commit e5bd1c7aadf571974fd3922b79568f73f622f106
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Apr 9 15:52:05 2016 -0600

    XXX pod, left in generality Change mem_collxfrm() algorithm for embedded NULs

    One of the problems in implementing Perl is that the C library
    routines forbid embedded NUL characters, which Perl accepts. This is
    true for the case of strxfrm() which handles collation under locale.

    The best solution as far as functionality goes, would be for Perl to
    write its own strxfrm replacement which would handle the specific
    needs of Perl. But that is not going to happen because of the huge
    complexity in handling it across many platforms. We would have to
    know the location and format of the locale definition files for
    every such platform. Some might follow POSIX guidelines, some might
    not.

    strxfrm creates a transformation of its input into a new string
    consisting of weight bytes. In the typical but general case, a 3
    character NUL-terminated input string 'A B C 00' (spaces added for
    readability) gets transformed into something like:

        A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00

    where the superscripted characters are weights for the corresponding
    input characters. Superscript 1 represents the primary sorting key;
    2, the secondary, etc, for as many levels as the locale definition
    gives. The 01 byte is likely to be the separator between levels, but
    not necessarily, and there could be some other mechanisms used on
    various platforms.

    To handle embedded NULs, the simplest thing would be to just remove
    them before passing in to strxfrm(). Then they would be entirely
    ignored, which might not be what you want. You might want them to
    have some weight at the tertiary level, for example.
    It also causes problems because strxfrm is very context sensitive.
    The locale definition can define weights for specific sequences of
    any length (and the weights can be multi-byte), and by removing a
    NUL, two characters now become adjacent that weren't in the input,
    and they could now form one of those special sequences and thus
    throw things off.

    Another way to handle NULs, that seemingly ignores them, but
    actually doesn't, is the mechanism in use prior to this commit. The
    input string is split at the NULs, and the substrings are
    independently passed to strxfrm, and the results concatenated
    together. This doesn't work either. In our example 'A B C 00',
    suppose B is a NUL, and should have some weight at the tertiary
    level. What we want is:

        A¹ C¹ 01 A² C² 01 A³ B³ C³ 00

    But that's not at all what you get. Instead it is:

        A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00

    The primary weight of C comes immediately after the tertiary weight
    of A, but more importantly, a NUL, instead of being ignored at the
    primary levels, is significant at all levels, so that "a\0c" would
    sort before "ab".

    Still another possibility is to replace the NUL with some other
    character before passing it to strxfrm. That was my original plan,
    to replace each NUL with the character that this code determines has
    the lowest collation order for the current locale. On strings that
    don't contain that character, the results would be as good as it
    gets for that locale. That character is likely to be ignored at
    higher weight levels, but have some small non-ignored weight at the
    lowest ones. And hopefully the character would rarely be encountered
    in practice. When it does happen, it and NUL would sort identically;
    hardly the end of the world. If the entire strings sorted
    identically, the NUL-containing one would come out before the other
    one, since the original Perl strings are used as a tie breaker.
    However, testing showed a problem with this.
    If that other character is part of a sequence that has special
    weighting, the results won't be correct. With gcc, U+00B4 ACUTE
    ACCENT is the lowest collating character in many UTF-8 locales. It
    combines in Romanian and Vietnamese with some other characters to
    change weights, and hence changing NULs into it screws things up.

    What I have finally come to is a modification of this final
    approach, where the possible NUL replacements are limited to just
    characters that are controls in the locale. NULs are replaced by the
    lowest collating control. It would really be a defective locale if
    this control combined with some other character to form a special
    sequence. Often the character will be a 01, START OF HEADING. In the
    very unlikely case that there are absolutely no controls in the
    locale, 01 is used, because SOMETHING has to be.

    The code added by this commit is mostly utf8-ready. A few commits
    from now will make Perl properly work with UTF-8 (if the platform
    supports it). But until that time, this isn't a full implementation;
    it only looks for the lowest-sorting control that is invariant,
    where the UTF8ness doesn't matter.
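
The final approach in the message above (replace each NUL with the lowest collating control) can be sketched in standard C. This is a minimal illustration, not the code added to locale.c: the function names `lowest_collating_control()` and `replace_nuls()` are hypothetical, and restricting the search to 1..127 is an assumption standing in for the message's "invariant" controls.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the control character in 1..127 that collates
 * lowest in the current locale.  Falls back to 0x01 if the locale reports
 * no controls at all, as the commit message describes. */
unsigned char lowest_collating_control(void)
{
    unsigned char best = 0;
    for (int c = 1; c < 128; c++) {
        if (!iscntrl(c))
            continue;
        if (best == 0) {
            best = (unsigned char)c;    /* first control seen */
            continue;
        }
        char cur[2] = { (char)c, '\0' };
        char low[2] = { (char)best, '\0' };
        if (strcoll(cur, low) < 0)      /* cur sorts before current best */
            best = (unsigned char)c;
    }
    return best ? best : 0x01;
}

/* Copy len bytes of src into dst, replacing embedded NULs so the result
 * can be handed to strxfrm().  dst must have room for len + 1 bytes. */
void replace_nuls(char *dst, const char *src, size_t len,
                  unsigned char replacement)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = (src[i] == '\0') ? (char)replacement : src[i];
    dst[len] = '\0';
}
```

In the default "C" locale the search settles on 01, so the three bytes "a\0c" become "a\001c" before transformation.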
M embed.fnc
M embedvar.h
M intrpvar.h
M lib/locale.t
M locale.c
M pod/perldelta.pod
M pod/perllocale.pod
M proto.h

commit f43d6f7b36558c0818c7bcc3ef00b21704bc9363
Author: Karl Williamson <k...@cpan.org>
Date:   Tue May 17 21:53:53 2016 -0600

    locale.c: Add, move, clarify comments

    This moves a large block of comments to before a block, outdents it,
    and adds to it, plus adding another comment

M locale.c

commit 66b48b5d307f53df9ca8e631c6146864e12797c1
Author: Karl Williamson <k...@cpan.org>
Date:   Mon May 16 15:19:14 2016 -0600

    Keep track of if collation locale is UTF-8 or not

    This will be used in future commits

M embedvar.h
M intrpvar.h
M locale.c
M sv.c

commit 022c3cea46bc737580fc97e28ff875dc7628afc6
Author: Karl Williamson <k...@cpan.org>
Date:   Mon May 16 15:15:26 2016 -0600

    locale.c: Don't use special locale collation for C locale

    We can skip all the locale collation calculations if the locale we
    are in is C or POSIX.

M locale.c

commit 968e5483b78567e1dc71642588edad96c8614efe
Author: Karl Williamson <k...@cpan.org>
Date:   Fri May 13 11:51:55 2016 -0600

    lib/locale.t: Don't calculate value unless needed

M lib/locale.t
-----------------------------------------------------------------------

--
Perl5 Master Repository