Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
On Fri, Feb 26, 2010 at 12:38:08AM -0600, Jonathan Nieder wrote:

> Computers are dumb
> ------------------
>
> Andras wrote:
>
> > 1. grep has no way of knowing whether a zs sequence is a single letter or two letters, because the combination can occur in compound words without becoming a zs letter; for example, in fúvószenekar (fúvós + zenekar), it's simply an s and a z letter next to each other. There may even exist words that make (a different) sense either way, but I can't think of any right now.
>
> Are there simple heuristics that would make this condition easy to discover? For example, vowels that would never appear before a true sz letter, things like that? I am just curious; please feel free to e-mail me privately about this.
>
> This sounds like a (hard to fix) bug in the collation algorithm, but not a reason not to make 'sort' follow the conventions of the language.

Sorting is actually also tricky with dumb computers, because there is no way for sort to know whether e.g. nyolcszáz contains a cs collating symbol followed by z or a c followed by an sz collating symbol (the latter is in fact the case). cs+z would be sorted after cz (because cs comes after c), but c+sz would be sorted _before_ cz because sz precedes z.

I'd say this is unfixable. There is no way, short of understanding the natural language, for a program to determine whether two (or three) characters represent a single collating symbol or themselves.

Clearly, the Hungarian language must be fixed, either by introducing separate glyphs for the composite letters, or by no longer insisting that something represented by more than one character is a single letter. :)

Andras

-- 
If Chuck Norris had been Spartan, the movie would simply have been called 1.

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
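The nyolcszáz ambiguity can be reproduced with a naive scanner. The sketch below is not how glibc collates; it only shows what a left-to-right, longest-match grouping does with the letter list from the bug report (zs, sz, cs, ty, dz, dzs, gy, ly). The function name mark_digraphs is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: put brackets around every multi-character letter that a
# naive left-to-right, longest-match scan finds. The letter list is the one
# quoted in the bug report and is not meant to be complete.
mark_digraphs() {
    # LC_ALL=C makes sed operate on bytes, so the result does not depend on
    # which locales happen to be installed.
    LC_ALL=C sed -E 's/(dzs|cs|dz|gy|ly|sz|ty|zs)/[&]/g'
}

# The scan groups c+s, although the word is nyolc + száz (c followed by sz).
printf '%s\n' 'nyolcszáz' | mark_digraphs   # prints nyol[cs]záz
```

The scanner picks cs because it reaches the c first; telling it to prefer sz here would require knowing where the compound splits, which is the point made in the message above.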
Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
Hi again,

Odd names for collating elements
--------------------------------

I wrote:

> $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.ch.]]/MATCHED/'
> sed: -e expression #1, char 21: Invalid collation character
>
> Odd, no?

It did seem odd, especially since the POSIX documentation uses examples like this all the time (usually [.ch.] from pre-1994 Spanish). For example [1]:

    collating-element <ch> from "<c><h>"
    collating-element <e-acute> from "<acute><e>"
    collating-element <ll> from "ll"

I was missing something obvious: in GNU locales, the collating element has a hyphenated name.

    collating-symbol <zs>
    collating-element <z-s> from "<U007A><U0073>"

    $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.c-h.]]/MATCHED/'
    MATCHED and more

So there’s the workaround.

I think this is a real bug: POSIX.1-2008 says [2]:

    A collating symbol is a collating element enclosed within bracket-period ( [. and .] ) delimiters. Collating elements are defined as described in Collation Order . Conforming applications shall represent multi-character collating elements as collating symbols when it is necessary to distinguish them from a list of the individual characters that make up the multi-character collating element. For example, if the string ch is a collating element defined using the line:

        collating-element <ch-digraph> from "<c><h>"

    in the locale definition, the expression [[.ch.]] shall be treated as an RE containing the collating symbol 'ch', while [ch] shall be treated as an RE matching 'c' or 'h' . Collating symbols are recognized only inside bracket expressions. If the string is not a collating element in the current locale, the expression is invalid.

In other words, in the “collating-element <z-s> from "<U007A><U0073>"” line, it is not the z-s that names the collating symbol in regexps. This makes sense, since otherwise how could anyone write portable regular expressions?
Writing [:alpha:] in Hungarian
------------------------------

Andras wrote:

> zs in particular is causing trouble for grep:
>
> % echo zs | LANG=C grep '^[^a-z]*$'
> % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
> zs

Any program using such constructions without LC_COLLATE=C or similar is IMHO buggy because of exactly this problem. With some C libraries (though not current glibc, luckily), in English, [^a-z] matches A but not Z or vice versa [3]. (Current POSIX leaves the behavior unspecified.)

The . notation seems to work here:

    % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-[.z-s.]]*$'
    %

Once the regexp engine is fixed, that regexp would become '^[^a-[.zs.]]*$'.

Bracket expressions match collating elements
--------------------------------------------

Andras wrote:

> % echo ty | LANG=C grep '^[s-u]*$'
> % echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
> ty

POSIX is unambiguous about this: bracket expressions match collating elements, not characters. I can imagine situations where this would be helpful and situations where it would be unhelpful. Mostly, it just seems difficult to do any other way, since otherwise what would the ranges mean?

The simplest workaround is to use LC_COLLATE=C (or en_US.UTF-8, or C.UTF-8 once glibc learns that, or whatever locale has the behavior you want).

Computers are dumb
------------------

Andras wrote:

> 1. grep has no way of knowing whether a zs sequence is a single letter or two letters, because the combination can occur in compound words without becoming a zs letter; for example, in fúvószenekar (fúvós + zenekar), it's simply an s and a z letter next to each other. There may even exist words that make (a different) sense either way, but I can't think of any right now.

Are there simple heuristics that would make this condition easy to discover? For example, vowels that would never appear before a true sz letter, things like that? I am just curious; please feel free to e-mail me privately about this.

This sounds like a (hard to fix) bug in the collation algorithm, but not a reason not to make 'sort' follow the conventions of the language.
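The C-locale workaround mentioned under "Bracket expressions match collating elements" can be wrapped so that a script does not depend on the caller's environment. This is a minimal sketch; the function name classify is invented here, and it uses LC_ALL=C rather than LC_COLLATE=C because LC_ALL, when set, takes precedence over LC_COLLATE and LANG.

```shell
#!/bin/sh
# Hypothetical helper: report whether a string contains any ASCII lowercase
# letter, independently of the caller's locale settings.
classify() {
    # LC_ALL overrides LANG and every LC_* variable, so the range a-z is
    # interpreted in the C locale even if the caller exported LC_ALL itself.
    if printf '%s\n' "$1" | LC_ALL=C grep -q '^[^a-z]*$'; then
        echo 'no ASCII lowercase'
    else
        echo 'has ASCII lowercase'
    fi
}

classify 'zs'      # prints: has ASCII lowercase
classify 'ZS 42'   # prints: no ASCII lowercase
```

One trade-off to keep in mind: LC_ALL=C also changes LC_CTYPE, so multi-byte characters such as á are then seen as individual bytes. Setting only LC_COLLATE=C avoids that, but it has no effect when LC_ALL is already set in the environment.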
An argument could be made that although 'sort' should use the customary collation order, regexp matching should not. The strongest counterargument I know of is that it is hard to find a different rule that would be useful for regular expressions in, e.g., Hebrew.

. matches a character
---------------------

Andras wrote:

> % echo zs | LANG=hu_HU.UTF-8 grep ^[a-z]*$
> zs
> % echo azsa | LANG=hu_HU.UTF-8 grep ^a.a$
> % echo azsa | LANG=hu_HU.UTF-8 grep ^a[^a-z]a$
> azsa

POSIX is unambiguous about this, too: . matches a single character, not a collating element. I assume this is mostly for speed.

If you want to match an arbitrary collating element, it is not obvious to me how to. [[:print:]] would capture the most important ones.

A related collating element bug
-------------------------------

There is some other ugliness: any single-byte character like [.e.] works
Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
On Tue, Feb 23, 2010 at 10:29:25PM -0600, Jonathan Nieder wrote:

> > 2. zs is the last letter of the Hungarian alphabet; therefore, no sane character range in a regular expression can include it ([a-zs] would be ambiguous because there isn't a zs glyph).
>
> Would [a-[.zs.]] work?

No, because apparently [.zs.] isn't a valid collating element:

    % echo azsa | LANG=hu_HU.UTF-8 grep ^a[a-[.zs.]]a$
    grep: Invalid collation character

> See http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

That was helpful, thanks - I didn't know about collating elements in REs.

> Lots of the behavior of regular expressions in non-C locales is counterintuitive, so it might be helpful to point out if each example violates some rule of the standard or only common sense (both are important, of course).

Uh, that standard is too dense for me; I'll pass on that and can only vouch for common sense.

Andras

-- 
Andras Korn korn at elan.rulez.org - http://chardonnay.math.bme.hu/~korn/
My new year's resolution is 1920x1080.
Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
On Wed, Feb 24, 2010 at 10:13:09AM +0100, Andras Korn wrote:

> No, because apparently [.zs.] isn't a valid collating element:

Should it be?
Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
Clint Adams wrote:

> On Wed, Feb 24, 2010 at 10:13:09AM +0100, Andras Korn wrote:
>
> > No, because apparently [.zs.] isn't a valid collating element:
>
> Should it be?

Yes, I think so: it comes after z in alphabetical order. See http://lists.mysql.com/mysql/204718 for example.

glibc thinks so too, AFAICT. From localedata/locales/hu_HU:

    collating-symbol <zs>
    collating-element <Z-S> from "<U005A><U0053>"
    collating-element <Z-s> from "<U005A><U0073>"
    collating-element <z-S> from "<U007A><U0053>"
    collating-element <z-s> from "<U007A><U0073>"
    collating-element <Z-Z-S> from "<U005A><U005A><U0053>"
    collating-element <Z-Z-s> from "<U005A><U005A><U0073>"
    collating-element <Z-z-S> from "<U005A><U007A><U0053>"
    collating-element <Z-z-s> from "<U005A><U007A><U0073>"
    collating-element <z-Z-S> from "<U007A><U005A><U0053>"
    collating-element <z-Z-s> from "<U007A><U005A><U0073>"
    collating-element <z-z-S> from "<U007A><U007A><U0053>"
    collating-element <z-z-s> from "<U007A><U007A><U0073>"

Anyway, I decided to try a collating element from another language. ch is a single letter for collation in Welsh.

    $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/./MATCHED/'
    MATCHEDh and more
    $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[^a]/MATCHED/'
    MATCHED and more
    $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.ch.]]/MATCHED/'
    sed: -e expression #1, char 21: Invalid collation character

Odd, no?
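To see which names a locale source assigns to its multi-character collating elements, the collating-element lines can be pulled out with awk. The sketch below assumes the usual glibc locale-source syntax, in which names are written in angle brackets, and it runs on a small sample file so that it is self-contained. The function name collating_element_names is made up, and /usr/share/i18n/locales/hu_HU in the last comment is an assumed install path that may differ between systems.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every collating-element declared in
# a locale source file, with any surrounding angle brackets removed.
collating_element_names() {
    LC_ALL=C awk '$1 == "collating-element" { gsub(/[<>]/, "", $2); print $2 }' "$1"
}

# Self-contained demonstration on a sample that mimics three hu_HU lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
collating-symbol <zs>
collating-element <z-s> from "<U007A><U0073>"
collating-element <Z-s> from "<U005A><U0073>"
EOF
collating_element_names "$sample"   # prints z-s and Z-s, one per line
rm -f "$sample"

# Against an installed locale source (path is an assumption):
#   collating_element_names /usr/share/i18n/locales/hu_HU
```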
Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
On Mon, Feb 22, 2010 at 11:07:21AM +0100, Andras Korn wrote:

> 1. grep has no way of knowing whether a zs sequence is a single letter or two letters, because the combination can occur in compound words without becoming a zs letter; for example, in fúvószenekar (fúvós + zenekar), it's simply an s and a z letter next to each other. There may even exist words that make (a different) sense either way, but I can't think of any right now.

Uh, sorry, wrong example (sz instead of zs). Some examples for zs are község, egészség (especially interesting because it contains an sz followed by an s, not an s followed by a zs), gazság etc.

> 2. zs is the last letter of the Hungarian alphabet; therefore, no sane character range in a regular expression can include it ([a-zs] would be ambiguous because there isn't a zs glyph).

It actually gets even more confusing, because grep's behaviour is inconsistent:

    % echo zs | LANG=hu_HU.UTF-8 grep ^[a-z]*$
    zs
    % echo azsa | LANG=hu_HU.UTF-8 grep ^a.a$
    % echo azsa | LANG=hu_HU.UTF-8 grep ^a[^a-z]a$
    azsa

So is zs a member of the [a-z] class or not? The first attempt matches z and s individually, because zs doesn't match . (as shown in the second example). However, in the last example, zs matches [^a-z], which is also only supposed to match a single character.

The problem also affects sed(1) similarly:

    % echo azsa | LANG=hu_HU.UTF-8 sed -n /^a[^a-z]a$/p
    azsa

Therefore, I believe this is a bug in locales, not grep.

Andras

-- 
Andras Korn korn at elan.rulez.org - http://chardonnay.math.bme.hu/~korn/
Never say 'OOPS!' Always say 'Ah, Interesting!'
Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
Hi,

I have no clue about the rest of these, but

Andras Korn wrote:

> On Mon, Feb 22, 2010 at 11:07:21AM +0100, Andras Korn wrote:
>
> > 2. zs is the last letter of the Hungarian alphabet; therefore, no sane character range in a regular expression can include it ([a-zs] would be ambiguous because there isn't a zs glyph).

Would [a-[.zs.]] work? See http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

Lots of the behavior of regular expressions in non-C locales is counterintuitive, so it might be helpful to point out if each example violates some rule of the standard or only common sense (both are important, of course).

> The problem also affects sed(1) similarly:
>
> % echo azsa | LANG=hu_HU.UTF-8 sed -n /^a[^a-z]a$/p
> azsa

sed uses re_compile_pattern() and so on from glibc (same maintainers as locales). I don’t know if grep does also.

Hope that helps,
Jonathan
Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences
Package: locales
Version: 2.10.2-6
Severity: normal

Hi,

in Hungarian, zs (as well as sz, cs, ty, dz, dzs, gy and ly) are said to be part of the alphabet and each combination is considered to be a single letter; however, they are represented by two or more characters; there aren't single glyphs for them.

zs in particular is causing trouble for grep:

    % echo zs | LANG=C grep '^[^a-z]*$'
    % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
    zs

It's possible to come up with expressions that lead to similarly unexpected results for the other multi-char letters as well, but these don't occur frequently:

    % echo ty | LANG=C grep '^[s-u]*$'
    % echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
    ty

This is undesirable and dumb, for several reasons:

1. grep has no way of knowing whether a zs sequence is a single letter or two letters, because the combination can occur in compound words without becoming a zs letter; for example, in fúvószenekar (fúvós + zenekar), it's simply an s and a z letter next to each other. There may even exist words that make (a different) sense either way, but I can't think of any right now.

2. zs is the last letter of the Hungarian alphabet; therefore, no sane character range in a regular expression can include it ([a-zs] would be ambiguous because there isn't a zs glyph).

zs and the other multi-char letters play an important role in sorting (zs has to be sorted after za and so on), but please can we treat them as two characters in all other contexts?

I can also make a socio-ergonomic point: I think most people who deal with regular expressions don't expect Hungarian multi-character letters to be treated as single characters in regular expressions, whether they are Hungarian or not.
Andras

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32.7-vs2.3.0.36.28-hellgate (SMP w/3 CPU cores; PREEMPT)
Locale: LANG=C, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages locales depends on:
ii  debconf [debconf-2.0]  1.5.28    Debian configuration management sy
ii  libc6 [glibc-2.10-1]   2.10.2-2  GNU C Library: Shared libraries

locales recommends no packages.

locales suggests no packages.

-- debconf information:
* locales/default_environment_locale: None
* locales/locales_to_be_generated: en_GB ISO-8859-1, en_GB.ISO-8859-15 ISO-8859-15, en_GB.UTF-8 UTF-8, en_US ISO-8859-1, en_US.ISO-8859-15 ISO-8859-15, en_US.UTF-8 UTF-8, hu_HU ISO-8859-2, hu_HU.UTF-8 UTF-8

-- 
Andras Korn korn at elan.rulez.org - http://chardonnay.math.bme.hu/~korn/
A stitch in time would have confused Einstein.