Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2014-04-09 Thread Andras Korn
On Fri, Feb 26, 2010 at 12:38:08AM -0600, Jonathan Nieder wrote:

 Computers are dumb
 ------------------
 
 Andras wrote:
 
  1. grep has no way of knowing whether a zs sequence is a single letter
  or two letters, because the combination can occur in compound words without
  becoming a zs letter; for example, in fúvószenekar (fúvós +
  zenekar), it's simply an s and a z letter next to each other. There
  may even exist words that make (a different) sense either way, but I can't
  think of any right now.
 
 Are there simple heuristics that would make this condition easy to
 discover?  For example, vowels that would never appear before a true
 sz letter, things like that?  I am just curious; please feel free to
 e-mail me privately about this.
 
 This sounds like a (hard to fix) bug in the collation algorithm, but
 not a reason not to make 'sort' follow the conventions of the language.

Sorting is actually also tricky with dumb computers, because there is no way
for sort to know whether e.g. nyolcszáz contains a cs collating symbol
followed by z or a c followed by an sz collating symbol (the latter is
in fact the case).

cs+z would be sorted after cz (because cs comes after c), but c+sz
would be sorted _before_ cz because sz precedes z.

I'd say this is unfixable. There is no way, short of understanding the
natural language, for a program to determine whether two (or three)
characters represent a single collating symbol or themselves.
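To make the ambiguity concrete, here is a minimal sketch of what a purely mechanical, left-to-right scan does with the two words mentioned in this thread. The helper name mark_digraphs is hypothetical, the letter list is the one from the original report, and this is only an illustration, not how glibc actually collates.

```shell
# Hypothetical helper: bracket every substring that *could* be one of the
# multi-character letters listed in the original report, scanning left to
# right. LC_ALL=C makes sed work on plain bytes, so no locale data is used.
mark_digraphs() {
    LC_ALL=C sed -E 's/(dzs|cs|dz|gy|ly|sz|ty|zs)/[\1]/g'
}

# nyolcszáz is really c + sz; fúvószenekar is really s + z (fúvós + zenekar).
marked_nyolc=$(printf '%s\n' 'nyolcszáz' | mark_digraphs)
marked_fuvos=$(printf '%s\n' 'fúvószenekar' | mark_digraphs)

printf '%s\n' "$marked_nyolc" "$marked_fuvos"
```

The scan marks cs in the first word and sz in the second, which is the wrong segmentation in both cases; nothing short of a dictionary tells the program otherwise.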

Clearly, the Hungarian language must be fixed, either by introducing
separate glyphs for the composite letters, or by no longer insisting that
something represented by more than one character is a single letter. :)

Andras

-- 
 If Chuck Norris had been Spartan, the movie would simply have been called 1.


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2010-02-25 Thread Jonathan Nieder
Hi again,

Odd names for collating elements
--------------------------------

I wrote:

  $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.ch.]]/MATCHED/'
  sed: -e expression #1, char 21: Invalid collation character
 
 Odd, no?

It did seem odd, especially since the POSIX documentation uses
examples like this all the time (usually [.ch.] from pre-1994
Spanish).  For example [1]:

 collating-element <ch> from "<c><h>"
 collating-element <e-acute> from "<acute><e>"
 collating-element <ll> from "ll"

I was missing something obvious: in GNU locales, the collating element
has a hyphenated name.

  collating-symbol  <zs>
  collating-element <z-s> from "<U007A><U0073>"

 $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.c-h.]]/MATCHED/'
 MATCHED and more

So there’s the workaround.  I think this is a real bug: POSIX.1-2008
says [2]:

A collating symbol is a collating element enclosed within
bracket-period ( [. and .] ) delimiters. Collating
elements are defined as described in Collation Order.
Conforming applications shall represent multi-character
collating elements as collating symbols when it is
necessary to distinguish them from a list of the
individual characters that make up the multi-character
collating element. For example, if the string ch is a
collating element defined using the line:

collating-element <ch-digraph> from "<c><h>"

in the locale definition, the expression [[.ch.]] shall
be treated as an RE containing the collating symbol 'ch',
while [ch] shall be treated as an RE matching 'c' or
'h'. Collating symbols are recognized only inside
bracket expressions. If the string is not a collating
element in the current locale, the expression is invalid.

In other words, in the “collating-element <z-s> from "<U007A><U0073>"”
line, it is not the <z-s> that names the collating symbol in
regexps; it is the string zs itself.

This makes sense, since otherwise how could anyone write portable
regular expressions?

Writing [:alpha:] in Hungarian
------------------------------

Andras wrote:

 zs in particular is causing trouble for grep:

 % echo zs | LANG=C grep '^[^a-z]*$'
 % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
 zs

Any program using such constructions without LC_COLLATE=C or similar
is IMHO buggy because of exactly this problem.  With some C libraries
(though not current glibc, luckily), in English, [^a-z] matches A but
not Z or vice versa [3].  (Current POSIX leaves the behavior
unspecified.)

The [. .] notation seems to work here:

 % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-[.z-s.]]*$'
 %

Once the regexp engine is fixed, that regexp would become
'^[^a-[.zs.]]*$'.

Bracket expressions match collating elements
--------------------------------------------

Andras wrote:

 % echo ty | LANG=C grep '^[s-u]*$'
 % echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
 ty

POSIX is unambiguous about this: bracket expressions match collating
elements, not characters.

I can imagine situations where this would be helpful and situations
where it would be unhelpful.  Mostly, it just seems difficult to do
any other way, since otherwise what would the ranges mean?  The
simplest workaround is to use LC_COLLATE=C (or en_US.UTF-8, or C.UTF-8
once glibc learns that, or whatever locale has the behavior you want).
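As a minimal sketch of that workaround, the snippet below pins a single grep invocation to the C locale. It uses LC_ALL rather than LC_COLLATE, on the assumption that the caller may have LC_ALL set already (LC_ALL overrides LANG and every LC_* category); the sample input lines are made up for illustration.

```shell
# Force byte-wise ranges for this one command, whatever LANG says.
# In the C locale [s-u] covers exactly s, t and u, so "ty" is rejected
# (y is outside the range) while "st" is accepted.
# "|| true" only guards against grep's non-zero exit status on no match.
out=$(printf 'ty\nst\n' | LC_ALL=C grep '^[s-u]*$' || true)
printf '%s\n' "$out"
```

Setting the variable on the command line of the one affected program keeps the rest of the session, including sort order and messages, in the user's own locale.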

Computers are dumb
------------------

Andras wrote:

 1. grep has no way of knowing whether a zs sequence is a single letter
 or two letters, because the combination can occur in compound words without
 becoming a zs letter; for example, in fúvószenekar (fúvós +
 zenekar), it's simply an s and a z letter next to each other. There
 may even exist words that make (a different) sense either way, but I can't
 think of any right now.

Are there simple heuristics that would make this condition easy to
discover?  For example, vowels that would never appear before a true
sz letter, things like that?  I am just curious; please feel free to
e-mail me privately about this.

This sounds like a (hard to fix) bug in the collation algorithm, but
not a reason not to make 'sort' follow the conventions of the language.

An argument could be made that although 'sort' should use the
customary collation order, regexp matching should not.  The strongest
counterargument I know of is that it is hard to find a different rule
that would be useful for regular expressions in, e.g., Hebrew.

. matches a character
---------------------

Andras wrote:

 % echo zs | LANG=hu_HU.UTF-8 grep ^[a-z]*$
 zs
 % echo azsa | LANG=hu_HU.UTF-8 grep ^a.a$
 % echo azsa | LANG=hu_HU.UTF-8 grep ^a[^a-z]a$
 azsa

POSIX is unambiguous about this, too: . matches a single character,
not a collating element.

I assume this is mostly for speed.  If you want to match an arbitrary
collating element, it is not obvious to me how to.  [[:print:]] would
capture the most important ones.
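A small sketch of the "one dot, one character" rule, pinned to the C locale so that it does not depend on which locales are installed. The variable names are made up for illustration. The thread shows the one-dot pattern failing under hu_HU.UTF-8 as well; that the two-dot pattern would also match there is an expectation based on the POSIX rule above, not something tested here.

```shell
# Count matching lines; "|| true" guards against grep's exit status 1
# when the count is zero.
dot_one=$(printf 'azsa\n' | LC_ALL=C grep -c '^a.a$' || true)
dot_two=$(printf 'azsa\n' | LC_ALL=C grep -c '^a..a$' || true)

# One dot cannot span both z and s; two dots can.
printf 'one dot: %s, two dots: %s\n' "$dot_one" "$dot_two"
```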

A related collating element bug
-------------------------------

There is some other ugliness: any single-byte character like [.e.]
works 

Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2010-02-24 Thread Andras Korn
On Tue, Feb 23, 2010 at 10:29:25PM -0600, Jonathan Nieder wrote:

  2. zs is the last letter of the Hungarian alphabet; therefore, no sane
  character range in a regular expression can include it ([a-zs] would be
  ambiguous because there isn't a zs glyph).
 
 Would [a-[.zs.]] work?

No, because apparently [.zs.] isn't a valid collating element:

% echo azsa | LANG=hu_HU.UTF-8 grep ^a[a-[.zs.]]a$
grep: Invalid collation character

 See
 http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

That was helpful, thanks - I didn't know about collating elements in REs.

 Lots of the behavior of regular expressions in non-C locales is
 counterintuitive, so it might be helpful to point out if each example
 violates some rule of the standard or only common sense (both are
 important, of course).

Uh, that standard is too dense for me; I'll pass on that and can only vouch
for common sense.

Andras

-- 
 Andras Korn korn at elan.rulez.org - http://chardonnay.math.bme.hu/~korn/
My new year's resolution is 1920x1080.






Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2010-02-24 Thread Clint Adams
On Wed, Feb 24, 2010 at 10:13:09AM +0100, Andras Korn wrote:
 No, because apparently [.zs.] isn't a valid collating element:

Should it be?






Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2010-02-24 Thread Jonathan Nieder
Clint Adams wrote:
 On Wed, Feb 24, 2010 at 10:13:09AM +0100, Andras Korn wrote:
 No, because apparently [.zs.] isn't a valid collating element:

 Should it be?

Yes, I think so: it comes after z in alphabetical order.  See
http://lists.mysql.com/mysql/204718 for example.

glibc thinks so too, AFAICT.  From localedata/locales/hu_HU:

 collating-symbol  <zs>
 collating-element <Z-S> from "<U005A><U0053>"
 collating-element <Z-s> from "<U005A><U0073>"
 collating-element <z-S> from "<U007A><U0053>"
 collating-element <z-s> from "<U007A><U0073>"
 collating-element <Z-Z-S> from "<U005A><U005A><U0053>"
 collating-element <Z-Z-s> from "<U005A><U005A><U0073>"
 collating-element <Z-z-S> from "<U005A><U007A><U0053>"
 collating-element <Z-z-s> from "<U005A><U007A><U0073>"
 collating-element <z-Z-S> from "<U007A><U005A><U0053>"
 collating-element <z-Z-s> from "<U007A><U005A><U0073>"
 collating-element <z-z-S> from "<U007A><U007A><U0053>"
 collating-element <z-z-s> from "<U007A><U007A><U0073>"

Anyway, I decided to try a collating element from another language.
ch is a single letter for collation in Welsh.

 $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/./MATCHED/'
 MATCHEDh and more
 $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[^a]/MATCHED/'
 MATCHED and more
 $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.ch.]]/MATCHED/'
 sed: -e expression #1, char 21: Invalid collation character

Odd, no?






Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2010-02-23 Thread Andras Korn
On Mon, Feb 22, 2010 at 11:07:21AM +0100, Andras Korn wrote:

 1. grep has no way of knowing whether a zs sequence is a single letter
 or two letters, because the combination can occur in compound words without
 becoming a zs letter; for example, in fúvószenekar (fúvós +
 zenekar), it's simply an s and a z letter next to each other. There
 may even exist words that make (a different) sense either way, but I can't
 think of any right now.

Uh, sorry, wrong example (sz instead of zs). Some examples for zs are
község, egészség (especially interesting because it contains an sz
followed by an s, not an s followed by a zs), gazság etc.

 2. zs is the last letter of the Hungarian alphabet; therefore, no sane
 character range in a regular expression can include it ([a-zs] would be
 ambiguous because there isn't a zs glyph).

It actually gets even more confusing, because grep's behaviour is
inconsistent:

% echo zs | LANG=hu_HU.UTF-8 grep ^[a-z]*$
zs
% echo azsa | LANG=hu_HU.UTF-8 grep ^a.a$
% echo azsa | LANG=hu_HU.UTF-8 grep ^a[^a-z]a$
azsa

So is zs a member of the [a-z] class or not? The first attempt matches z
and s individually, because zs doesn't match . (as shown in the second
example). However, in the last example, zs matches [^a-z], which is also
only supposed to match a single character.

The problem also affects sed(1) similarly:

% echo azsa | LANG=hu_HU.UTF-8 sed -n /^a[^a-z]a$/p
azsa

Therefore, I believe this is a bug in locales, not grep.

Andras

-- 
 Andras Korn korn at elan.rulez.org - http://chardonnay.math.bme.hu/~korn/
Never say 'OOPS!' Always say 'Ah, Interesting!'






Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2010-02-23 Thread Jonathan Nieder
Hi,

I have no clue about the rest of these, but

Andras Korn wrote:
 On Mon, Feb 22, 2010 at 11:07:21AM +0100, Andras Korn wrote:

 2. zs is the last letter of the Hungarian alphabet; therefore, no sane
 character range in a regular expression can include it ([a-zs] would be
 ambiguous because there isn't a zs glyph).

Would [a-[.zs.]] work?

See
http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05

Lots of the behavior of regular expressions in non-C locales is
counterintuitive, so it might be helpful to point out if each example
violates some rule of the standard or only common sense (both are
important, of course).

 The problem also affects sed(1) similarly:
 
 % echo azsa | LANG=hu_HU.UTF-8 sed -n /^a[^a-z]a$/p
 azsa

sed uses re_compile_pattern() and so on from glibc (same maintainers
as locales).  I don’t know if grep does also.

Hope that helps,
Jonathan






Bug#570929: Hungarian locale: zs is treated as a single letter, with undesirable consequences

2010-02-22 Thread Andras Korn
Package: locales
Version: 2.10.2-6
Severity: normal

Hi,

in Hungarian, zs (as well as sz, cs, ty, dz, dzs, gy and ly)
are said to be part of the alphabet and each combination is considered to be
a single letter; however, they are represented by two or more characters;
there aren't single glyphs for them.

zs in particular is causing trouble for grep:

% echo zs | LANG=C grep '^[^a-z]*$'
% echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
zs

It's possible to come up with expressions that lead to similarly unexpected
results for the other multi-char letters as well, but these don't occur
frequently:

% echo ty | LANG=C grep '^[s-u]*$'
% echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
ty

This is undesirable and dumb, for several reasons:

1. grep has no way of knowing whether a zs sequence is a single letter
or two letters, because the combination can occur in compound words without
becoming a zs letter; for example, in fúvószenekar (fúvós +
zenekar), it's simply an s and a z letter next to each other. There
may even exist words that make (a different) sense either way, but I can't
think of any right now.

2. zs is the last letter of the Hungarian alphabet; therefore, no sane
character range in a regular expression can include it ([a-zs] would be
ambiguous because there isn't a zs glyph).

zs and the other multi-char letters play an important role in sorting
(zs has to be sorted after za and so on), but please can we treat them
as two characters in all other contexts?

I can also make a socio-ergonomic point: I think most people who deal with
regular expressions don't expect Hungarian multi-character letters to be
treated as single characters in regular expressions, whether they are
Hungarian or not.

Andras

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32.7-vs2.3.0.36.28-hellgate (SMP w/3 CPU cores; PREEMPT)
Locale: LANG=C, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages locales depends on:
ii  debconf [debconf-2.0] 1.5.28 Debian configuration management sy
ii  libc6 [glibc-2.10-1]  2.10.2-2   GNU C Library: Shared libraries

locales recommends no packages.

locales suggests no packages.

-- debconf information:
* locales/default_environment_locale: None
* locales/locales_to_be_generated: en_GB ISO-8859-1, en_GB.ISO-8859-15 
ISO-8859-15, en_GB.UTF-8 UTF-8, en_US ISO-8859-1, en_US.ISO-8859-15 
ISO-8859-15, en_US.UTF-8 UTF-8, hu_HU ISO-8859-2, hu_HU.UTF-8 UTF-8

-- 
 Andras Korn korn at elan.rulez.org - http://chardonnay.math.bme.hu/~korn/
A stitch in time would have confused Einstein.


