Hungarian collation

Peter Gulutzan Tue, 17 Oct 2006 09:15:56 -0700

Hi,

MySQL is looking for an authoritative, official statement
which states all the current Hungarian collation rules.
Please let other MySQL-using Hungarians (especially if you
know a user group in Hungary) know about these
questions. Best of all would be a translation of the
Hungarian government standard, if there is one.


MySQL has received several complaints/suggestions about
Hungarian collation. For example these three people
contacted us via a public MySQL mailing list or bugs forum:
RITZINGER Peter (http://bugs.mysql.com/bug.php?id=12519)
BÁRTHÁZI András (http://lists.mysql.com/mysql/191427)
Csongor Fagyal (http://bugs.mysql.com/bug.php?id=22337)
In what follows I will refer to what seems to be agreed,
and what seems to be disputed.

The current latin2_hungarian_ci collation
is a chart in sql/share/charsets/latin2.xml,
and Mr Barkov has provided an easy-to-read web page: 
http://myoffice.izhnet.ru/bar/~bar/charts/latin2_hungarian_ci.html
This collation is unlike the Hungarian dictionaries,
collation descriptions, or products that we've seen.
For example the first letter is:
Latin Capital Letter A
= Latin Small Letter A
= control Single Shift 3
= No-Break Space
= Latin Small Letter L with caron
= Latin Capital Letter A with acute
= Latin Small Letter A with acute
But there is no reason that small L with caron
(which is Slovak not Hungarian) ever sorts with
A, there's some dispute whether A with acute
should sort with A, and all other
accented variants of A should be in this list too.
It is likely that MySQL will deprecate this
collation (which implies that MySQL will eventually
remove it), after introducing a new and more correct
one.

Most people agree that this is the Hungarian alphabet;
a á b c cs d dz dzs e é f g gy h i í j k l ly m n
ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs

(The DOUBLE ACUTE letters ő and ű are sometimes shown
as õ and û but I suspect that is a conversion error.)

Some people also say there's a secondary sort
rule for these short/long vowel pairs:
a á, e é, i í, o ó, ö ő, u ú, ü ű
For these pairs, long = short usually, but long > short
if all else is equal.
I have seen comments showing that Oracle seems to follow this rule:
'BÁ'>'BA' is true
'BÁ'>'BAC' is false
but the commenter, though Hungarian, didn't like what Oracle did.
(thread 'nlssort' on comp.databases.oracle.server 2002-11-10)
One commenter wrote to us about a similar thing, saying
it's a mistake that SELECT 'hal' LIKE 'hál' is true.
Unfortunately, the same person also disagrees, saying
that we should have two collations, one where
long > short, one where long = short.
I have also seen Simonsen's rules:
http://std.dkuug.dk/i18n/locales/hu_HU
They suggest that A-acute > A, etc.
I have also seen argument about the same thing for glibc:
http://sources.redhat.com/ml/libc-locales/2005-q4/msg00002.html

Apparently all Hungarians agree that these digraphs are "letters":
cs dz dzs gy ly ny sz ty zs
That's bad but not very bad. MySQL handles digraphs in Spanish.
There is also one trigraph:
dzs
That's very bad. Luckily dzs is rare, it's mostly for
English words with a "j" sound (bridge is 'briddz',
gin is 'dzsinn') (so I'm told).

There is a special rule when you see the first part of a
digraph followed by the digraph. For example, in 'ggy',
'g' is the first part of 'gy' and it's followed by 'gy'
... and MySQL treats it as a repetition of the digraph, i.e.
as if it's 'gygy'. This applies to all the letters listed
in the previous paragraph, so:
ccs = cscs, ddz = dzdz, ddsz = dzsdzs, ggy = gygy,
lly = lyly, nny = nyny, ssz = szsz, tty = tyty, zzs = zszs.
For example, Mr Ritzinger says that
'tty > tz' because tty is expanded to tyty.
I know that other products handle the situation, but I've
seen them called "double compressions", which worries me --
do some people think that 'cscs sorts with ccs' rather than
'ccs sorts with cscs'?

A collation which follows the single-character rules, but
ignores digraphs and trigraphs, sounds somewhat like what I
see in Kaplan's remarks on Microsoft's Hungarian Technical Sort:
http://blogs.msdn.com/michkap/archive/2005/11/26/495072.aspx
One of the above-listed people would accept this, he says he doesn't
care about digraphs or trigraphs. But I have no idea whether
Microsoft was following some "technical" standard.

All characters outside the Hungarian alphabet should be done
according to UCA 4.0.0 (until MySQL switches to the newer UCA).

For Unicode support, I suggest names for the new collations
should be: ucs2_hungarian2_ci, utf8_hungarian2_ci. The only
other character sets that may have Hungarian collations are
latin2 and cp1250.

Our concern at this time is only for the "primary sort", the
collation necessary for searches. The "secondary sort" or
"tertiary sort" rules, the ones that affect only ORDER BY,
are of interest but will only be of importance in the future.

-- 
Peter Gulutzan, Senior Software Architect
MySQL AB, www.mysql.com
Office: +1 780 472-6838
Mobile: +1 780 904-0297
VoIP:   +1 408 213-6654



-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Hungarian collation

Reply via email to