Hi,

MySQL is looking for an authoritative, official statement of all the current Hungarian collation rules. Best of all would be a translation of the Hungarian government standard, if there is one. Please let other MySQL-using Hungarians know about these questions, especially if you know of a user group in Hungary.

MySQL has received several complaints/suggestions about Hungarian collation. For example, these three people contacted us via a public MySQL mailing list or bugs forum:

  RITZINGER Peter (http://bugs.mysql.com/bug.php?id=12519)
  BÁRTHÁZI András (http://lists.mysql.com/mysql/191427)
  Csongor Fagyal (http://bugs.mysql.com/bug.php?id=22337)

In what follows I will refer to what seems to be agreed, and what seems to be disputed.

The current latin2_hungarian_ci collation is a chart in sql/share/charsets/latin2.xml, and Mr Barkov has provided an easy-to-read web page:
http://myoffice.izhnet.ru/bar/~bar/charts/latin2_hungarian_ci.html

This collation is unlike the Hungarian dictionaries, collation descriptions, or products that we've seen. For example, the first letter is:

  Latin Capital Letter A
  = Latin Small Letter A
  = control Single Shift 3
  = No-Break Space
  = Latin Small Letter L with caron
  = Latin Capital Letter A with acute
  = Latin Small Letter A with acute

But there is no reason that small L with caron (which is Slovak, not Hungarian) should ever sort with A, there's some dispute whether A with acute should sort with A, and all other accented variants of A should be in this list too. It is likely that MySQL will deprecate this collation (which implies that MySQL will eventually remove it), after introducing a new and more correct one.

Most people agree that this is the Hungarian alphabet:

  a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs

(The DOUBLE ACUTE letters ő and ű are sometimes shown as õ and û but I suspect that is a conversion error.)

Some people also say there's a secondary sort rule for these short/long vowel pairs:

  a á, e é, i í, o ó, ö ő, u ú, ü ű

For these pairs, long = short usually, but long > short if all else is equal. I have seen comments showing that Oracle seems to follow this rule:

  'BÁ'>'BA' is true
  'BÁ'>'BAC' is false

but the commenter, though Hungarian, didn't like what Oracle did.
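
To make the "long = short usually, but long > short if all else is equal" rule concrete, here is a minimal Python sketch. It is not MySQL or Oracle code; the names LONG_TO_SHORT, BASE_ORDER and sort_key are made up for illustration, and the sketch assumes single letters only (digraphs are ignored) and treats upper and lower case as equal.

```python
# Hypothetical sketch of a two-level comparison for the vowel pairs above.
# Long vowel -> short counterpart (the seven pairs listed in the text).
LONG_TO_SHORT = {"á": "a", "é": "e", "í": "i", "ó": "o",
                 "ő": "ö", "ú": "u", "ű": "ü"}

# Assumed single-letter order after folding long vowels to short ones.
BASE_ORDER = "abcdefghijklmnoöpqrstuüvwxyz"
PRIMARY = {ch: i for i, ch in enumerate(BASE_ORDER)}


def sort_key(text):
    """Return (primary, secondary) weights for a case-insensitive compare."""
    primary = []
    secondary = []
    for ch in text.lower():
        base = LONG_TO_SHORT.get(ch, ch)
        # Characters outside the table sort after all letters, by code point.
        primary.append(PRIMARY.get(base, len(BASE_ORDER) + ord(base)))
        # The secondary weight only records vowel length.
        secondary.append(1 if ch in LONG_TO_SHORT else 0)
    # Tuples compare element by element, so the secondary weights are
    # consulted only when the primary weights are identical.
    return (tuple(primary), tuple(secondary))


print(sort_key("BÁ") > sort_key("BA"))    # True
print(sort_key("BÁ") > sort_key("BAC"))   # False
```

Under this scheme a search that compares only the first element of the key would treat 'hal' and 'hál' as equal, while an ORDER BY using the whole key would still place 'hal' first.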
(The Oracle comments are in the thread 'nlssort' on comp.databases.oracle.server, 2002-11-10.)

One commenter wrote to us about a similar thing, saying it's a mistake that SELECT 'hal' LIKE 'hál' is true. Unfortunately, the same person also disagrees, saying that we should have two collations, one where long > short, one where long = short.

I have also seen Simonsen's rules:
http://std.dkuug.dk/i18n/locales/hu_HU
They suggest that A-acute > A, etc. I have also seen argument about the same thing for glibc:
http://sources.redhat.com/ml/libc-locales/2005-q4/msg00002.html

Apparently all Hungarians agree that these digraphs are "letters":

  cs dz dzs gy ly ny sz ty zs

That's bad but not very bad. MySQL handles digraphs in Spanish. There is also one trigraph:

  dzs

That's very bad. Luckily dzs is rare; it's mostly for English words with a "j" sound (bridge is 'bridzs', gin is 'dzsinn') (so I'm told).

There is a special rule when you see the first part of a digraph followed by the digraph. For example, in 'ggy', 'g' is the first part of 'gy' and it's followed by 'gy' ... and MySQL treats it as a repetition of the digraph, i.e. as if it's 'gygy'. This applies to all the letters listed in the previous paragraph, so:

  ccs = cscs, ddz = dzdz, ddzs = dzsdzs, ggy = gygy, lly = lyly, nny = nyny, ssz = szsz, tty = tyty, zzs = zszs.

For example, Mr Ritzinger says that 'tty > tz' because tty is expanded to tyty. I know that other products handle the situation, but I've seen them called "double compressions", which worries me -- do some people think that 'cscs sorts with ccs' rather than 'ccs sorts with cscs'?

A collation which follows the single-character rules, but ignores digraphs and trigraphs, sounds somewhat like what I see in Kaplan's remarks on Microsoft's Hungarian Technical Sort:
http://blogs.msdn.com/michkap/archive/2005/11/26/495072.aspx
One of the above-listed people would accept this; he says he doesn't care about digraphs or trigraphs.
I have no idea, though, whether Microsoft was following some "technical" standard.

All characters outside the Hungarian alphabet should be sorted according to UCA 4.0.0 (until MySQL switches to the newer UCA).

For Unicode support, I suggest names for the new collations should be: ucs2_hungarian2_ci, utf8_hungarian2_ci. The only other character sets that may have Hungarian collations are latin2 and cp1250.

Our concern at this time is only for the "primary sort", the collation necessary for searches. The "secondary sort" or "tertiary sort" rules, the ones that affect only ORDER BY, are of interest but will only be of importance in the future.

-- 
Peter Gulutzan, Senior Software Architect
MySQL AB, www.mysql.com
Office: +1 780 472-6838
Mobile: +1 780 904-0297
VoIP: +1 408 213-6654