Re: utf8 options under Mysql

2016-05-05 Thread Halász Sándor

2016/04/22 04:49 ... Jigal van Hemert:

It works for a lot of Western European languages very well, but in some
cases there are problems. For Asian languages there are a lot more
problems. For example, 'ß' isn't considered the same as 'ss'.

Well, the former is an sz-ligature and the latter is a digraph, which is
not the same as the ligature; and the language is German, which is not
Asian.

*personal polemic follows*
Germans could save themselves trouble, I believe, if they stopped
alternating between 'ss' and the sz-ligature and gave up the ligature in
favor of the digraph 'sz'. That won't help MySQL, though, because it, like
all such software, has to support all the German text already stored in it.


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql



Re: utf8 options under Mysql

2016-04-22 Thread Jigal van Hemert

Hi,

On 22/04/2016 04:50, Martin Mueller wrote:

MySQL has a bewildering variety of Unicode collation choices. Most of them are
language-specific, but what is the difference between "utf8_general_ci",
"utf8_unicode_ci", and "utf8_unicode_520_ci"? Do they differ in the range of
characters they can handle, or is it just a matter of the sort order? I
understand that utf8_bin is different because it is case sensitive, but the
other differences elude me.

Under what circumstances does it make a difference to use one or the other? I
work with a lot of Early Modern print data and the weird symbols of various
kinds it uses. I've had trouble at times with the "utf8_general_ci" setting,
but it may have been more a matter of settings in my front-end tool than of the
choice of this rather than a unicode collation.

Under character sets, there is just one utf8 setting. The simplest way to make
sense of the choices would be to say that, given a character set (utf8), the
collation only makes a difference to the sort but makes no difference to what
can be displayed. Is that correct?

A collation contains definitions for sorting order and comparison. For
most purposes one wants "crème brûlée" to be treated the same as
"creme brulee". For Unicode characters these rules can be complex. A
character set (in your case UTF-8) defines which characters can be stored.
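
As a rough illustration of that split between storage and comparison, here
is a small Python sketch. It is not MySQL's actual algorithm; the function
name ci_ai_key is made up for this example, and it simply assumes that
"case-insensitive and accent-insensitive" can be approximated by stripping
combining marks and folding case. The stored strings stay different; only
the key used for comparing and sorting treats them as equal.

```python
import unicodedata


def ci_ai_key(text: str) -> str:
    """Hypothetical comparison key that ignores case and accents.

    An approximation for illustration only, not MySQL's collation code.
    """
    # Split accented letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Fold case so that 'C' and 'c' compare equal.
    return without_marks.casefold()


a = "crème brûlée"
b = "Creme Brulee"

print(a == b)                        # False: the stored characters differ
print(ci_ai_key(a) == ci_ai_key(b))  # True: equal under the comparison key
print(sorted(["zebra", "Éclair", "apple"], key=ci_ai_key))
# ['apple', 'Éclair', 'zebra']
```

The same key serves both for equality tests and for sorting, which is why
the choice of collation affects WHERE and ORDER BY results but not which
characters a column can hold.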


utf8_general_ci uses a simplified version of those rules.
It works for a lot of Western European languages very well, but in some 
cases there are problems. For Asian languages there are a lot more 
problems. For example, 'ß' isn't considered the same as 'ss'.


utf8_unicode_ci has more complex rules and works fine for more
languages. Due to the more complex rule set it is a bit slower than
utf8_general_ci.
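
To make the 'ß' case concrete, here is a Python sketch of the two styles of
rule. It is an illustration under stated assumptions, not MySQL's code: the
names SIMPLE_WEIGHTS, simple_key and expanding_key are invented for this
example. As I understand the MySQL manual, utf8_general_ci compares
character by character and treats 'ß' like 's', while utf8_unicode_ci
allows one character to expand to several, so 'ß' compares equal to 'ss'.

```python
import unicodedata

# Hypothetical one-character-to-one-weight table, in the spirit of a
# "general" collation: every character maps to exactly one weight.
SIMPLE_WEIGHTS = {"ß": "s"}


def simple_key(text: str) -> str:
    """Per-character key: no expansions, so 'ß' can only become one letter."""
    out = []
    for ch in text:
        # Keep the base letter of an accented character, drop the marks.
        base = unicodedata.normalize("NFD", ch)[0]
        out.append(SIMPLE_WEIGHTS.get(base, base.lower()))
    return "".join(out)


def expanding_key(text: str) -> str:
    """Key that allows expansions: full case folding turns 'ß' into 'ss'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


print(simple_key("Straße") == simple_key("Strasse"))        # False
print(simple_key("Straße") == simple_key("Strase"))         # True
print(expanding_key("Straße") == expanding_key("Strasse"))  # True
```

The per-character version is cheaper because it never has to look at more
than one character at a time, which matches the speed difference mentioned
above.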


utf8_unicode_520_ci uses a newer version of the rule set that is used in
utf8_unicode_ci (Unicode Collation Algorithm 5.2.0 instead of 4.0.0).


Other utf8_* collations may contain specific rules for specific languages.

utf8_general_ci is the default collation for utf8 in MySQL. If you use
literal strings, MySQL may assume that these have the default collation,
and comparing them to columns with other collations or performing
operations such as casts may produce errors about an illegal mix of
collations.


--

Met vriendelijke groet,

Jigal van Hemert.





utf8 options under Mysql

2016-04-21 Thread Martin Mueller

MySQL has a bewildering variety of Unicode collation choices. Most of them are
language-specific, but what is the difference between "utf8_general_ci",
"utf8_unicode_ci", and "utf8_unicode_520_ci"? Do they differ in the range of
characters they can handle, or is it just a matter of the sort order? I
understand that utf8_bin is different because it is case sensitive, but the
other differences elude me.

Under what circumstances does it make a difference to use one or the other? I
work with a lot of Early Modern print data and the weird symbols of various
kinds it uses. I've had trouble at times with the "utf8_general_ci" setting,
but it may have been more a matter of settings in my front-end tool than of the
choice of this rather than a unicode collation.

Under character sets, there is just one utf8 setting. The simplest way to make
sense of the choices would be to say that, given a character set (utf8), the
collation only makes a difference to the sort but makes no difference to what
can be displayed. Is that correct?