The ANSI Latin 1 character set, which is equivalent to the first 256 characters of the Unicode character set, covers the following languages of Western Europe and the Americas: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.
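As a quick check of the claim above, here is a minimal Python sketch. It assumes Python's "latin-1" codec (ISO-8859-1) is what is meant by ANSI Latin 1; the sample characters and the helper name `latin1_size` are my own choices for illustration. It compares how many bytes a few characters need in Latin 1 and in UTF-8.

```python
# Python's "latin-1" codec is ISO-8859-1: byte value N decodes to code point N.
decoded = bytes(range(256)).decode("latin-1")
same_code_points = all(ord(ch) == n for n, ch in enumerate(decoded))

# Sample characters (picked for illustration), written as escapes so the
# snippet does not depend on the encoding of this message.
samples = {
    "A": "\u0041",             # Latin capital A (ASCII)
    "e-acute": "\u00e9",       # Latin 1, used in French
    "o-umlaut": "\u00f6",      # Latin 1, used in German and Swedish
    "alpha": "\u03b1",         # Greek
    "cyrillic-de": "\u0434",   # Cyrillic
    "hebrew-shin": "\u05e9",   # Hebrew
    "arabic-ain": "\u0639",    # Arabic
    "devanagari-a": "\u0905",  # Devanagari (Hindi)
    "cjk-zhong": "\u4e2d",     # Chinese
    "emoji": "\U0001F600",     # non-BMP character
}

def latin1_size(ch):
    """Return the Latin 1 size in bytes, or None if ch is not in Latin 1."""
    try:
        return len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        return None

latin1_sizes = {name: latin1_size(ch) for name, ch in samples.items()}
utf8_sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

print("Latin 1 bytes match code points 0-255:", same_code_points)
for name in samples:
    print(f"{name:13} latin-1: {latin1_sizes[name]!s:5} utf-8: {utf8_sizes[name]}")
```

Characters outside Latin 1 (Greek, for example) show `None` in the Latin 1 column, because that encoding cannot represent them at all.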
If you store strings the way Python 3 does, those will all be stored in 1 byte per character. In UTF-8, the characters in the range 128-255 take 2 bytes each. Greek, Arabic (most of it, at least), Hebrew and Cyrillic also take 2 bytes per character in UTF-8.

Since UTF-8 takes so much space when dealing with text from a large part of the world's languages (languages used by > 60% of the world population, by my estimates), I have in the past had to come up with packing schemes (designed for optimizing space, not ease of processing) to store Unicode text efficiently in a database, as other people have also done (see BOCU-1 & SCSU).

Scott

On Sunday, September 27, 2015 at 7:38:06 PM UTC-4, Daniel Carrera wrote:
>
> On 27 September 2015 at 23:41, Scott Jones <scott.pa...@gmail.com> wrote:
>
>> No. Most characters used in the countries I mentioned above can be
>> represented using just ANSI Latin1
>> (which is why I specified *Western Europe*), so UTF-8 will take 1 or 2
>> bytes for each character,
>> but when you are dealing with the Middle East, India, or Asia (with a lot
>> of the world's population!), UTF-8 takes 3 bytes per characters usually
>> (non-BMP characters are still not that common, except in Tweets!)
>
> This discussion is well over my head, but out of curiosity, is Greek
> included as Western Europe? I mainly just care about Greek characters, and
> possible German and Swedish.
>
> Daniel.