The ANSI Latin-1 character set, which is equivalent to the first 256 code 
points of the Unicode character set, supports the following languages 
(Western Europe and the Americas): Afrikaans, Basque, Catalan, Danish, 
Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, 
Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

If you store text like that in Python 3, it will all be stored at 1 byte 
per character. In UTF-8, the characters in the 128-255 range take 2 bytes 
each.
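
As a concrete check, here is a small Python 3 sketch (the exact 
sys.getsizeof figures depend on the CPython build, but the 1-byte 
internal storage and the 2-byte UTF-8 sequences show up clearly):

    import sys

    plain   = "Muller"   # pure ASCII
    umlauts = "Müller"   # 'ü' is code point 252, still within Latin-1

    # Same number of characters in both strings:
    print(len(plain), len(umlauts))                      # 6 6

    # Apart from a somewhat larger per-object header for the non-ASCII
    # string, CPython stores both at 1 byte per character:
    print(sys.getsizeof(plain), sys.getsizeof(umlauts))

    # UTF-8, however, needs 2 bytes for the 'ü':
    print(len(plain.encode("utf-8")),
          len(umlauts.encode("utf-8")))                  # 6 7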

Scripts like Greek, Arabic (most of it, at least), Hebrew, and Cyrillic 
take 2 bytes per character in UTF-8.
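
That is easy to verify with Python's UTF-8 encoder (a quick sketch):

    # Greek, Cyrillic and Hebrew letters all sit below U+0800, so UTF-8
    # encodes each of them in 2 bytes.
    for text in ("αβγδ", "абвг", "אבגד"):
        print(text, len(text), "chars ->",
              len(text.encode("utf-8")), "bytes")    # 4 chars -> 8 bytes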

Since UTF-8 takes so much space for text in a large part of the world's 
languages (languages used by more than 60% of the world's population, by my 
estimate), in the past I had to come up with packing schemes for 
efficiently storing Unicode text in a database, designed to optimize space 
rather than ease of processing. Other people have done the same; see BOCU-1 
and SCSU.
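
As a rough illustration of the size gap that motivated those schemes (just 
my own sketch using Python's built-in codecs, not BOCU-1 or SCSU 
themselves):

    greek = "Καλημέρα κόσμε"   # "good morning, world"
    hindi = "नमस्ते दुनिया"      # roughly "hello, world" in Devanagari

    # Greek fits a legacy single-byte codec (ISO 8859-7) but doubles in UTF-8:
    print(len(greek), "chars:", len(greek.encode("iso8859-7")),
          "bytes in ISO 8859-7 vs", len(greek.encode("utf-8")), "in UTF-8")

    # Devanagari needs 3 bytes per code point in UTF-8 vs 2 in UTF-16:
    print(len(hindi), "chars:", len(hindi.encode("utf-16-le")),
          "bytes in UTF-16 vs", len(hindi.encode("utf-8")), "in UTF-8")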

Scott

On Sunday, September 27, 2015 at 7:38:06 PM UTC-4, Daniel Carrera wrote:
>
>
>
> On 27 September 2015 at 23:41, Scott Jones <scott.pa...@gmail.com> wrote:
>
>> No.  Most characters used in the countries I mentioned above can be 
>> represented using just ANSI Latin1
>> (which is why I specified *Western Europe*), so UTF-8 will take 1 or 2 
>> bytes for each character,
>> but when you are dealing with the Middle East, India, or Asia (with a 
>> lot of the world's population!), UTF-8 usually takes 3 bytes per 
>> character (non-BMP characters are still not that common, except in 
>> Tweets!)
>>
>
>
> This discussion is well over my head, but out of curiosity, is Greek 
> included as Western Europe? I mainly just care about Greek characters, and 
> possibly German and Swedish.
>
> Daniel.
>
