Re: [julia-users] Re: UTF16String or UTF8String?

Scott Jones Sun, 27 Sep 2015 14:42:07 -0700


On Sunday, September 27, 2015 at 5:26:26 PM UTC-4, Páll Haraldsson wrote:
>
> 2015-09-27 20:29 GMT+00:00 Scott Jones <scott.pa...@gmail.com>:
>
>> If it is mainly in North/South America, Western Europe, or Australia/NZ, 
>> UTF-8 does OK.
>> UTF-8 is great for data interchange, but can really slow things down if 
>> you have many non-ASCII characters
>>
>
> Did you mean non-BMP? Non-ASCII, but BMP ("European") will take 16 bits, 
> same as in UTF-16.
>


No.  Most characters used in the countries I mentioned above can be
represented using just Latin-1 (which is why I specified *Western Europe*),
so UTF-8 will take 1 or 2 bytes for each character.
But when you are dealing with the Middle East, India, or Asia (with a lot
of the world's population!), UTF-8 takes 2 bytes per character for Arabic
and Hebrew, and usually 3 bytes for Indic and East Asian scripts
(non-BMP characters are still not that common, except in Tweets!).
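
To make the per-character sizes concrete, here is a small Python sketch. The
sample characters and the helper name encoded_sizes are arbitrary choices for
illustration, not anything from this thread.

```python
# Illustrative only: sample characters are arbitrary picks.
samples = {
    "ASCII a": "a",
    "Latin-1 e-acute": "\u00e9",
    "Hebrew alef": "\u05d0",        # below U+0800, so 2 bytes in UTF-8
    "Devanagari ka": "\u0915",
    "Kanji U+6F22": "\u6f22",
    "Emoji (non-BMP)": "\U0001F600",
}

def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" would prepend.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for name, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

Only the non-BMP character costs the same in both encodings; every BMP
character from U+0800 upward costs 3 bytes in UTF-8 against 2 in UTF-16.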
 

> If you are not talking about indexing, then I'm a little surprised that 
> UTF-16, in that case can be faster as it will always be bigger.
>

Processing UTF-8, you have to do a lot of comparisons and branching, or
always be doing table lookups,
and the checks for invalid sequences are painful.
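
As a rough sketch of what that branching looks like, here is a minimal
single-code-point UTF-8 decoder in Python. The function name decode_one is
hypothetical, and this is a plain illustration of the checks involved, not
how Julia or any particular library implements decoding.

```python
def decode_one(buf, i=0):
    """Decode one code point from bytes `buf` starting at index `i`.

    Returns (code_point, next_index); raises ValueError on invalid input.
    """
    b0 = buf[i]
    if b0 < 0x80:                        # 1 byte: ASCII
        return b0, i + 1
    elif 0xC2 <= b0 <= 0xDF:             # 2 bytes (0xC0/0xC1 are always overlong)
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:             # 3 bytes
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:             # 4 bytes (lead > 0xF4 exceeds U+10FFFF)
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")

    if i + need >= len(buf):
        raise ValueError("truncated sequence")

    for k in range(1, need + 1):
        b = buf[i + k]
        if (b & 0xC0) != 0x80:           # continuation bytes look like 10xxxxxx
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)

    if cp < lowest:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    if cp > 0x10FFFF:
        raise ValueError("code point out of range")
    return cp, i + need + 1
```

By comparison, a UTF-16 decoder only has to ask whether a 16-bit unit is a
high surrogate and, if so, whether a low surrogate follows it.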
 

>> (as well as bloat the size of any buffers you need - because you'll need 
>> to allocate 50% more space than for UTF-16, to be sure you can hold the 
>> same # of characters).
>>
>
> This seems to be a limitation of allocating fixed length 
> buffers/non-varchar and/or in number of chars, not bytes.
>

However you say it, if you are expecting up to 100 Kanji characters (just
BMP, for the sake of argument),
then you'll need 300 bytes for UTF-8, or 200 bytes for UTF-16.
Let's say you want to allow up to 20% non-BMP characters, so that would be
240 bytes for UTF-16 (80 x 2 + 20 x 4), or
320 bytes for UTF-8 (80 x 3 + 20 x 4).
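
The same arithmetic as a small Python sketch. The function name
worst_case_bytes is made up for illustration, and it assumes every BMP
character needs the full 3 bytes in UTF-8, as Kanji do.

```python
def worst_case_bytes(n_chars, non_bmp_fraction=0.0):
    """Return (utf8_bytes, utf16_bytes) needed to hold n_chars characters.

    Assumes BMP characters cost 3 bytes in UTF-8 and 2 in UTF-16, and that
    non-BMP characters cost 4 bytes in both encodings.
    """
    n_non_bmp = round(n_chars * non_bmp_fraction)
    n_bmp = n_chars - n_non_bmp
    utf8 = n_bmp * 3 + n_non_bmp * 4
    utf16 = n_bmp * 2 + n_non_bmp * 4
    return utf8, utf16

print(worst_case_bytes(100))        # all BMP
print(worst_case_bytes(100, 0.2))   # 20% non-BMP
```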

> At least in PostgreSQL, varchar is the preferred "default", that I use.. 
> Why would I use fixed size buffers (in chars)?
>

Depending on the SQL implementation and the character set you use, a
VARCHAR (as opposed to a CLOB) may end up allocating enough bytes to hold
the maximum number of *characters* you specified (e.g. VARCHAR(255)).

>> UTF-16 is used by Windows APIs, but also ICU, Java, C++ UnicodeString. 
>> Python 3 actually picks a 1,2,4 byte representation depending on what 
>> characters are in the string (so UTF-16, but with no surrogate pairs, when 
>> there are any characters > 0xff, but none > 0xffff).
>>
>
> I know. Helps with indexing, unless you want to, write into a string (a 
> char, not matching assumption..).
>
> -- 
> Palli.
>
