On Sunday, September 27, 2015 at 5:26:26 PM UTC-4, Páll Haraldsson wrote:
>
> 2015-09-27 20:29 GMT+00:00 Scott Jones <scott.pa...@gmail.com>:
>
>> If it is mainly in North/South America, Western Europe, or Australia/NZ,
>> UTF-8 does OK.
>> UTF-8 is great for data interchange, but can really slow things down if
>> you have many non-ASCII characters
>>
>
> Did you mean non-BMP? Non-ASCII, but BMP ("European") will take 16 bits,
> same as in UTF-16.
>
No. Most characters used in the countries I mentioned above can be
represented using just ANSI Latin1 (which is why I specified *Western
Europe*), so UTF-8 will take 1 or 2 bytes for each character. But when you
are dealing with India or East Asia (with a lot of the world's population!),
UTF-8 usually takes 3 bytes per character, and even Arabic and Hebrew in the
Middle East always take 2 (non-BMP characters are still not that common,
except in Tweets!)

> If you are not talking about indexing, then I'm a little surprised that
> UTF-16, in that case can be faster as it will always be bigger.
>

Processing UTF-8, you have to do a lot of comparisons and branching, or
always be doing table lookups, and the checks for invalid sequences are
painful.

>> (as well as bloat the size of any buffers you need - because you'll need
>> to allocate 50% more space than for UTF-16, to be sure you can hold the
>> same # of characters).
>>
>
> This seems to be a limitation of allocating fixed length
> buffers/non-varchar and/or in number of chars, not bytes.
>

However you say it, if you are expecting up to 100 Kanji characters (just
BMP, for the sake of argument), then you'll need 300 bytes in UTF-8, or 200
bytes in UTF-16. Let's say you want to allow up to 20% non-BMP characters:
that would be 240 bytes for UTF-16, or 320 bytes for UTF-8.

> At least in PostgreSQL, varchar is the preferred "default", that I use..
> Why would I use fixed size buffers (in chars)?
>

Depending on the SQL implementation, VARCHAR (as opposed to a CLOB) may end
up allocating enough bytes to hold the maximum number of *characters* you
specified (e.g. VARCHAR(255)), and how many bytes that is depends on what
character set you used.

UTF-16 is used by Windows APIs, but also ICU, Java, C++ UnicodeString.

>> Python 3 actually picks a 1,2,4 byte representation depending on what
>> characters are in the string (so UTF-16, but with no surrogate pairs, when
>> there are any characters > 0xff, but none > 0xffff).
>>
>
> I know.
> Helps with indexing, unless you want to, write into a string (a
> char, not matching assumption..).
>
> --
> Palli.
>
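
For what it's worth, the per-character sizes and the buffer arithmetic above can be checked with a rough Python sketch. The helper names `encoded_sizes` and `worst_case_bytes` and the sample characters are my own illustrative choices, not anything from a library, and the buffer estimate assumes every BMP character costs the same fixed number of bytes while every non-BMP character costs 4 bytes in either encoding.

```python
# Rough sketch: bytes per character in UTF-8 vs UTF-16, and the
# worst-case buffer sizes discussed above. Helper names are hypothetical.

def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

def worst_case_bytes(n_chars, bmp_bytes, non_bmp_percent=0):
    """Bytes needed for n_chars characters, where up to non_bmp_percent
    of them are non-BMP (4 bytes each, in both UTF-8 and UTF-16) and the
    rest are BMP characters costing bmp_bytes each (an assumption)."""
    non_bmp = n_chars * non_bmp_percent // 100
    return (n_chars - non_bmp) * bmp_bytes + non_bmp * 4

samples = [
    ("ASCII", "A"),
    ("Latin-1", "\u00e9"),          # e with acute accent
    ("Arabic", "\u0628"),           # Arabic letter beh
    ("Devanagari", "\u0915"),       # Devanagari letter ka
    ("Kanji (BMP)", "\u6f22"),
    ("emoji (non-BMP)", "\U0001F600"),
]

for label, ch in samples:
    u8, u16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")

# 100 BMP Kanji: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
print(worst_case_bytes(100, 3), worst_case_bytes(100, 2))
# Same 100 characters, allowing up to 20% non-BMP.
print(worst_case_bytes(100, 3, 20), worst_case_bytes(100, 2, 20))
```

The last two lines reproduce the 300/200 and 320/240 figures; the loop shows that the 3-byte cost starts with scripts such as Devanagari and Kanji, while the Latin-1 and Arabic samples cost 2 bytes in both encodings.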