> On Nov 11, 2019, at 7:49 AM, Jose Isaias Cabrera <jic...@outlook.com> wrote:
> 
> if you want to count characters in languages such as Arabic, Hebrew, Chinese, 
> Japanese, etc., the easiest way is to convert that string to UTF32, and do a 
> string count of that UTF32 variable.

No, the easiest way is to ask your string class/library what the character 
count is, and let _it_ deal with the fiddly details. 

Or to consider why you need the character count in the first place — it’s 
usually not something that’s useful to know. Usually what you’re really asking 
is “how many pixels wide will this render?” or “how many bytes will this 
occupy?” or even “let me iterate over each character”.
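To make those different questions concrete, here is a minimal sketch in Python (my choice for illustration; nothing in this thread depends on it). The helper name `text_lengths` is hypothetical. It shows that "length" has several answers for the same string, and that counting UTF-32 units only gives the code point count, which is not always the number of characters a reader sees.

```python
import unicodedata

def text_lengths(s: str) -> dict:
    """Return several different 'lengths' of the same string."""
    return {
        # Number of Unicode code points (Python's len on str).
        "code_points": len(s),
        # Storage size in bytes when encoded as UTF-8.
        "utf8_bytes": len(s.encode("utf-8")),
        # Number of 16-bit code units in UTF-16 (no byte-order mark).
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # Number of 32-bit code units in UTF-32 (no byte-order mark).
        "utf32_units": len(s.encode("utf-32-le")) // 4,
        # Rough approximation only: code points that are not combining marks.
        # Real grapheme segmentation needs a Unicode-aware library.
        "non_combining": sum(1 for ch in s if unicodedata.combining(ch) == 0),
    }

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
sample = "e\u0301"
lengths = text_lengths(sample)
```

For `sample`, the UTF-32 unit count equals the code point count (2), while a reader sees a single accented letter. None of these numbers says anything about rendered width in pixels.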

At a low level, UTF-8 makes a lot more sense. It’s very compact, which is 
important for cache efficiency as well as storage space. It’s backward compatible 
with ASCII, which is extremely convenient for text-based protocols / file 
formats / languages, and for working with legacy APIs (like <string.h>!)
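As a rough illustration of the ASCII compatibility and size points, here is a small Python sketch. The helper name `encoded_sizes` and the sample string are assumptions made up for the example, not anything from a real API.

```python
def encoded_sizes(s: str) -> dict:
    """Byte size of s in three encodings (little-endian, no byte-order mark)."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Hypothetical ASCII-only sample, typical of text-based protocols and file formats.
ascii_text = "SELECT length(name) FROM t;"

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# For ASCII text, UTF-16 takes 2x and UTF-32 takes 4x the bytes of UTF-8.
sizes = encoded_sizes(ascii_text)
```

Because the bytes are identical for ASCII input, byte-oriented code that only looks for ASCII delimiters keeps working on UTF-8 data unchanged.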

Modern libraries seem to be moving to UTF-8. For instance, Apple’s been 
migrating Swift’s string class from a legacy UTF-16 encoding to UTF-8, and 
playing up the consequent performance and space win. Go has been UTF-8 from the 
start. I don’t know of a single library that’s gone with UTF-32, except maybe 
as an option.

—Jens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
