Glenn Maynard,

> 
> > * strchr
> >   does not work at all, because a UTF-8 character cannot be expressed
> >   with the 'char' type.
> 
> From my understanding of UTF-8, strchr() should work fine to search for
> 7-bit characters, and strstr()/strrstr() for searching for arbitrary
> Unicode characters (searching for them in UTF-8.)

You are correct: you can use strchr with ASCII only.  However, if you null-terminate 
your search character, you can use strstr.  I implemented an extended version of strchr 
that uses two char * pointers: one to the string and the other to the character.
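As a minimal sketch (the name utf8_strchr and the details are my illustration, not 
the actual implementation), such a two-pointer strchr might look like this.  It relies 
on a valid UTF-8 sequence only ever matching at a character boundary:

    #include <string.h>

    /* Find the first occurrence in s of the single UTF-8 character
     * pointed to by c.  c may point into the middle of a larger
     * string; only its lead byte and continuation bytes are read. */
    const char *utf8_strchr(const char *s, const char *c)
    {
        /* Length of the character at c: the lead byte plus any
         * continuation bytes (bytes of the form 10xxxxxx). */
        size_t n = 1;
        while (((unsigned char)c[n] & 0xC0) == 0x80)
            n++;

        for (; *s != '\0'; s++)
            if (strncmp(s, c, n) == 0)   /* strncmp stops at the NUL */
                return s;
        return NULL;
    }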

strcpy and strcat work, but strncpy and strncat have problems.  If you copy n bytes and 
add a null you can end up with a partial character at the end of the string.
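One way to guard against that (utf8_strncpy is a hypothetical name, and this is a 
sketch rather than a drop-in strncpy replacement) is to back up over any trailing 
continuation bytes after a bounded copy, so the result never ends in a partial 
character:

    #include <stddef.h>

    /* Copy at most n-1 bytes of src into dst, always null terminate
     * (unless n is 0), and never leave a truncated UTF-8 sequence at
     * the end of the copy. */
    char *utf8_strncpy(char *dst, const char *src, size_t n)
    {
        size_t i = 0;

        if (n == 0)
            return dst;
        while (i < n - 1 && src[i] != '\0') {
            dst[i] = src[i];
            i++;
        }
        /* If we stopped mid-character, drop the continuation bytes
         * (10xxxxxx) we already copied, then the lead byte itself. */
        if (src[i] != '\0' && ((unsigned char)src[i] & 0xC0) == 0x80) {
            while (i > 0 && ((unsigned char)dst[i - 1] & 0xC0) == 0x80)
                i--;
            if (i > 0)
                i--;
        }
        dst[i] = '\0';
        return dst;
    }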

strcmp works, but if the OS supports a strcasecmp (case-insensitive compare) it won't.  
Speaking of case, there is no standard Linux facility, even with wchar_t, to do correct 
case conversion.  Proper case conversion 
(see http://www.unicode.org/unicode/reports/tr21/) has to allow for the resulting string 
to be a different size from the source string.  For example, "ß" when shifted to upper 
case becomes "SS", adding one character to the string length.  There are a number of 
languages that depend on Unicode special casing, and in many cases the conversion 
depends on the relationship of one character to others in the string.  For example, in 
Greek the upper case sigma ("Σ") lower-cases to either a final ("ς") or a non-final 
("σ") form depending on its position in the word.  I helped with the latest version of 
Unicode to make the standard conform more closely to culturally correct usage.  This is 
why toupper() in any form does not work.  The upper and lower case converters must be 
locale sensitive to support languages like Turkish, Azeri, and Lithuanian.  You might 
get by in Lithuanian, but the Turkish and Azeri dotted and dotless "I" will give you 
grief.  In Turkish "i" <-> "İ" and "ı" <-> "I".  But the worst is Greek, because it 
relies heavily on special case rules to deal with ypogegrammeni (iota subscript) and 
prosgegrammeni (iota adscript).
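To make the buffer-size point concrete, here is a toy illustration, not a real case 
converter and certainly not the locale-sensitive machinery TR21 describes: only the 
German sharp s is handled, and the output is already longer than the input:

    #include <stddef.h>

    /* Toy upper-caser: handles ONLY "ß" (U+00DF, UTF-8 0xC3 0x9F) to
     * show that upper casing can grow the string; everything else is
     * copied unchanged.  Real conversion needs the full Unicode
     * special-casing tables and locale data. */
    size_t toy_toupper_utf8(const char *src, char *dst, size_t dstlen)
    {
        size_t out = 0;

        if (dstlen == 0)
            return 0;
        while (*src != '\0') {
            if ((unsigned char)src[0] == 0xC3 &&
                (unsigned char)src[1] == 0x9F) {
                if (out + 2 >= dstlen)
                    break;                 /* no room: stop cleanly */
                dst[out++] = 'S';          /* one character in ...   */
                dst[out++] = 'S';          /* ... two characters out */
                src += 2;
            } else {
                if (out + 1 >= dstlen)
                    break;
                dst[out++] = *src++;
            }
        }
        dst[out] = '\0';
        return out;                        /* may exceed strlen(src) */
    }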

strlen and strstr work because of the special nature of UTF-8: the leading bytes of a 
character take different values than the trailing bytes.
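That property also makes a character count cheap.  This sketch (utf8_charcount is my 
name for it) counts characters rather than bytes by skipping the continuation bytes; 
strlen itself keeps working, but reports bytes:

    #include <stddef.h>

    /* Count characters, not bytes: continuation bytes all match
     * 10xxxxxx and lead bytes never do, so counting characters is
     * just counting the non-continuation bytes. */
    size_t utf8_charcount(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                count++;
        return count;
    }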

strpbrk, strspn, strcspn and strtok need wide character support.  You cannot use the 
single-byte implementations.
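The reason is that the single-byte versions treat each byte of a multi-byte character 
as an independent set member.  A sketch of a character-aware strcspn (the name and 
code are illustrative), comparing whole characters instead of bytes:

    #include <string.h>

    /* Length in BYTES of the initial segment of s containing no
     * character from reject.  Whole UTF-8 characters are compared,
     * never individual bytes. */
    size_t utf8_strcspn(const char *s, const char *reject)
    {
        const char *p = s;
        while (*p != '\0') {
            size_t n = 1;                     /* this character's length */
            while (((unsigned char)p[n] & 0xC0) == 0x80)
                n++;
            for (const char *r = reject; *r != '\0'; ) {
                size_t m = 1;                 /* reject character length */
                while (((unsigned char)r[m] & 0xC0) == 0x80)
                    m++;
                if (m == n && memcmp(p, r, n) == 0)
                    return (size_t)(p - s);   /* rejected character found */
                r += m;
            }
            p += n;
        }
        return (size_t)(p - s);
    }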

*ptr++ cannot be used to move to the next character in a string.  You need charnext 
and charprev routines.  You also need a charlen function.  You should also have a 
string validate function to ensure that the string is well formed.
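Sketches of what those helpers might look like (the names and details are 
illustrative; charprev appears further down, with the discussion of backward 
scanning):

    /* Advance to the start of the next character: step past the lead
     * byte, then past any continuation bytes (10xxxxxx). */
    const char *utf8_charnext(const char *p)
    {
        if (*p != '\0')
            p++;
        while (((unsigned char)*p & 0xC0) == 0x80)
            p++;
        return p;
    }

    /* Length in bytes of the character whose lead byte is *p. */
    int utf8_charlen(const char *p)
    {
        unsigned char b = (unsigned char)*p;
        if (b < 0x80) return 1;     /* ASCII                          */
        if (b < 0xC0) return -1;    /* continuation byte: not a start */
        if (b < 0xE0) return 2;     /* 110xxxxx                       */
        if (b < 0xF0) return 3;     /* 1110xxxx                       */
        if (b < 0xF8) return 4;     /* 11110xxx                       */
        return -1;                  /* invalid lead byte              */
    }

    /* Well-formedness check: every lead byte must be followed by the
     * right number of continuation bytes.  (Overlong forms and
     * surrogate code points are not rejected here; a production
     * validator must also handle those.) */
    int utf8_validate(const char *s)
    {
        while (*s != '\0') {
            int i, n = utf8_charlen(s);
            if (n < 0)
                return 0;
            for (i = 1; i < n; i++)
                if (((unsigned char)s[i] & 0xC0) != 0x80)
                    return 0;       /* truncated or malformed */
            s += n;
        }
        return 1;
    }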


> 
> Major speed hit, of course, but there's no way to fix that within UTF-8
> (and even using wchar_t is a major speed hit, just due to memory usage.)
> 

You are right that while functions like strstr will work with UTF-8, they are much 
slower.  strstr compares the matching string to the source byte by byte until a 
mismatch, then it increments the source by one byte.  If this byte is a continuation 
character there will be no hit.  This should not be too much of a problem, since it 
should immediately mismatch.  It is a bit slower but not too bad.

I don't think that the extra paging due to extra memory usage is too bad.  We get 
bigger and faster systems every day.

> > I think the simplest way to substitute all these functions is to use
> > wide characters.  The standard C library has wchar_t substitutions of the
> > above functions.  And there are conversion functions between "multibyte
> > character" and "wide character".  Note that "multibyte character" does
> > not mean the character is always multibyte.  It is "locale-dependent
> > encoding".  This means that, in ISO-8859-1 locale, "multibyte character"
> > is ISO-8859-1.  In Big5 locale, "multibyte character" is Big5.  I.e.,
> > if you write your software using "multibyte character" and "wide
> > character",
> 
> Well, that means either a major memory hit for string-intensive
> programs (using wchar_t internally exclusively) or a lot of conversion
> (using multibyte internally); both imply a speed hit (above the
> expected.)  Both imply a lot of converting (the first, whenever you
> read or write to disk, files, for filenames, etc; the second, every
> time you call a wide C function.)  Using both WC and MB internally
> is rather annoying, too (nobody sane wants to deal with more than one
> string type.)

Another problem you have is that wchar_t has portability problems.  Solaris uses 
wchar_t for non-Unicode encodings.  AIX, like Windows, uses a 16-bit instead of a 
32-bit wchar_t.

With UCS-4 and UCS-2 it was not too hard to mask the differences.  But with UTF-16 you 
have to deal with surrogate pairs that don't appear in UTF-32, so your code is 
different.
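For instance, a character above U+FFFF occupies one UTF-32 unit but two UTF-16 units 
that have to be recombined.  A minimal decoder sketch (real code must also reject 
unpaired surrogates):

    #include <stdint.h>

    /* Combine a UTF-16 surrogate pair into one code point.
     * Expects hi in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF. */
    uint32_t utf16_pair_to_codepoint(uint16_t hi, uint16_t lo)
    {
        return 0x10000
             + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    }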

If you use wide character support you have to use it everywhere.  You cannot convert 
a string from UTF-8 to UTF-32, tokenize it with wcstok, and expect the results to map 
back to the original UTF-8 string.  You have to go WC all the way.  That means a lot 
of program constants will also have to be changed.  With UTF-8 you don't have to 
change any constants that are pure ASCII.

The big hit comes with debugging.  It is a pain to read UTF-32 strings.  This really 
increases the development cost, especially with non-i18n programmers who don't keep a 
copy of the Unicode book on their desks at all times.

> 
> All C string functions can be implemented easily for UTF-8; the only
> hard part is doing it efficiently, and without converting the whole
> thing to wchar_t first.  Some functions are straightforward to
> implement reasonably fast, but you're always stuck with the UTF-8
> decoding logic ...
> 
> Is gdb yet smart enough to convert wchar_t * to the locale when displaying
> strings?  I doubt it; this probably makes using wchar_t internally
> harder to debug.
> 
> It also implies not taking advantage of some of the better aspects of
> UTF-8, like being able to do a strrchr() and strrstr() without having
> to scan from the beginning of the string.

For those who don't know, what Glenn is saying is that while UTF-8 is an MBCS 
(multi-byte character set), it has a real advantage in that the ranges of byte values 
for initial character bytes and continuation character bytes do not overlap.  This 
means that you can scan backwards and find the start of the previous character very 
efficiently.  Most other MBCS encodings require that you start at the beginning of the 
string and scan forward until you find the previous character.  Some MBCS 
implementations use very sophisticated scan routines that can do some backwards scans 
on some character sets, but often they are slower than a forward scan would have been, 
especially for short strings.
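That backward scan is tiny.  A charprev sketch, companion to the charnext above 
(again, illustrative names and code):

    /* Step back to the start of the previous character: back up one
     * byte, then over any continuation bytes (10xxxxxx) until a lead
     * byte is reached.  start marks the beginning of the string. */
    const char *utf8_charprev(const char *start, const char *p)
    {
        if (p > start)
            p--;
        while (p > start && ((unsigned char)*p & 0xC0) == 0x80)
            p--;
        return p;
    }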

> 
> Of course, supporting arbitrary encodings is nice, but I wouldn't
> want to complicate a program too badly for it.  (That's from my "don't
> go out of your way to support obsolete software" perspective, of course--
> all those other annoying encodings being the obsolete software--but 
> it's not always that simple.)
> 

You can do that with xIUA and ICU.  In fact you might want to use the same sort of 
support with glibc.  That way, if you want to go to ICU later or port to another 
platform, you only have one piece of code to change.

xIUA supports different encodings dynamically.  You can have a routine that gets 
called with EUC-JP, UTF-8 or UTF-32 data, and they all are handled correctly.  You can 
also invoke the UTF-8 support explicitly and save the overhead of checking to see 
which routine to call.  If you are communicating with browsers, for example, they 
don't all support UTF-8 properly.  It even has a bonus for HTML and XML in that you 
can tell the converter to automatically convert any character that does not convert 
into an NCR sequence.  This way you can send Japanese with the iso-8859-1 code page 
and not lose a character.
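The NCR fallback in miniature (a sketch of the general technique, not xIUA's actual 
API): a code point the target code page cannot represent is emitted as an HTML/XML 
numeric character reference instead of being dropped:

    #include <inttypes.h>
    #include <stdio.h>

    /* Format an unconvertible code point as a numeric character
     * reference, e.g. U+65E5 ("日") becomes "&#x65E5;". */
    int emit_ncr(char *buf, size_t buflen, uint32_t codepoint)
    {
        return snprintf(buf, buflen, "&#x%" PRIX32 ";", codepoint);
    }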

It is open-source code, so you can use the UTF-8 support code, including a fast 
transform routine that, unlike other converters, is stateless but will support 
converting data in chunks if you have to use a two-stage conversion or require buffer 
limits such as a maximum log record size.

Carl



-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
