Possible Unicode Problems in Busybox - Collect and Discussion
Hi all!

I am starting this thread to collect and discuss the possible Unicode (UTF-8) problems we have detected and which may need further investigation:

1) sed s/./x/
   The dot matches bytes, not characters. This at least hits uClibc builds; glibc seems to work correctly with the full set of locale files. This bug may also affect other applets using regular expressions.

2) shell substitution ${#var}
   Should this length operation give the number of bytes in var or the number of characters (which may differ for multibyte characters, as in UTF-8)?

3) applet expr, function length STRING
   This may also hit the *index*, *substr* and *match* functions. Do we look at character positions or at byte positions? What do the specs say on this?

4) applet printf, string formats "%Ns"
   Does this N mean character positions or bytes? The underlying C printf has worked with bytes for decades. The man page talks about character positions, but printf from bash uses bytes.

5) applet awk, function length()
   This may also hit other string functions, like *index*, *match*, *substr*, *sub*, etc. Those functions have worked with byte positions for decades, neglecting multibyte characters. The specs don't seem to be concrete on this. Changing things may break many existing scripts!

Are there further points where we get hit on this topic?

--
Harald
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Additional commands which may be hit by this question:

cut -c, -f

fold -w
  Looks as if BB does it right, but differently from upstream.

sort, position specification
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Just remember that UTF-8 is not related to wchar; it's just a series of chars displayed as a single column.

I've seen several implementations which use mbtowc functions to test some special chars; this is not correct for UTF-8 in my opinion.

If cut fields support strings bigger than a single char, there should be no problem: the series is found in the input text.
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi!

> if cut fields supports strings bigger than a single char, there
> should be no problem, the serie is found in input text.

$ echo -n äöü | hd
c3 a4 c3 b6 c3 bc

$ echo -n äöü | cut -c1 | hd
c3 0a

$ echo -n äöü | cut -c2 | hd
a4 0a

This shows that the position given with cut -c does not pick the correct character. BB behaves the same as upstream.

cut has a -b option to specify the byte position, but -c is supposed to use character positions. So I expect either -c1 (when counted from zero) or -c2 (when counted from one) to output the "ö" (o-umlaut) from the input text.

--
Harald
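The behaviour expected from cut -c above can be sketched in C. This is only an illustration, not BusyBox code: the name utf8_nth_char and its interface are made up for this note, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* True for UTF-8 continuation bytes (10xxxxxx, i.e. 0x80..0xBF). */
static int is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/*
 * Locate the pos-th character of s (1-based, like cut -c).
 * On success, stores the character's byte length in *len and returns
 * a pointer to its first byte. Returns NULL if pos is 0 or if s has
 * fewer than pos characters.
 * Hypothetical helper; lenient on malformed input (a stray
 * continuation byte at the start is treated as a character).
 */
const char *utf8_nth_char(const char *s, size_t pos, size_t *len)
{
    const char *start;

    if (pos == 0)
        return NULL;
    while (*s) {
        start = s++;                        /* first byte of a character */
        while (is_cont((unsigned char)*s))  /* swallow its continuation bytes */
            s++;
        if (--pos == 0) {
            *len = (size_t)(s - start);
            return start;
        }
    }
    return NULL;
}
```

For the input äöü (bytes c3 a4 c3 b6 c3 bc), position 2 yields the two bytes c3 b6, i.e. the whole "ö", where byte-based selection returns the lone byte a4.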
Re: Possible Unicode Problems in Busybox - Collect and Discussion
In this case yes, indeed; my mblen() function posted some days ago could be used to prevent the display of cut-off char series.

The real problem with Unicode is UTF-16, which contains \0 chars (but that's another, uncommon problem).
Re: Possible Unicode Problems in Busybox - Collect and Discussion
> ive seen several implementations which use mbtowc functions to test some
> special chars, this is not correct for utf 8 in my opinion.

Counting the number of UTF-8 characters is really simple: just count all bytes except those with a value in the range 0x80 to 0xBF. This has two exceptions, 0xFE and 0xFF, which are not official UTF-8 characters, but I think it's not wrong to count them and behave as such.

Counting can be done with one logical and one compare instruction:

    if ((c ^ 0x40) < 0xC0) n++
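As a minimal sketch, the counting rule above could be wrapped in a function like the one below. The name utf8_count is illustrative, and the cast to unsigned char is an assumption added here (it is not part of the one-liner in the message): without it the comparison only behaves as described where plain char is unsigned.

```c
#include <stddef.h>

/*
 * Count UTF-8 characters by counting every byte that is NOT a
 * continuation byte (0x80..0xBF).
 * XOR with 0x40 maps 0x80..0xBF onto 0xC0..0xFF and every other byte
 * below 0xC0, so a single compare separates the two groups.
 * The input is assumed to be valid UTF-8; nothing is validated.
 */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s ^ 0x40) < 0xC0)
            n++;
    return n;
}
```

For äöü (six bytes) this returns 3; for plain ASCII it returns the same value as strlen().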
Re: Possible Unicode Problems in Busybox - Collect and Discussion
> The real problem with unicode is utf-16 which contains \0 chars (but its
> another and uncommon problem)

This raises an interesting question: do we want to add UTF-8 support to BB, or full multibyte support? The former may be simpler, the latter more correct.
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Wed, Aug 13, 2014 at 1:40 PM, Harald Becker wrote:
> 2) shell substitution ${#var}
> Does this length operation shall give the number of bytes in var or the
> number of characters (which may differ for multi byte characters, like
> UTF-8).

bash gives the number of Unicode chars.
I just fixed both ash and hush to do the same.

> 4) applet printf, string formats "%Ns"
> Does this N mean character positions or bytes. The underlying C printf used
> to work with bytes for decades. The man page talks about character
> positions, but printf from bash uses bytes.

Also, printf needs to support \u

--
vda
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Wed, Aug 13, 2014 at 3:42 PM, Harald Becker wrote:
>> ive seen several implementations which use mbtowc functions to test some
>> special chars, this is not correct for utf 8 in my opinion.
>
> To count the number of UTF-8 characters is really simple, just count all
> bytes except those with value in range 0x80 to 0xBF. This has two exceptions
> 0xFE and 0xFF which are no official UTF-8 characters, but I think it's not
> wrong to count and behave as such.
>
> counting can be done with one logical an one compare instruction:
>
> if ((c ^ 0x40) < 0xC0) n++

include/{libbb,unicode}.h already have a bunch of helpers to do
unicode_strlen(), and a few other typical functions:

typedef struct uni_stat_t {
        unsigned byte_count;
        unsigned unicode_count;
        unsigned unicode_width;
} uni_stat_t;

/* Returns a string with unprintable chars replaced by '?' or
 * SUBST_WCHAR. This function is unicode-aware. */
const char* FAST_FUNC printable_string(uni_stat_t *stats, const char *str);
/* Number of unicode chars. Falls back to strlen() on invalid unicode */
size_t FAST_FUNC unicode_strlen(const char *string);
/* Width on terminal */
size_t FAST_FUNC unicode_strwidth(const char *string);
enum {
        UNI_FLAG_PAD = (1 << 0),
};
char* FAST_FUNC unicode_conv_to_printable(uni_stat_t *stats, const char *src);
char* FAST_FUNC unicode_conv_to_printable_fixedwidth(/*uni_stat_t *stats,*/ const char *src, unsigned width);
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Wed, Aug 13, 2014 at 4:01 PM, Harald Becker wrote:
>> The real problem with unicode is utf-16 which contains \0 chars (but its
>> another and uncommon problem)
>
> This unveils an interesting question: Do we want to add UTF-8 support to BB
> or full multi byte support. The former may be simpler, the later more
> correct.

The world seems to be standardizing on UTF-8.
Thank God, supporting a gazillion encodings is no fun.
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Denys!

>> This unveils an interesting question: Do we want to add UTF-8 support to BB
>> or full multi byte support. The former may be simpler, the later more
>> correct.
>
> The world seems to be standardizing on utf-8.
> Thank God, supporting gazillion of encodings is no fun.

Full ACK.

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Denys!

> The world seems to be standardizing on utf-8.
> Thank God, supporting gazillion of encodings is no fun.

You say this, but libbb/unicode.c contains a unicode_strlen that calls this complex mb-to-wc conversion function to count the number of characters. Those multibyte functions tend to be highly complex and slow (I don't know if they have gotten better). For just UTF-8, things can be optimized.

e.g.

size_t utf8len( const char* s )
{
  size_t n = 0;
  while (*s)
    if ((*s++ ^ 0x40) < 0xC0)
      n++;
  return n;
}

size_t mystrlen( const char* s )
{
  return utf8_enabled ? utf8len(s) : strlen(s);
}

This looks like more code, but avoids the inclusion of the mb functions. Most compilers should produce fast code for utf8len.

utf8len is for UTF-8-only usage; mystrlen may be used to switch between an 8-bit locale and UTF-8. If we could switch to UTF-8 only, we could forget about mystrlen and always use utf8len.

Another fast function I use for UTF-8: skip to the Nth UTF-8 character in a string (returns a pointer to the trailing \0 if N > number of UTF-8 chars in the string):

char *utf8skip( char const* s, size_t n )
{
  for ( ; n && *s; --n )
    while ((*++s ^ 0x40) >= 0xC0);
  return (char*)s;
}

Those are examples; other functions may also be optimized. It all depends on the question of whether those darn big mb functions shall be used or not.

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Denys!

>> 2) shell substitution ${#var}
>> Does this length operation shall give the number of bytes in var or the
>> number of characters (which may differ for multi byte characters, like
>> UTF-8).
>
> bash gives number of Unicode chars.
> I just fixed both ash and hush to do the same.

bash seems to be the only shell which does this. So is this a bash-ism?

... and "expr length $var" (upstream) still returns the size in bytes, not characters.

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Denys!

>> 2) shell substitution ${#var}
>> Does this length operation shall give the number of bytes in var or the
>> number of characters (which may differ for multi byte characters, like
>> UTF-8).
>
> bash gives number of Unicode chars.
> I just fixed both ash and hush to do the same.

Add a big warning to the release notes; I think this is a shell script breaker. Shell scripts which rely on getting the number of bytes may now fail, because they allocate less space or copy fewer characters.

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Wed, 2014-08-13 at 19:23 +0200, Harald Becker wrote:
> > bash gives number of Unicode chars.
> > I just fixed both ash and hush to do the same.
>
> bash seams to be the only shell which does this. So is this a
> bash-ism?

The POSIX standard says that ${#var} gives the length of variable var "in characters". I can't find, offhand, a definition of "characters" in the standard, but one would assume that if they meant bytes they would say that...?

Probably worth a question to the POSIX folks for a clarification.
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Wed, 2014-08-13 at 13:52 -0400, Paul Smith wrote:
> The POSIX standard says that ${#var} give the length of variable var
> "in characters". I can't find, offhand, a definition of "characters"
> in the standard

D'oh! It was only in the most obvious place:

  3.87 Character

  A sequence of one or more bytes representing a single graphic
  symbol or control code.

So it seems like bash has this right.
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Paul!

>> The POSIX standard says that ${#var} give the length of variable var
>> "in characters". I can't find, offhand, a definition of "characters"
>> in the standard
>
> D'oh! It was only in the most obvious place:
>
>   3.87 Character
>
>   A sequence of one or more bytes representing a single graphic
>   symbol or control code.
>
> So it seems like bash has this right.

Oh, nice! If this is what the standard says, I like it. I just wanted to be objective.

It will be a script breaker anyway, whatever change is made here.

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
> 3) applet expr, function length STRING
> This also may hit the *index*, *substr* and *match* functions. Do we look at
> positions of character or at byte positions. What does the specs say on this?

Looks like they removed *length*, *index*, *substr* and *match* from the specification of this applet???

... and the spec says that

  expr "$var" : '.*'

shall return the number of characters.

$ xxx=äöü
$ expr "$xxx" : '.*'
6

... ouch! (Both BB and upstream)
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Denys!

>> 2) shell substitution ${#var}
>> Does this length operation shall give the number of bytes in var or the
>> number of characters (which may differ for multi byte characters, like
>> UTF-8).
>
> bash gives number of Unicode chars.
> I just fixed both ash and hush to do the same.

You fixed this one, but there are two more related shell substitutions to modify (sorry, I didn't find them earlier):

${xxx:offset}
${xxx:offset:length}

offset and length are given in number of characters in bash; BB ash uses bytes.
Re: Possible Unicode Problems in Busybox - Collect and Discussion
> From: Harald Becker
> Sent: Wednesday, 13 August 2014 19:07
> ...
> size_t utf8len( const char* s )
> {
>   size_t n = 0;
>   while (*s)
>     if ((*s++ ^ 0x40) < 0xC0)
>       n++;
>   return n;
> }
> ...
> char *utf8skip( char const* s, size_t n )
> {
>   for ( ; n && *s; --n )
>     while ((*++s ^ 0x40) >= 0xC0);
>   return (char*)s;
> }

These do not work if char is signed.

--
Regards,
Dietmar Schindler
manroland web systems GmbH
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Wed, Aug 13, 2014 at 07:06:38PM +0200, Harald Becker wrote:
> Hi Denys !
>
> > The world seems to be standardizing on utf-8.
> > Thank God, supporting gazillion of encodings is no fun.
>
> You say this, but libbb/unicode.c contains a unicode_strlen calling
> this complex mb to wc conversion function to count the number of
> characters. Those multi byte functions tend to be highly complex and
> slow (don't know if they have gone better). For just UTF-8, things
> can be optimized.

This depends on your libc. In musl, the only thing slow about them is having to account for some idiotic special cases the standard allows (special meanings for null pointers in each of the arguments), and even that should not be slow on machines with proper branch prediction.

> e.g.
>
> size_t utf8len( const char* s )
> {
>   size_t n = 0;
>   while (*s)
>     if ((*s++ ^ 0x40) < 0xC0)
>       n++;
>   return n;
> }

This function is only valid if the string is known to be valid UTF-8. Otherwise it hides errors, which may or may not be problematic depending on what you're using it for.

> Another fast function I use for UTF-8 ... skip to Nth UTF-8
> character in a string (returns a pointer to trailing \0 if N >
> number of UTF-8 chars in string):
>
> char *utf8skip( char const* s, size_t n )
> {
>   for ( ; n && *s; --n )
>     while ((*++s ^ 0x40) >= 0xC0);
>   return (char*)s;
> }

This code is invalid; it's assuming char is unsigned. In practice, *++s ^ 0x40 is going to be negative on most archs. Better would be doing an unsigned range check like (unsigned char)*++s-0x80<0x40U.

Of course it also gets tripped up badly on invalid sequences.

Rich
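A sketch of how the skip function might look with the unsigned range check suggested above. The names utf8_is_continuation and utf8_skip_chars are invented for this note; only the expression inside the helper comes from the message. Like the original, it assumes valid UTF-8 and performs no validation.

```c
#include <stddef.h>

/*
 * Unsigned range check: true only for bytes 0x80..0xBF.
 * The cast makes the byte value 0..255 regardless of whether plain
 * char is signed. Subtracting 0x80 yields an int, which is converted
 * to unsigned for the comparison with 0x40U, so bytes below 0x80 wrap
 * around to huge values and fail the test.
 */
static int utf8_is_continuation(char c)
{
    return (unsigned char)c - 0x80 < 0x40U;
}

/*
 * Advance past n characters of s. Returns a pointer to the
 * terminating NUL if s holds fewer than n characters.
 */
const char *utf8_skip_chars(const char *s, size_t n)
{
    for (; n && *s; --n) {
        s++;                                /* step over the lead byte */
        while (utf8_is_continuation(*s))    /* and its continuation bytes */
            s++;
    }
    return s;
}
```

With the string "aöz" (bytes 61 c3 b6 7a), skipping two characters lands on the "z" at byte offset 3 on any platform, whether char is signed or not.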
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Rich!

>> You say this, but libbb/unicode.c contains a unicode_strlen calling
>> this complex mb to wc conversion function to count the number of
>> characters. Those multi byte functions tend to be highly complex and
>> slow (don't know if they have gone better). For just UTF-8, things
>> can be optimized.
>
> This depends on your libc.

... that is why I added "don't know if they have gone better" ... really good that musl is fast on this ... the problem is that BB is more likely linked with glibc or uClibc ... there the results are not so great :(

>> size_t utf8len( const char* s )
>> {
>>   size_t n = 0;
>>   while (*s)
>>     if ((*s++ ^ 0x40) < 0xC0)
>>       n++;
>>   return n;
>> }
>
> This function is only valid if the string is known to be valid UTF-8.

Yes, I said it's for UTF-8.

> Otherwise it hides errors, which may or may not be problematic
> depending on what you're using it for.

If you know you are using UTF-8, you do not need to check every string over and over again; anything else is pure paranoia. It is robust, as it will not run away on anything which is a valid C string.

>> Another fast function I use for UTF-8 ... skip to Nth UTF-8
>> character in a string (returns a pointer to trailing \0 if N >
>> number of UTF-8 chars in string):
>>
>> char *utf8skip( char const* s, size_t n )
>> {
>>   for ( ; n && *s; --n )
>>     while ((*++s ^ 0x40) >= 0xC0);
>>   return (char*)s;
>> }
>
> This code is invalid; it's assuming char is unsigned. In practice,
> *++s ^ 0x40 is going to be negative on most archs. Better would be
> doing an unsigned range check like (unsigned char)*++s-0x80<0x40U.

Yes, I missed the type cast ... sorry for this, see previous mail.

> Of course it also gets tripped up badly on invalid sequences.

How can it get tripped? It silently skips over invalid sequences (of 0x80 to 0xBF, until the next leading byte of a sequence). It shall not get stuck in any way. Or tell me exactly how ...

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
> size_t utf8len( const char* s )
> {
>   size_t n = 0;
>   while (*s)
>     if ((*s++ ^ 0x40) < 0xC0)
>       n++;
>   return n;
> }

You need to test s != NULL, else *s will crash.
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Thu, Aug 14, 2014 at 07:16:36PM +0200, Tanguy Pruvot wrote:
> size_t utf8len( const char* s )
> {
>   size_t n = 0;
>   while (*s)
>     if ((*s++ ^ 0x40) < 0xC0)
>       n++;
>   return n;
> }
>
> you need to test s != NULL, else *s will crash

Says who? NULL is not a valid pointer. Should you also check for things like s != (char *)-1 ? What value would you return then, anyway?

Rich
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Thu, Aug 14, 2014 at 07:14:52PM +0200, Harald Becker wrote:
> Hi Rich!
>
> >> You say this, but libbb/unicode.c contains a unicode_strlen calling
> >>this complex mb to wc conversion function to count the number of
> >>characters. Those multi byte functions tend to be highly complex and
> >>slow (don't know if they have gone better). For just UTF-8, things
> >>can be optimized.
> >
> >This depends on your libc.
>
> that is, why I added "don't know if gone better" ... really
> good when musl is fast on this ... the problem is BB is more likely
> linked with glibc or uClibc ... there the results are not so great
> :(

I think uClibc is pretty fast at this too. It's glibc that's horribly slow. Rough comparison:

For processing a full string buffer, musl is roughly twice as fast as uClibc, and uClibc is roughly twice as fast as glibc.

For byte-by-byte processing: musl is roughly 3x as fast as uClibc and roughly 4x as fast as glibc.

Source: my comparison at http://www.etalabs.net/compare_libcs.html

Presumably you would use a full string operation here (mbstowcs with null output pointer) for computing length in characters.

> >>size_t utf8len( const char* s )
> >>{
> >>  size_t n = 0;
> >>  while (*s)
> >>    if ((*s++ ^ 0x40) < 0xC0)
> >>      n++;
> >>  return n;
> >>}
> >
> >This function is only valid if the string is known to be valid UTF-8.
>
> Yes, I told it's for UTF-8.

Yes, but there's a difference between "nominally UTF-8" and "known-valid UTF-8".

> >Otherwise it hides errors, which may or may not be problematic
> >depending on what you're using it for.
>
> If you know you are using UTF-8 you do not need to check every
> string over and over again, else it's pure paranoia. It is robust,
> as it will not run away on anything which is valid C string.

Well, if the string comes from a source outside of your control, you need to check it at least once. But you might not want to check and reject it at the original point of input, e.g. if you want to be able to preserve arbitrary byte sequences that might not be UTF-8, e.g. an argument that's a filename in an invalid encoding which you're trying to delete or rename to fix. So IMO it makes a lot more sense to do your checking at the point of treating the string as a sequence of characters, even if it happens multiple times. The cost is not high if your implementation is efficient.

> >Of course it also gets tripped up badly on invalid sequences.
>
> How can it get tripped? It silently skip over invalid sequences (of
> 0x80 to 0xBF until next leading of a sequence). It shall not get
> stuck in any way. Or tell me exactly how ...

By itself it's not a problem, but the interaction with other code may be a problem if the other code does not follow exactly the same conventions.

Rich
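The "full string operation" mentioned above might look like the following sketch. The wrapper name locale_char_count is made up for this note. Calling mbstowcs() with a null destination to obtain only the length is behaviour specified by POSIX; the result depends on the current locale's encoding, so a program would normally call setlocale(LC_ALL, "") at startup first.

```c
#include <stdlib.h>

/*
 * Length of s in characters, as defined by the encoding of the
 * current locale (LC_CTYPE). With a null destination, mbstowcs()
 * converts nothing and only reports the number of wide characters
 * the conversion would produce.
 * Returns (size_t)-1 if s contains an invalid multibyte sequence.
 */
size_t locale_char_count(const char *s)
{
    return mbstowcs(NULL, s, 0);
}
```

Unlike a byte-pattern counter, this reports invalid input instead of silently producing a number, which is the trade-off under discussion.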
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi!

On 14.08.2014 19:16, Tanguy Pruvot wrote:
> you need to test s != NULL, else *s will crash

It is like the other str functions of the libc: you need to call the function with a valid pointer.

... or else, if you like, add "if (!s) return 0;" ahead of the while.

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Rich!

> I think uClibc is pretty fast at this too. It's glibc that's horribly
> slow. Rough comparison:

Pretty fast is still slower than UTF-8-optimized functions.

> For processing a full string buffer, musl is roughly twice as fast as
> uClibc, and uClibc is roughly twice as fast as glibc.

You don't need to make ads for musl here; I would like to see prebuilt versions of BB statically linked with musl.

> Presumably you would use a full string operation here (mbstowcs with
> null output pointer) for computing length in characters.

Do you remember my question: only UTF-8, or full multibyte locale? It is exactly this decision. The former may be optimized, the latter is more accurate.

>> If you know you are using UTF-8 you do not need to check every
>> string over and over again, else it's pure paranoia. It is robust,
>> as it will not run away on anything which is valid C string.
>
> Well if the string comes from a source outside of your control, you
> need to check it at least once. But you might not want to check and
> reject it at the original point of input, e.g. if you want to be able
> to preserve arbitrary byte sequences that might not be UTF-8, e.g. an
> argument that's a filename in an invalid encoding which you're trying
> to delete or rename to fix. So IMO it makes a lot more sense to do
> your checking at the point of treating the string as a sequence of
> characters, even if it happens multiple times. The cost is not high if
> your implementation is efficient.

... and how do you want to behave in case of invalid UTF-8 sequences? My functions just skip over stray codes of 0x80..0xBF and synchronize on the next valid UTF-8 leading byte. How would you count invalid sequences?

>>> Of course it also gets tripped up badly on invalid sequences.
>>
>> How can it get tripped? It silently skip over invalid sequences (of
>> 0x80 to 0xBF until next leading of a sequence). It shall not get
>> stuck in any way. Or tell me exactly how ...
>
> By itself it's not a problem, but the interaction with other code may
> be a problem if the other code does not follow exactly the same
> conventions.

Sure, you can't mix multibyte functions with pure UTF-8 functions; you always need to look at what type of function you call in your code. So what's different here?

... and the convention is just UTF-8 (even with invalid sequences), not a mixture with other multibyte codes. Not so much of a convention required?

The functions have been designed carefully not to be trapped by invalid sequences. I know they look extremely simple, but this is part of the optimization.

... remember: we are not talking about the ability to work with other multibyte locales. The assumption was pure ASCII or UTF-8.

--
Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Thu, Aug 14, 2014 at 09:09:02PM +0200, Harald Becker wrote: > Hi Rich! > > > I think uClibc is pretty fast at this too. It's glibc that's horribly > >slow. Rough comparison: > > Pretty fast is still slower than UTF-8 optimized functions. The standard functions certainly can be UTF-8-optimized, and they are in at least several implementations. I think glibc still has a pretty slow path to get to the UTF-8 decoding but hopefully that will be fixed eventually. I'll remind myself to pursue that in the future. > >For processing a full string buffer, musl is roughly twice as fast as > >uClibc, and uClibc is roughly twice as fast as glibc. > > You don't need to make ads for musl here, I' would like to see > prebuild versions of BB statically linked with musl. It's not an "ad". It's just pointing out that uClibc is probably not significantly slower for what you care about. My interpretation was that you trusted me that musl is fast here, but thought other more commonly used implementations might be slow, so I stated the relative speeds as a basis for evaluating that. > >Presumably you would use a full string operation here (mbstowcs with > >null output pointer) for computing length in characters. > > Do you remember my question only UTF-8 or full multi byte locale? It > is exactly this decision. The former may be optimized the later more > accurate. Yes I remember the question. Assuming the standard function has a fast path for UTF-8, which it should, the only reason to expect the standard multibyte functions to be significantly slower than your custom ones is that they detect illegal sequences rather than blindly assuming the input is valid. > >>If you know you are using UTF-8 you do not need to check every > >>string over and over again, else it's pure paranoia. It is robust, > >>as it will not run away on anything which is valid C string. > > > >Well if the string comes from a source outside of your control, you > >need to check it at least once. 
But you might not want to check and > >reject it at the original point of input, e.g. if you want to be able > >to preserve arbitrary byte sequences that might not be UTF-8, e.g. an > >argument that's a filename in an invalid encoding which you're trying > >to delete or rename to fix. So IMO it makes a lot more sense to do > >your checking at the point of treating the string as a sequence of > >characters, even if it happens multiple times. The cost is not high if > >your implementation is efficient. > > and how want you behave in case of invalid UTF-8 sequences? My > functions just skip over stray codes of 0x80..0xBF and synchronize > on next valid UTF-8 leading byte. How would you count invalid > sequences? In general, I would count the whole operation as a failure, returning some value such as -1 reserved for failure, since the string is not actually UTF-8 and thus "how many characters?" has no meaning. For specific uses, there might be other preferred behaviors. If your goal is display, you may want to simply replace illegal sequences with U+FFFD in which case you'd count each such sequence as "1", but if you're using this character-counting to allocate a buffer for the converted string, you need to be sure your conversion function and character-counting function agree on how illegal sequences are counted, or you might overflow your buffer or end up having to truncate the output. > >>>Of course it also gets tripped up badly on invalid sequences. > >> > >>How can it get tripped? It silently skip over invalid sequences (of > >>0x80 to 0xBF until next leading of a sequence). It shall not get > >>stuck in any way. Or tell me exactly how ... > > > >By itself it's not a problem, but the interaction with other code may > >be a problem if the other code does not follow exactly the same > >conventions. > > Sure, you can't mix multi byte functions with pure UTF-8 functions, > you always need to look what type of function you call in your code. > So what's different here. 
"Interaction with other code" was not about mixing your own pure UTF-8
functions with the standard C multibyte functions in
possibly-non-UTF-8 locales. It was about mixing them with other code
that's processing UTF-8 but handling errors differently. One such
example would be the standard C multibyte functions when
nl_langinfo(CODESET) has already been determined to be "UTF-8" (so you
know they're processing UTF-8), but "pure UTF-8" code outside of the
standard functions might also be handling errors differently from what
you're doing, and mixing it with your handling _could_ be dangerous,
depending on what you do.

> and the convention is just UTF-8 (even with invalid sequences)
> not a mixture with other multi byte codes. Not so much requirement
> of a convention?
>
> The functions have been designed carefully to be not trapped on
> invalid sequences. I know they look extremely simple, but this is part
> of the optimization.
>
> remember: We are not talking about the ability
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Rich, hi all,

looks like we agree on most topics and tend to reach the point of a
more philosophical discussion. There are only a few statements from
you I want to hop on:

> It's not an "ad".

Sorry Rich, forgot to add a smiley to the "ad" topic ;) ... I like your
musl approach, except for some very detailed decisions.

>> And how do you want to behave in case of invalid UTF-8 sequences? My
>> functions just skip over stray codes of 0x80..0xBF and synchronize
>> on the next valid UTF-8 leading byte. How would you count invalid
>> sequences?
>
> In general, I would count the whole operation as a failure, returning
> some value such as -1 reserved for failure, since the string is not
> actually UTF-8 and thus "how many characters?" has no meaning. For
> specific uses, there might be other preferred behaviors. If your goal
> is display, you may want to simply replace illegal sequences with
> U+FFFD, in which case you'd count each such sequence as "1", but if
> you're using this character-counting to allocate a buffer for the
> converted string, you need to be sure your conversion function and
> character-counting function agree on how illegal sequences are
> counted, or you might overflow your buffer or end up having to
> truncate the output.

Rich, will you ever use the result of counting the number of UTF-8
characters to allocate a buffer? I don't think so. That would be very
ill behavior. To allocate buffer space you need the number of bytes
occupied by a string, not the number of UTF-8 characters.

Besides this, I prefer having really fast (and robust) functions for
UTF-8, which may give a somewhat incorrect result if the input comes
from an error-prone source, but this result shall not break the
program, as long as they are used carefully. Otherwise you need to
check for errors after each function call, which slows down operations
additionally. I like it more this way: give the best result we can,
even if things are broken, but continue with normal operation ... and
at required points it may be necessary to call utf8test(), a function
to test for validity, which rejects invalid UTF-8 strings.

> "Interaction with other code" was not about mixing your own pure UTF-8
> functions with the standard C multibyte functions in
> possibly-non-UTF-8 locales. It was about mixing them with other code
> that's processing UTF-8 but handling errors differently. One such
> example would be the standard C multibyte functions when
> nl_langinfo(CODESET) has already been determined to be "UTF-8" (so you
> know they're processing UTF-8), but "pure UTF-8" code outside of the
> standard functions might also be handling errors differently from what
> you're doing, and mixing it with your handling _could_ be dangerous,
> depending on what you do.

My functions are designed to be fast and robust. With error-free UTF-8
they won't produce any errors. Otherwise they just try to give the best
result, even with invalid or damaged sequences. They just keep any
sequence they have been given, in the original order. If you mix those
functions with other functions, the only convention, besides not mixing
UTF-8 with other multi byte codes, is that those other functions have
to be robust too, that is, they must not get trapped by invalid
sequences.

So if you really like, you may freely mix my simplified UTF-8 functions
with multi byte based UTF-8 processing which checks every single
character to be valid. The only failure would be to use unchecked UTF-8
strings for operations which badly fail on invalid character sequences
(not so many that I know of).

So I can't see where it gets dangerous to mix my functions with others?

> I'm fine with assuming all data is nominally UTF-8. What's not fine is
> assuming that data which is nominally UTF-8 is actually valid UTF-8.

I never said my functions need valid UTF-8. They just assume all data
is either ASCII or nominally UTF-8, and try their best to operate on
invalid sequences (either skipping or not breaking those illegal
sequences).

...
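For illustration, a validity test of the kind mentioned above could look roughly like the following sketch. The name utf8_valid() and its exact rules are assumptions made here; this is not the utf8test() referred to in this thread and not code from libbb. It rejects stray continuation bytes, truncated sequences, overlong forms, surrogates and code points above U+10FFFF.

```c
#include <assert.h>

/* Hypothetical helper (not from libbb): returns 1 if the NUL-terminated
 * string s is well-formed UTF-8, 0 otherwise. */
int utf8_valid(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned long cp;
        int extra;                  /* continuation bytes still expected */

        if (*p < 0x80) {            /* plain ASCII */
            p++;
            continue;
        } else if (*p >= 0xC2 && *p <= 0xDF) {
            cp = *p & 0x1F; extra = 1;
        } else if (*p >= 0xE0 && *p <= 0xEF) {
            cp = *p & 0x0F; extra = 2;
        } else if (*p >= 0xF0 && *p <= 0xF4) {
            cp = *p & 0x07; extra = 3;
        } else {
            return 0;   /* stray 0x80..0xBF, 0xC0/0xC1, or 0xF5..0xFF */
        }
        p++;
        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return 0;           /* truncated (also stops at the NUL) */
            cp = (cp << 6) | (*p & 0x3F);
        }
        if ((extra == 2 && cp < 0x800) ||       /* overlong 3-byte form */
            (extra == 3 && cp < 0x10000) ||     /* overlong 4-byte form */
            (cp >= 0xD800 && cp <= 0xDFFF) ||   /* UTF-16 surrogates */
            cp > 0x10FFFF)
            return 0;
    }
    return 1;
}

/* Small self-check documenting the intended behaviour. */
void utf8_valid_demo(void)
{
    assert(utf8_valid("\xC3\xA9t\xC3\xA9"));    /* two-byte sequences */
    assert(!utf8_valid("\x80"));                /* stray continuation byte */
    assert(!utf8_valid("\xC0\xAF"));            /* overlong encoding of '/' */
    assert(!utf8_valid("\xE0\x80\xAF"));        /* overlong 3-byte form */
}
```

A check like this only needs to run at the points where broken input would really hurt; the fast functions can stay free of error handling.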
but this is all a philosophical discussion. BB started to use the full multi byte locale functions, which is the more accurate way and even adds support for other multi byte character sets (may be welcome for everybody who needs them), so BB shall stay on this ... at least until one day everybody asks what this whole non-UTF-8 stuff was about (whenever this will be).

My only conclusion is: BB shall disable all that Unicode/UTF-8 config stuff and always use the full locale support of the library, giving information on how to configure/install known libs in a doc file. Either BB runs in a full glibc environment, which has full multi byte locale functions, or BB may be linked with a different lib which works correctly at least for UTF-8 without any additional stuff to be installed. This would simplify configuration and code and would just rely on the usage of a correctly configured libc environment.

So the big question is: Is there anybody who still needs the BB internal Unicode handling and can't use the locale func
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Wed, Aug 13, 2014 at 7:06 PM, Harald Becker wrote:
>> The world seems to be standardizing on utf-8.
>>
>> Thank God, supporting gazillion of encodings is no fun.
>
> You say this, but libbb/unicode.c contains a unicode_strlen calling this
> complex mb to wc conversion function to count the number of characters.
> Those multi byte functions tend to be highly complex and slow (don't know if
> they have gone better). For just UTF-8, things can be optimized.

bbox does have a Unicode-only implementation of mbstowc. See unicode.c

> size_t utf8len( const char* s )
> {
>   size_t n = 0;
>   while (*s)
>     if ((*s++ ^ 0x40) < 0xC0)
>       n++;
>   return n;
> }
>
> size_t mystrlen( const char* s )
> {
>   return utf8_enabled ? utf8len(s) : strlen(s);
> }
>
> This looks more, but avoids inclusion of mb function. Most compiler shall
> produce fast code for utf8len.

There are situations where you need to do tons of unicode_strlen()
and you can tolerate getting wrong results on broken Unicode.
Then a function similar to yours can be very useful.

--
vda
Re: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Denys,

my new system gets to a usable state, so I'm able to do more than just
writing mails ...

>> 4) applet printf, string formats "%Ns"
>> Does this N mean character positions or bytes. The underlying C printf
>> used to work with bytes for decades. The man page talks about character
>> positions, but printf from bash uses bytes.
>
> Also, printf needs to support \u

Would you like a patch for BB_process_escape_sequence to add Unicode
sequences \u?

--
Harald
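To make the \u question a bit more concrete: once the hex digits after \u have been parsed into a code point, the remaining work is emitting that code point as UTF-8. The sketch below shows only that step. utf8_encode() is a hypothetical name chosen here; it is not the busybox escape-sequence code and says nothing about how a real patch would hook into it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes code point cp as UTF-8 into out (room for
 * at least 4 bytes) and returns the number of bytes written, or 0 if cp
 * is a UTF-16 surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, char *out)
{
    unsigned char *o = (unsigned char *)out;

    if (cp < 0x80) {
        o[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        o[0] = (unsigned char)(0xC0 | (cp >> 6));
        o[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {
        o[0] = (unsigned char)(0xE0 | (cp >> 12));
        o[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        o[0] = (unsigned char)(0xF0 | (cp >> 18));
        o[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        o[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* outside the Unicode range */
}

/* Small self-check documenting the intended behaviour. */
void utf8_encode_demo(void)
{
    char buf[4];

    /* \u20AC (EURO SIGN) becomes the three bytes E2 82 AC. */
    assert(utf8_encode(0x20AC, buf) == 3);
    assert(memcmp(buf, "\xE2\x82\xAC", 3) == 0);
    assert(utf8_encode(0xD800, buf) == 0);
}
```

Whether an escape that names a surrogate or an out-of-range value should be dropped, printed literally or reported as an error is a separate decision that this sketch leaves to the caller.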
Re: Possible Unicode Problems in Busybox - Collect and Discussion
In bionic libc (android), wchar functions are stubs, mb functions also,
the MAX_MB_LEN is set to 1. For wide chars, all chars are copied in the
first byte of uint32_t wchar_t.

So, all mbtowc, wctomb and mbstombs functions are useless.

But the tools "vim" and "lv" are working perfectly with their own
conversion functions, though these tools are big (1MB on arm; bb is only
550K in comparison, 750K statically linked to the libc/libm/libselinux).

references:
https://github.com/tpruvot/android_external_vim
https://github.com/tpruvot/android_external_lv
https://github.com/tpruvot/android_external_busybox

2014-08-15 13:32 GMT+02:00 Harald Becker :
> Hi Denys,
>
> my new system gets to a usable state, so I'm able to do more then just
> writing mails ...
>
>>> 4) applet printf, string formats "%Ns"
>>> Does this N mean character positions or bytes. The underlying C printf
>>> used to work with bytes for decades. The man page talks about character
>>> positions, but printf from bash uses bytes.
>>
>> Also, printf needs to support \u
>
> Do you like getting a patch for BB_process_escape_sequence to add Unicode
> sequences \u?
>
> --
> Harald
Re: Possible Unicode Problems in Busybox - Collect and Discussion
On Fri, Aug 15, 2014 at 12:31:15AM +0200, Harald Becker wrote:
> >> And how do you want to behave in case of invalid UTF-8 sequences? My
> >>functions just skip over stray codes of 0x80..0xBF and synchronize
> >>on the next valid UTF-8 leading byte. How would you count invalid
> >>sequences?
> >
> >In general, I would count the whole operation as a failure, returning
> >some value such as -1 reserved for failure, since the string is not
> >actually UTF-8 and thus "how many characters?" has no meaning. For
> >specific uses, there might be other preferred behaviors. If your goal
> >is display, you may want to simply replace illegal sequences with
> >U+FFFD in which case you'd count each such sequence as "1", but if
> >you're using this character-counting to allocate a buffer for the
> >converted string, you need to be sure your conversion function and
> >character-counting function agree on how illegal sequences are
> >counted, or you might overflow your buffer or end up having to
> >truncate the output.
>
> Rich, will you ever use the result of counting the numbers of UTF-8
> characters to allocate a buffer? I don't think so. That would be
> very ill behavior. To allocate buffer space you need the number of
> bytes occupied by a string, not the number of UTF-8 characters.

If your intent is to convert the string to UTF-32/wchar_t/whatever,
then yes, you use the result for allocating a buffer. In my mind
that's the main point of counting characters (since otherwise you
usually care about either bytes, for storage, or columns, for
presentation), and while I personally consider it better to work
character-at-a-time and keep the string as UTF-8, some APIs require a
string in a different format, especially ones that work with a whole
string and prepare it for visual presentation.

The main other place counting characters makes sense is for
implementing languages that do substring operations with character
indexes, which I think is the one you care about.
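The buffer-sizing concern can be sketched as follows. One way to guarantee that counting and converting agree on invalid input is to route both through the same single-step decoder. All names here (decode_one, utf8_count, utf8_to_utf32) are hypothetical, and the policy chosen (every offending byte becomes one U+FFFD) is just one of the possible conventions discussed above. To stay short, the decoder does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Decodes one code point at p (which must not point at the NUL), stores
 * it in *cp and returns the number of bytes consumed. A byte that does
 * not start a complete sequence is consumed alone and yields U+FFFD. */
static size_t decode_one(const unsigned char *p, uint32_t *cp)
{
    size_t len = *p < 0x80 ? 1 : (*p & 0xE0) == 0xC0 ? 2 :
                 (*p & 0xF0) == 0xE0 ? 3 : (*p & 0xF8) == 0xF0 ? 4 : 0;
    uint32_t c;

    if (len == 0) { *cp = 0xFFFD; return 1; }   /* stray byte */
    if (len == 1) { *cp = *p; return 1; }       /* ASCII */
    c = *p & (0xFFu >> (len + 1));              /* payload bits of lead byte */
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Number of code points the converter below will produce for s. */
size_t utf8_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    uint32_t cp;

    while (*p) { p += decode_one(p, &cp); n++; }
    return n;
}

/* Converts s to a zero-terminated, malloc'ed UTF-32 array; the buffer is
 * sized from utf8_count(), which uses the same decoder, so it always fits. */
uint32_t *utf8_to_utf32(const char *s, size_t *out_len)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = utf8_count(s), i = 0;
    uint32_t *buf = malloc((n + 1) * sizeof *buf);

    if (!buf) return NULL;
    while (*p) p += decode_one(p, &buf[i++]);
    buf[i] = 0;
    *out_len = n;
    return buf;
}

/* Small self-check: 'a', U+00E9, one stray byte, 'z'. */
void utf8_to_utf32_demo(void)
{
    size_t n;
    uint32_t *w = utf8_to_utf32("a\xC3\xA9\x80z", &n);

    assert(w && n == 4);
    assert(w[1] == 0xE9 && w[2] == 0xFFFD && w[4] == 0);
    free(w);
}
```

If the counter skipped stray bytes while the converter emitted U+FFFD for them, the count would come out too small and the conversion loop would write past the end of the allocation, which is the overflow scenario described above.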
> So the big question is: Is there anybody who still needs the BB
> internal Unicode handling and can't use the locale functions of a
> libc. Why and for what purpose is this needed? In which environment?

I think the intent was to let uClibc users (and possibly eglibc
users?) omit locale support from the libc, which reduces libc size
quite a bit, and use the UTF-8 code in busybox instead.

> As far as I know, the beginning of those BB internal functions
> was at times when only glibc had locale support and there were
> no alternatives for small environments. But things changed and there
> are now alternatives. So have we reached a point where we are able
> to simplify things in BB (which means to focus on correct mb
> function usage everywhere and to strip unnecessary decisions,
> configs and helper code)?

I wouldn't object to this change.

Rich
Re: AW: Possible Unicode Problems in Busybox - Collect and Discussion
Hi Dietmar!

On 14.08.2014 08:21, dietmar.schind...@manroland-web.com wrote:
> These do not work if char is signed.

You are right, I missed the type casts ... sorry

size_t utf8len( const char* s )
{
  size_t n = 0;
  while (*s)
    if ((unsigned char)(*s++ ^ 0x40) < (unsigned char)0xC0)
      n++;
  return n;
}

char *utf8skip( char const* s, size_t n )
{
  for ( ; n && *s; --n )
    while ((unsigned char)(*++s ^ 0x40) >= (unsigned char)0xC0);
  return (char*)s;
}

I know, most would prefer to use (unsigned char) ahead of *++s or *s++,
but at least gcc gave better optimized x86 code for my type casts.

--
Harald
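To see what the corrected functions return, here is a small sketch that repeats them unchanged in logic and adds a few checks. The test strings and the utf8len_demo() wrapper are additions made here, not part of the original mail. The XOR with 0x40 maps continuation bytes 0x80..0xBF to 0xC0..0xFF and everything else below 0xC0, so a single unsigned comparison separates the two groups.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts every byte that is not a continuation byte (0x80..0xBF). */
size_t utf8len(const char *s)
{
    size_t n = 0;
    while (*s)
        if ((unsigned char)(*s++ ^ 0x40) < (unsigned char)0xC0)
            n++;
    return n;
}

/* Advances past up to n characters, stopping early at the NUL. */
char *utf8skip(const char *s, size_t n)
{
    for ( ; n && *s; --n)
        while ((unsigned char)(*++s ^ 0x40) >= (unsigned char)0xC0)
            ;   /* step over continuation bytes */
    return (char *)s;
}

/* Small self-check with hand-picked inputs. */
void utf8len_demo(void)
{
    const char *s = "h\xC3\xA9llo";     /* 6 bytes, 5 characters */

    assert(strlen(s) == 6 && utf8len(s) == 5);
    assert(strcmp(utf8skip(s, 2), "llo") == 0);

    /* Invalid input: stray continuation bytes are not counted at all,
     * while a lone lead byte still counts as one character. */
    assert(utf8len("\x80\x80" "a") == 1);
    assert(utf8len("\xC3") == 1);
}
```

The last two checks show the "best effort" behaviour on broken input discussed earlier in the thread: nothing gets stuck or reads past the terminator, but the count is not what a strict decoder would report.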