Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-15 Thread Denys Vlasenko
On Wed, Aug 13, 2014 at 7:06 PM, Harald Becker ra...@gmx.de wrote:
 The world seems to be standardizing on utf-8.

 Thank God, supporting gazillion of encodings is no fun.


 You say this, but libbb/unicode.c contains a unicode_strlen calling this
 complex mb to wc conversion function to count the number of characters.
 Those multi byte functions tend to be highly complex and slow (don't know if
 they have gone better). For just UTF-8, things can be optimized.

bbox does have unicode-only implementation of mbstowc.
See unicode.c

 size_t utf8len( const char* s )
 {
   size_t n = 0;
   while (*s)
     if ((*s++ ^ 0x40) < 0xC0)
       n++;
   return n;
 }

 size_t mystrlen( const char* s )
 {
   return utf8_enabled ? utf8len(s) : strlen(s);
 }

 This looks more, but avoids inclusion of mb function. Most compiler shall
 produce fast code for utf8len.

There are situations where you need to do tons of unicode_strlen()
and you can tolerate getting wrong results on broken Unicode.
Then a function similar to yours can be very useful.

-- 
vda
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-15 Thread Harald Becker

Hi Denys,

my new system is getting to a usable state, so I'm able to do more than just 
write mails ...


 4) applet printf, string formats %Ns

Does this N mean character positions or bytes. The underlying C printf used
to work with bytes for decades. The man page talks about character
positions, but printf from bash uses bytes.


Also, printf needs to support \u


Would you like a patch for BB_process_escape_sequence to add 
Unicode sequences \u?


--
Harald



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-15 Thread Tanguy Pruvot
In bionic libc (android), wchar functions are stubs, mb functions also, the
MAX_MB_LEN is set to 1.

For wide chars, all chars are copied in the first byte of uint32_t wchar_t.
So, all mbtowc, wctomb and mbstombs functions are useless.

But the tools vim and lv work perfectly with their own
conversion functions, although these tools are big (1MB on arm; bb is only 550K
in comparison, 750K statically linked to the libc/libm/libselinux).

references :
https://github.com/tpruvot/android_external_vim
https://github.com/tpruvot/android_external_lv
https://github.com/tpruvot/android_external_busybox



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-15 Thread Rich Felker
On Fri, Aug 15, 2014 at 12:31:15AM +0200, Harald Becker wrote:
   and how want you behave in case of invalid UTF-8 sequences? My
 functions just skip over stray codes of 0x80..0xBF and synchronize
 on next valid UTF-8 leading byte. How would you count invalid
 sequences?
 
 In general, I would count the whole operation as a failure, returning
 some value such as -1 reserved for failure, since the string is not
 actually UTF-8 and thus "how many characters?" has no meaning. For
 specific uses, there might be other preferred behaviors. If your goal
 is display, you may want to simply replace illegal sequences with
 U+FFFD in which case you'd count each such sequence as 1, but if
 you're using this character-counting to allocate a buffer for the
 converted string, you need to be sure your conversion function and
 character-counting function agree on how illegal sequences are
 counted, or you might overflow your buffer or end up having to
 truncate the output.
 
 Rich, will you ever use the result of counting the numbers of UTF-8
 characters to allocate a buffer? I don't think so. That would be
 very ill behavior. To allocate buffer space you need the number of
 bytes occupied by a string, not the number of UTF-8 characters.

If your intent is to convert the string to UTF-32/wchar_t/whatever,
then yes, you use the result for allocating a buffer. In my mind
that's the main point of counting characters (since otherwise you
usually care about either bytes, for storage, or columns, for
presentation), and while I personally consider it better to work
character-at-a-time and keep the string as UTF-8, some APIs require a
string in a different format, especially ones that work with a whole
string and prepare it for visual presentation.

The main other place counting characters makes sense is for
implementing languages that do substring operations with character
indexes, which I think is the one you care about.

 So the big question is: Is there anybody who still needs the BB
 internal Unicode handling and can't use the locale functions of a
 libc. Why and for what purpose is this needed? In which environment?

I think the intent was to let uClibc users (and possibly eglibc
users?) omit locale support from the libc, which reduces libc size
quite a bit, and use the UTF-8 code in busybox instead.

 As far as I know, those BB internal functions date from
 times when only glibc had locale support and there were
 no alternatives for small environments. But things changed and there
 are now alternatives. So have we reached a point, where we are able
 to simplify things in BB (which means to focus on correct mb
 function usage everywhere and to strip unnecessary decisions,
 configs and helper code)?

I wouldn't object to this change.

Rich
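
The count-then-allocate pattern described in this message can be sketched as follows. This is only an illustration, not code from BusyBox: the helper name mb_to_wide is hypothetical, and it relies on the POSIX behavior of mbstowcs() returning the required length when the destination pointer is NULL. Because both calls go through the same conversion routine, the count and the conversion cannot disagree about illegal sequences.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (in the current
 * locale's encoding) to a newly allocated wide string.
 * Returns NULL if the input contains an illegal sequence or if
 * allocation fails. The caller frees the result. */
wchar_t *mb_to_wide(const char *s, size_t *out_len)
{
    /* First pass: count characters only (POSIX: NULL destination). */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    /* Allocate by character count, plus one for the terminator. */
    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second pass: the same function does the actual conversion. */
    if (mbstowcs(w, s, n + 1) == (size_t)-1) {
        free(w);
        return NULL;
    }
    if (out_len != NULL)
        *out_len = n;
    return w;
}
```

For non-ASCII input the caller would first select a UTF-8 locale with setlocale(LC_CTYPE, ""); in the default "C" locale only plain ASCII is guaranteed to convert.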


AW: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread dietmar.schindler
 From: Harald Becker
 Sent: Wednesday, 13 August 2014 19:07
 ...
 size_t utf8len( const char* s )
 {
   size_t n = 0;
   while (*s)
     if ((*s++ ^ 0x40) < 0xC0)
       n++;
   return n;
 }
 ...
 char *utf8skip( char const* s, size_t n )
 {
   for ( ; n && *s; --n )
     while ((*++s ^ 0x40) >= 0xC0);
   return (char*)s;
 }

These do not work if char is signed.
--
Regards,
Dietmar Schindler



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Rich Felker
On Wed, Aug 13, 2014 at 07:06:38PM +0200, Harald Becker wrote:
 Hi Denys !
 
  The world seems to be standardizing on utf-8.
 Thank God, supporting gazillion of encodings is no fun.
 
 You say this, but libbb/unicode.c contains a unicode_strlen calling
 this complex mb to wc conversion function to count the number of
 characters. Those multi byte functions tend to be highly complex and
 slow (don't know if they have gone better). For just UTF-8, things
 can be optimized.

This depends on your libc. In musl, the only thing slow about them is
having to account for some idiotic special-cases the standard allows
(special meanings for null pointers in each of the arguments) and even
that should not be slow on machines with proper branch prediction.

 e.g.
 
 size_t utf8len( const char* s )
 {
   size_t n = 0;
   while (*s)
     if ((*s++ ^ 0x40) < 0xC0)
       n++;
   return n;
 }

This function is only valid if the string is known to be valid UTF-8.
Otherwise it hides errors, which may or may not be problematic
depending on what you're using it for.

 Another fast function I use for UTF-8 ... skip to Nth UTF-8
 character in a string (returns a pointer to trailing \0 if N >
 number of UTF-8 chars in string):
 
 char *utf8skip( char const* s, size_t n )
 {
   for ( ; n && *s; --n )
     while ((*++s ^ 0x40) >= 0xC0);
   return (char*)s;
 }

This code is invalid; it's assuming char is unsigned. In practice,
*++s ^ 0x40 is going to be negative on most archs. Better would be
doing an unsigned range check like (unsigned char)*++s-0x80<0x40U.

Of course it also gets tripped up badly on invalid sequences.

Rich
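
A minimal sketch of the unsigned range check suggested in this message, applied to both helpers. The names utf8_count and utf8_skip are hypothetical, and like the originals the code assumes a NUL-terminated string and does not detect malformed sequences.

```c
#include <stddef.h>

/* True if c is a UTF-8 continuation byte (0x80..0xBF). Converting to
 * unsigned char first makes the test independent of whether plain
 * char is signed on the target. */
static int is_continuation(char c)
{
    return (unsigned)((unsigned char)c - 0x80) < 0x40U;
}

/* Count characters by counting every byte that is not a
 * continuation byte. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (!is_continuation(*s))
            n++;
    return n;
}

/* Advance past up to n characters; stops at the terminating NUL if
 * the string has fewer than n characters. */
const char *utf8_skip(const char *s, size_t n)
{
    while (n > 0 && *s) {
        s++;                        /* step over the leading byte */
        while (is_continuation(*s)) /* then over its continuation bytes */
            s++;
        n--;
    }
    return s;
}
```

The terminating NUL is not a continuation byte, so the inner loop in utf8_skip cannot run past the end of the string.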


Re: AW: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Harald Becker

Hi Dietmar!

On 14.08.2014 08:21, dietmar.schind...@manroland-web.com wrote:
 These do not work if char is signed.
You are right, I missed the type casts ... sorry


size_t utf8len( const char* s )
{
   size_t n = 0;
   while (*s)
     if ((unsigned char)(*s++ ^ 0x40) < (unsigned char)0xC0)
       n++;
   return n;
}

char *utf8skip( char const* s, size_t n )
{
   for ( ; n && *s; --n )
     while ((unsigned char)(*++s ^ 0x40) >= (unsigned char)0xC0);
   return (char*)s;
}


I know, most would prefer to use (unsigned char) ahead of *++s or *s++, 
but at least gcc gave better optimized x86 code for my type casts.


--
Harald


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Harald Becker

Hi Rich!

 You say this, but libbb/unicode.c contains a unicode_strlen calling

this complex mb to wc conversion function to count the number of
characters. Those multi byte functions tend to be highly complex and
slow (don't know if they have gone better). For just UTF-8, things
can be optimized.


This depends on your libc.


... that is why I added "don't know if gone better" ... really good 
when musl is fast on this ... the problem is BB is more likely linked 
with glibc or uClibc ... there the results are not so great :(



size_t utf8len( const char* s )
{
   size_t n = 0;
   while (*s)
     if ((*s++ ^ 0x40) < 0xC0)
       n++;
   return n;
}


This function is only valid if the string is known to be valid UTF-8.


Yes, I said it's for UTF-8.


Otherwise it hides errors, which may or may not be problematic
depending on what you're using it for.


If you know you are using UTF-8 you do not need to check every string 
over and over again, else it's pure paranoia. It is robust, as it will 
not run away on anything which is valid C string.



Another fast function I use for UTF-8 ... skip to Nth UTF-8
character in a string (returns a pointer to trailing \0 if N >
number of UTF-8 chars in string):

char *utf8skip( char const* s, size_t n )
{
   for ( ; n && *s; --n )
     while ((*++s ^ 0x40) >= 0xC0);
   return (char*)s;
}


This code is invalid; it's assuming char is unsigned. In practice,
*++s ^ 0x40 is going to be negative on most archs. Better would be
doing an unsigned range check like (unsigned char)*++s-0x80<0x40U.


Yes, I missed the type cast ... sorry, for this, see previous mail


Of course it also gets tripped up badly on invalid sequences.


How can it get tripped up? It silently skips over invalid sequences (of 0x80 
to 0xBF until the next leading byte of a sequence). It will not get stuck in any 
way. Or tell me exactly how ...


--
Harald




Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Tanguy Pruvot
size_t utf8len( const char* s )
{
   size_t n = 0;
   while (*s)
     if ((*s++ ^ 0x40) < 0xC0)
       n++;
   return n;
}

you need to test s != NULL, else *s will crash



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Rich Felker
On Thu, Aug 14, 2014 at 07:16:36PM +0200, Tanguy Pruvot wrote:
 size_t utf8len( const char* s )
 {
   size_t n = 0;
   while (*s)
     if ((*s++ ^ 0x40) < 0xC0)
       n++;
   return n;
 }
 
 you need to test s != NULL, else *s will crash

Says who? NULL is not a valid pointer. Should you also check for
things like s != (char *)-1 ? What value would you return then,
anyway?

Rich


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Rich Felker
On Thu, Aug 14, 2014 at 07:14:52PM +0200, Harald Becker wrote:
 Hi Rich!
 
  You say this, but libbb/unicode.c contains a unicode_strlen calling
 this complex mb to wc conversion function to count the number of
 characters. Those multi byte functions tend to be highly complex and
 slow (don't know if they have gone better). For just UTF-8, things
 can be optimized.
 
 This depends on your libc.
 
  that is, why I added don't know if gone better ... really
 good when musl is fast on this ... the problem is BB is more likely
 linked with glibc or uClibc ... there the results are not so great
 :(

I think uClibc is pretty fast at this too. It's glibc that's horribly
slow. Rough comparison:

For processing a full string buffer, musl is roughly twice as fast as
uClibc, and uClibc is roughly twice as fast as glibc.

For byte-by-byte processing: musl is roughly 3x as fast as uClibc and
roughly 4x as fast as glibc.

Source: my comparison at http://www.etalabs.net/compare_libcs.html

Presumably you would use a full string operation here (mbstowcs with
null output pointer) for computing length in characters.

 size_t utf8len( const char* s )
 {
   size_t n = 0;
   while (*s)
     if ((*s++ ^ 0x40) < 0xC0)
       n++;
   return n;
 }
 
 This function is only valid if the string is known to be valid UTF-8.
 
 Yes, I told it's for UTF-8.

Yes, but there's a difference between nominally UTF-8 and
known-valid UTF-8.

 Otherwise it hides errors, which may or may not be problematic
 depending on what you're using it for.
 
 If you know you are using UTF-8 you do not need to check every
 string over and over again, else it's pure paranoia. It is robust,
 as it will not run away on anything which is valid C string.

Well if the string comes from a source outside of your control, you
need to check it at least once. But you might not want to check and
reject it at the original point of input, e.g. if you want to be able
to preserve arbitrary byte sequences that might not be UTF-8, e.g. an
argument that's a filename in an invalid encoding which you're trying
to delete or rename to fix. So IMO it makes a lot more sense to do
your checking at the point of treating the string as a sequence of
characters, even if it happens multiple times. The cost is not high if
your implementation is efficient.

 Of course it also gets tripped up badly on invalid sequences.
 
 How can it get tripped? It silently skip over invalid sequences (of
 0x80 to 0xBF until next leading of a sequence). It shall not get
 stuck in any way. Or tell me exactly how ...

By itself it's not a problem, but the interaction with other code may
be a problem if the other code does not follow exactly the same
conventions.

Rich


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Harald Becker

Hi !

On 14.08.2014 19:16, Tanguy Pruvot wrote:


you need to test s != NULL, else *s will crash


It is like other str functions of the libc, you need to call the 
function with a valid pointer.


... else, if you like, add: if (!s) return 0; ahead of the while

--
Harald



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Harald Becker

Hi Rich!

 I think uClibc is pretty fast at this too. It's glibc that's horribly

slow. Rough comparison:


"Pretty fast" is still slower than UTF-8 optimized functions.


For processing a full string buffer, musl is roughly twice as fast as
uClibc, and uClibc is roughly twice as fast as glibc.


You don't need to make ads for musl here; I would like to see prebuilt 
versions of BB statically linked with musl.



Presumably you would use a full string operation here (mbstowcs with
null output pointer) for computing length in characters.


Do you remember my question: only UTF-8 or full multi byte locale? It is 
exactly this decision. The former may be optimized, the latter is more accurate.




If you know you are using UTF-8 you do not need to check every
string over and over again, else it's pure paranoia. It is robust,
as it will not run away on anything which is valid C string.


Well if the string comes from a source outside of your control, you
need to check it at least once. But you might not want to check and
reject it at the original point of input, e.g. if you want to be able
to preserve arbitrary byte sequences that might not be UTF-8, e.g. an
argument that's a filename in an invalid encoding which you're trying
to delete or rename to fix. So IMO it makes a lot more sense to do
your checking at the point of treating the string as a sequence of
characters, even if it happens multiple times. The cost is not high if
your implementation is efficient.


... and how do you want to behave in case of invalid UTF-8 sequences? My 
functions just skip over stray codes of 0x80..0xBF and synchronize on 
the next valid UTF-8 leading byte. How would you count invalid sequences?




Of course it also gets tripped up badly on invalid sequences.


How can it get tripped? It silently skip over invalid sequences (of
0x80 to 0xBF until next leading of a sequence). It shall not get
stuck in any way. Or tell me exactly how ...


By itself it's not a problem, but the interaction with other code may
be a problem if the other code does not follow exactly the same
conventions.


Sure, you can't mix multi byte functions with pure UTF-8 functions, you 
always need to look what type of function you call in your code. So 
what's different here.


... and the convention is just UTF-8 (even with invalid sequences) not a 
mixture with other multi byte codes. Not so much requirement of a 
convention?


The functions have been designed carefully not to be trapped by invalid 
sequences. I know they look extremely simple, but this is part of the 
optimization.


... remember: We are not talking about the ability to work with other 
multi byte locales. The assumption was pure ASCII or UTF-8.


--
Harald



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Rich Felker
On Thu, Aug 14, 2014 at 09:09:02PM +0200, Harald Becker wrote:
 Hi Rich!
 
  I think uClibc is pretty fast at this too. It's glibc that's horribly
 slow. Rough comparison:
 
 Pretty fast is still slower than UTF-8 optimized functions.

The standard functions certainly can be UTF-8-optimized, and they are
in at least several implementations. I think glibc still has a pretty
slow path to get to the UTF-8 decoding but hopefully that will be
fixed eventually. I'll remind myself to pursue that in the future.

 For processing a full string buffer, musl is roughly twice as fast as
 uClibc, and uClibc is roughly twice as fast as glibc.
 
 You don't need to make ads for musl here, I' would like to see
 prebuild versions of BB statically linked with musl.

It's not an ad. It's just pointing out that uClibc is probably not
significantly slower for what you care about. My interpretation was
that you trusted me that musl is fast here, but thought other more
commonly used implementations might be slow, so I stated the relative
speeds as a basis for evaluating that.

 Presumably you would use a full string operation here (mbstowcs with
 null output pointer) for computing length in characters.
 
 Do you remember my question only UTF-8 or full multi byte locale? It
 is exactly this decision. The former may be optimized the later more
 accurate.

Yes I remember the question. Assuming the standard function has a fast
path for UTF-8, which it should, the only reason to expect the
standard multibyte functions to be significantly slower than your
custom ones is that they detect illegal sequences rather than blindly
assuming the input is valid.

 If you know you are using UTF-8 you do not need to check every
 string over and over again, else it's pure paranoia. It is robust,
 as it will not run away on anything which is valid C string.
 
 Well if the string comes from a source outside of your control, you
 need to check it at least once. But you might not want to check and
 reject it at the original point of input, e.g. if you want to be able
 to preserve arbitrary byte sequences that might not be UTF-8, e.g. an
 argument that's a filename in an invalid encoding which you're trying
 to delete or rename to fix. So IMO it makes a lot more sense to do
 your checking at the point of treating the string as a sequence of
 characters, even if it happens multiple times. The cost is not high if
 your implementation is efficient.
 
  and how want you behave in case of invalid UTF-8 sequences? My
 functions just skip over stray codes of 0x80..0xBF and synchronize
 on next valid UTF-8 leading byte. How would you count invalid
 sequences?

In general, I would count the whole operation as a failure, returning
some value such as -1 reserved for failure, since the string is not
actually UTF-8 and thus "how many characters?" has no meaning. For
specific uses, there might be other preferred behaviors. If your goal
is display, you may want to simply replace illegal sequences with
U+FFFD in which case you'd count each such sequence as 1, but if
you're using this character-counting to allocate a buffer for the
converted string, you need to be sure your conversion function and
character-counting function agree on how illegal sequences are
counted, or you might overflow your buffer or end up having to
truncate the output.
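
One way to realize the fail-on-invalid-input policy described above is sketched here. The name utf8_count_strict and the choice of (size_t)-1 as the failure value are assumptions for illustration only. The check rejects stray continuation bytes, truncated sequences, overlong forms, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical strict counter: returns the number of characters in s,
 * or (size_t)-1 if s is not valid UTF-8. */
size_t utf8_count_strict(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p) {
        unsigned long cp;
        int extra;

        if (*p < 0x80) {
            cp = *p; extra = 0;          /* ASCII */
        } else if (*p < 0xC2) {
            return (size_t)-1;           /* stray continuation or overlong lead */
        } else if (*p < 0xE0) {
            cp = *p & 0x1F; extra = 1;   /* 2-byte sequence */
        } else if (*p < 0xF0) {
            cp = *p & 0x0F; extra = 2;   /* 3-byte sequence */
        } else if (*p < 0xF5) {
            cp = *p & 0x07; extra = 3;   /* 4-byte sequence */
        } else {
            return (size_t)-1;           /* 0xF5..0xFF never appear in UTF-8 */
        }
        p++;

        for (int i = 0; i < extra; i++, p++) {
            /* A NUL here also fails the test, so truncated
             * sequences are caught without reading past the end. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if ((extra == 2 && cp < 0x800) ||       /* overlong 3-byte form */
            (extra == 3 && cp < 0x10000) ||     /* overlong 4-byte form */
            cp > 0x10FFFF ||                    /* beyond Unicode range */
            (cp >= 0xD800 && cp <= 0xDFFF))     /* UTF-16 surrogate */
            return (size_t)-1;

        n++;
    }
    return n;
}
```

A caller that prefers the display-oriented policy would replace each early return with its own substitution rule, but the counting and converting code then have to share exactly that rule.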

 Of course it also gets tripped up badly on invalid sequences.
 
 How can it get tripped? It silently skip over invalid sequences (of
 0x80 to 0xBF until next leading of a sequence). It shall not get
 stuck in any way. Or tell me exactly how ...
 
 By itself it's not a problem, but the interaction with other code may
 be a problem if the other code does not follow exactly the same
 conventions.
 
 Sure, you can't mix multi byte functions with pure UTF-8 functions,
 you always need to look what type of function you call in your code.
 So what's different here.

"Interaction with other code" was not about mixing your own pure UTF-8
functions with the standard C multibyte functions in
possibly-non-UTF-8 locales. It was about mixing them with other code
that's processing UTF-8 but handling errors differently. One such
example would be the standard C multibyte functions when
nl_langinfo(CODESET) has already been determined to be UTF-8 (so you
know they're processing UTF-8), but pure UTF-8 code outside of the
standard functions might also be handling errors differently from what
you're doing, and mixing it with your handling _could_ be dangerous,
depending on what you do.

  and the convention is just UTF-8 (even with invalid sequences)
 not a mixture with other multi byte codes. Not so much requirement
 of a convention?
 
 The functions have bean designed carefully to be not trapped on
 invalid sequences. I know they look extreme simple, but this is part
 of the optimization.
 
  remember: We are not talking about the ability to work with
 other multi byte locales. The assumption was pure ASCII or UTF-8.

I'm fine with assuming 

Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-14 Thread Harald Becker

Hi Rich,
hi all,

looks like we agree on most topics and tend to reach the point of a more 
philosophical discussion. There are only a few statements from you I 
want to pick up on:


 It's not an ad.

Sorry Rich, forgot to add a smiley to the ad topic ;) ... I like your 
musl approach, except some very detailed decisions.


  and how want you behave in case of invalid UTF-8 sequences? My

functions just skip over stray codes of 0x80..0xBF and synchronize
on next valid UTF-8 leading byte. How would you count invalid
sequences?


In general, I would count the whole operation as a failure, returning
some value such as -1 reserved for failure, since the string is not
actually UTF-8 and thus "how many characters?" has no meaning. For
specific uses, there might be other preferred behaviors. If your goal
is display, you may want to simply replace illegal sequences with
U+FFFD in which case you'd count each such sequence as 1, but if
you're using this character-counting to allocate a buffer for the
converted string, you need to be sure your conversion function and
character-counting function agree on how illegal sequences are
counted, or you might overflow your buffer or end up having to
truncate the output.


Rich, will you ever use the result of counting the numbers of UTF-8 
characters to allocate a buffer? I don't think so. That would be very 
ill behavior. To allocate buffer space you need the number of bytes 
occupied by a string, not the number of UTF-8 characters.


Besides this, I prefer having really fast (and robust) functions for 
UTF-8, which may give a somewhat incorrect result if input comes from an 
error-prone source, but this result shall not break the program, as long 
as they are used carefully. Otherwise you need to check for errors after 
each function call, which slows down operations additionally.


I prefer to give the best result we can, even if things are 
broken, and continue with normal operation ... and at required points 
it may be necessary to call utf8test(), a function to test for validity, 
which rejects invalid UTF-8 strings.



Interaction with other code was not about mixing your own pure UTF-8
functions with the standard C multibyte functions in
possibly-non-UTF-8 locales. It was about mixing them with other code
that's processing UTF-8 but handling errors differently. One such
example would be the standard C multibyte functions when
nl_langinfo(CODESET) has already been determined to be UTF-8 (so you
know they're processing UTF-8), but pure UTF-8 code outside of the
standard functions might also be handling errors differently from what
you're doing, and mixing it with your handling _could_ be dangerous,
depending on what you do.


My functions are designed to be fast and robust. With error-free UTF-8 
they won't produce any errors; otherwise they just try to give the best 
result, even with invalid or damaged sequences. They keep any 
sequence they have been given, in the original order. If you mix those 
functions with other functions, the only convention besides not mixing 
UTF-8 with other multi byte codes is that those other functions have to be 
robust too, that is, they must not be trapped by invalid sequences. 
So if you really like, you may freely mix my simplified UTF-8 functions 
with multi byte based UTF-8 processing which checks every single 
character to be valid. The only failure would be to use unchecked UTF-8 
strings for operations which badly fail for invalid character sequences 
(not many that I know of). So I can't see where it gets dangerous to mix 
my functions with others?



I'm fine with assuming all data is nominally UTF-8. What's not fine is
assuming that data which is nominally UTF-8 is actually valid UTF-8.


I never said my functions need valid UTF-8. They just assume all data is 
either ASCII or nominally UTF-8, and try their best to operate on 
invalid sequences (either skipping or not breaking those illegal sequences).


... but this is all a philosophical discussion. BB started to use full 
multi byte locale functions, which is the more accurate way and even 
adds support for other multi byte character sets (may be welcome for 
everybody who needs them), so BB shall stay on this ... at least until 
one day everybody asks what this whole non-UTF-8 stuff was about 
(whenever this will be).



My only conclusion is that BB shall disable all that Unicode/UTF-8 config stuff 
and always use the full locale support of the library, giving information on 
how to configure/install known libs in a doc file. Either BB runs in a 
full glibc environment, which has full multi byte locale functions, or 
BB may be linked with a different lib which works correctly at least for 
UTF-8 without any additional stuff to be installed. This would 
simplify configuration and code and will just rely on usage of a correctly 
configured libc environment.


So the big question is: Is there anybody who still needs the BB internal 
Unicode handling and can't use the locale functions of a libc. 

Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi All !

I start this thread to collect and discuss the possible Unicode (UTF-8) 
problems we detected and which may need further investigation:



1) sed s/./x/ the dot matches bytes, not characters
This at least hits uClibc builds; glibc seems to work correctly with a full 
set of locale files.


This bug may also affect other applets using regular expressions.


2) shell substitution ${#var}
Shall this length operation give the number of bytes in var or the
number of characters (which may differ for multi byte characters, like
UTF-8)?



3) applet expr, function length STRING
This may also hit the *index*, *substr* and *match* functions. Do we
look at character positions or at byte positions? What do the specs
say on this?



4) applet printf, string formats %Ns
Does this N mean character positions or bytes? The underlying C printf
has worked with bytes for decades. The man page talks about character
positions, but printf from bash uses bytes.



5) applet awk, function length()
This may also hit other string functions, like *index*, *match*, 
*substr*, *sub*, etc.


Those functions have worked with byte positions for decades, neglecting
multi byte characters. The specs don't seem to be specific on this.
Changing things may break many existing scripts!
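
As a small illustration of point 4, here is a sketch showing that the
precision of a C library "%.Ns" conversion counts bytes. The helper name
format_prec is invented for this example; it is not a BusyBox function.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not a BusyBox function): format src into dst
 * using "%.*s" with the given precision and return the number of
 * bytes produced, or 0 on error or if dst was too small. */
static size_t format_prec(char *dst, size_t dstsize, const char *src, int prec)
{
	int n = snprintf(dst, dstsize, "%.*s", prec, src);

	if (n < 0 || (size_t)n >= dstsize)
		return 0;
	return (size_t)n;
}
```

Fed the six bytes c3 a4 c3 b6 c3 bc (äöü) with a precision of 3, this
produces the three bytes c3 a4 c3: one complete character plus the first
half of the next one.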



Do we have further points where we get hit on this topic?

--
Harald

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Additional commands which may be hit by this question:

cut -c, -f

fold -w
Looks as if BB does it right, but differently from upstream.

sort, position specification




Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Tanguy Pruvot
Just remember UTF-8 is not related to wchar, it's just a series of chars
displayed as a single column.

I've seen several implementations which use mbtowc functions to test some
special chars; this is not correct for UTF-8 in my opinion.

If cut fields support strings bigger than a single char, there should be
no problem, the series is found in the input text.



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi !

 If cut fields support strings bigger than a single char, there
 should be no problem, the series is found in the input text.

$ echo -n äöü | hd
  c3 a4 c3 b6 c3 bc

$ echo -n äöü | cut -c1 | hd
  c3 0a

$ echo -n äöü | cut -c2 | hd
  a4 0a

This shows that the position given with cut -c does not pick the correct
character. BB behaves the same as upstream.


cut has a -b option to specify byte positions, but -c is said to
use character positions. So I expect either -c1 (when counted from zero)
or -c2 (when counted from one) to emit the ö (o-umlaut) from the input
text.
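
A minimal sketch of what a character-based selection could look like,
assuming the input is nominally UTF-8 and treating every byte in
0x80..0xBF as a continuation byte. The names is_cont and utf8_char_at are
made up for this example; this is not how BB cut or upstream cut is
implemented.

```c
#include <stddef.h>

/* A byte is a UTF-8 continuation byte if its top two bits are 10. */
static int is_cont(unsigned char c)
{
	return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: find the n-th character (counted from one) in s.
 * Returns a pointer to its first byte and stores its length in bytes
 * in *len, or returns NULL if s has fewer than n characters. */
static const char *utf8_char_at(const char *s, size_t n, size_t *len)
{
	const char *end;

	if (n == 0)
		return NULL;
	while (*s) {
		/* s is at the start of a character (or at a stray
		 * continuation byte if the input is invalid) */
		end = s + 1;
		while (is_cont((unsigned char)*end))
			end++;
		if (--n == 0) {
			*len = (size_t)(end - s);
			return s;
		}
		s = end;
	}
	return NULL;
}
```

For the bytes c3 a4 c3 b6 c3 bc, asking for character 2 returns a pointer
to offset 2 with a length of 2, i.e. the complete c3 b6 sequence of the ö.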


--
Harald


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Tanguy Pruvot
In this case yes, indeed; my mblen() function posted some days ago could be
used to prevent display of cut-up char series.

The real problem with Unicode is UTF-16, which contains \0 chars (but it's
another, uncommon problem).
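
To make the \0 remark concrete, a small sketch with a hand-written
UTF-16LE buffer. The names ab_utf16le and c_strlen_of_utf16 exist only in
this example.

```c
#include <string.h>

/* "AB" encoded as UTF-16LE: every ASCII character is followed by a
 * zero byte, and the terminator is two zero bytes. */
static const unsigned char ab_utf16le[] = {
	0x41, 0x00, 0x42, 0x00, 0x00, 0x00
};

/* What a byte-oriented C string function sees in that buffer. */
static size_t c_strlen_of_utf16(void)
{
	return strlen((const char *)ab_utf16le);
}
```

strlen() stops at the first zero byte and reports 1, although the buffer
holds two characters. UTF-8 produces a zero byte only for U+0000 itself,
so the usual C string functions keep working on it.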




Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker



I've seen several implementations which use mbtowc functions to test some
special chars; this is not correct for UTF-8 in my opinion.


To count the number of UTF-8 characters is really simple, just count all 
bytes except those with value in range 0x80 to 0xBF. This has two 
exceptions 0xFE and 0xFF which are no official UTF-8 characters, but I 
think it's not wrong to count and behave as such.



Counting can be done with one logical and one compare instruction:

if ((c ^ 0x40) < 0xC0) n++;   /* c is an unsigned char */
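
Reading the comparison as "less than", the XOR test can be checked
against the plain range rule for all 256 byte values. The function name
xor_test_matches_range is made up for this sketch.

```c
/* Returns 1 if, for every byte value, the XOR test agrees with the
 * plain range test "byte is not in 0x80..0xBF", otherwise 0. */
static int xor_test_matches_range(void)
{
	unsigned c;

	for (c = 0; c < 256; c++) {
		int by_xor = ((c ^ 0x40) < 0xC0);
		int by_range = !(c >= 0x80 && c <= 0xBF);

		if (by_xor != by_range)
			return 0;
	}
	return 1;
}
```

The check only holds for unsigned byte values. On platforms where plain
char is signed, bytes from 0x80 upwards become negative after promotion
to int and always compare below 0xC0, so the byte should be converted to
unsigned char before the XOR.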




Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker



The real problem with Unicode is UTF-16, which contains \0 chars (but it's
another, uncommon problem).


This raises an interesting question: Do we want to add UTF-8 support to
BB or full multi byte support? The former may be simpler, the latter more
correct.





Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Denys Vlasenko
On Wed, Aug 13, 2014 at 1:40 PM, Harald Becker ra...@gmx.de wrote:
 2) shell substitution ${#var}
 Shall this length operation give the number of bytes in var or the
 number of characters (which may differ for multi byte characters, like
 UTF-8)?

bash gives number of Unicode chars.
I just fixed both ash and hush to do the same.

 4) applet printf, string formats %Ns
 Does this N mean character positions or bytes? The underlying C printf has
 worked with bytes for decades. The man page talks about character
 positions, but printf from bash uses bytes.

Also, printf needs to support \u
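
The core of \u support is turning a code point into UTF-8 bytes. A
minimal sketch of that step follows; the function name utf8_encode is
invented here, and parsing the escape sequence itself is not covered.

```c
#include <stddef.h>

/* Hypothetical helper (not a BusyBox function): encode code point cp
 * as UTF-8 into out, which must have room for 4 bytes. Returns the
 * number of bytes written, or 0 for surrogates (U+D800..U+DFFF) and
 * values above U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

For example, U+00E4 (ä) becomes the two bytes c3 a4, matching the hd
output shown earlier in this thread.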

-- 
vda


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Denys Vlasenko
On Wed, Aug 13, 2014 at 3:42 PM, Harald Becker ra...@gmx.de wrote:

 I've seen several implementations which use mbtowc functions to test some
 special chars; this is not correct for UTF-8 in my opinion.


 To count the number of UTF-8 characters is really simple, just count all
 bytes except those with value in range 0x80 to 0xBF. This has two exceptions
 0xFE and 0xFF which are no official UTF-8 characters, but I think it's not
 wrong to count and behave as such.


 Counting can be done with one logical and one compare instruction:

 if ((c ^ 0x40) < 0xC0) n++

include/{libbb,unicode}.h already have a bunch of helpers
to do unicode_strlen(), and a few other typical functions:

typedef struct uni_stat_t {
	unsigned byte_count;
	unsigned unicode_count;
	unsigned unicode_width;
} uni_stat_t;
/* Returns a string with unprintable chars replaced by '?' or
 * SUBST_WCHAR. This function is unicode-aware. */
const char* FAST_FUNC printable_string(uni_stat_t *stats, const char *str);

/* Number of unicode chars. Falls back to strlen() on invalid unicode */
size_t FAST_FUNC unicode_strlen(const char *string);
/* Width on terminal */
size_t FAST_FUNC unicode_strwidth(const char *string);
enum {
	UNI_FLAG_PAD = (1 << 0),
};
char* FAST_FUNC unicode_conv_to_printable(uni_stat_t *stats, const char *src);
char* FAST_FUNC unicode_conv_to_printable_fixedwidth(/*uni_stat_t *stats,*/ const char *src, unsigned width);


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Denys Vlasenko
On Wed, Aug 13, 2014 at 4:01 PM, Harald Becker ra...@gmx.de wrote:

 The real problem with Unicode is UTF-16, which contains \0 chars (but it's
 another, uncommon problem).


 This raises an interesting question: Do we want to add UTF-8 support to BB
 or full multi byte support? The former may be simpler, the latter more
 correct.

The world seems to be standardizing on utf-8.

Thank God, supporting gazillion of encodings is no fun.


Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi Denys!


This raises an interesting question: Do we want to add UTF-8 support to BB
or full multi byte support? The former may be simpler, the latter more
correct.


The world seems to be standardizing on utf-8.
Thank God, supporting gazillion of encodings is no fun.


Full ACK.

--
Harald




Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi Denys !

 The world seems to be standardizing on utf-8.

Thank God, supporting gazillion of encodings is no fun.


You say this, but libbb/unicode.c contains a unicode_strlen calling this
complex mb to wc conversion function to count the number of characters.
Those multi byte functions tend to be highly complex and slow (I don't
know if they have gotten better). For just UTF-8, things can be optimized.


e.g.

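size_t utf8len( const char* s )
{
  size_t n = 0;
  while (*s)
    /* count every byte that is not a continuation byte (0x80..0xBF);
       the cast keeps the test correct where plain char is signed */
    if (((unsigned char)*s++ ^ 0x40) < 0xC0)
      n++;
  return n;
}

size_t mystrlen( const char* s )
{
  return utf8_enabled ? utf8len(s) : strlen(s);
}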

This looks like more code, but avoids pulling in the mb functions. Most compilers
should produce fast code for utf8len.


utf8len is for UTF-8 only usage; mystrlen may be used to switch between
an 8-bit locale and UTF-8. If we could switch to UTF-8 only, we may forget
about mystrlen and always use utf8len.



Another fast function I use for UTF-8 ... skip to the Nth UTF-8 character in
a string (returns a pointer to the trailing \0 if N > number of UTF-8 chars
in the string):


char *utf8skip( char const* s, size_t n )
{
  for ( ; n && *s; --n )
    while (((unsigned char)*++s ^ 0x40) >= 0xC0)
      ;  /* step over continuation bytes 0x80..0xBF */
  return (char*)s;
}


Those are examples, other functions may also be optimized. It all 
depends on the question if those darn big mb functions shall be used or not.


--
Harald



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi Denys !


2) shell substitution ${#var}
Shall this length operation give the number of bytes in var or the
number of characters (which may differ for multi byte characters, like
UTF-8)?


bash gives number of Unicode chars.
I just fixed both ash and hush to do the same.


bash seems to be the only shell which does this. So is this a bash-ism?

... and expr length $var (upstream) still returns the size in bytes, not
characters.


--
Harald



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi Denys !


2) shell substitution ${#var}
Shall this length operation give the number of bytes in var or the
number of characters (which may differ for multi byte characters, like
UTF-8)?


bash gives number of Unicode chars.
I just fixed both ash and hush to do the same.


Add a big warning to the release notes; I think this is a shell script
breaker. Shell scripts which rely on getting the number of bytes may now
fail, because they allocate too little space or copy too few characters.


--
Harald



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Paul Smith
On Wed, 2014-08-13 at 19:23 +0200, Harald Becker wrote:
  bash gives number of Unicode chars.
  I just fixed both ash and hush to do the same.
 
 bash seems to be the only shell which does this. So is this a
 bash-ism?

The POSIX standard says that ${#var} gives the length of variable var in
characters.  I can't find, offhand, a definition of characters in the
standard, but one would assume that if they meant bytes they would say
that... ?

Probably worth a question to the POSIX folks for a clarification.



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Paul Smith
On Wed, 2014-08-13 at 13:52 -0400, Paul Smith wrote:
 The POSIX standard says that ${#var} gives the length of variable var
 in characters.  I can't find, offhand, a definition of characters
 in the standard

D'oh!  It was only in the most obvious place:

3.87 Character

A sequence of one or more bytes representing a single graphic
symbol or control code.

So it seems like bash has this right.



Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

On 13.08.2014 19:56, Paul Smith wrote:

On Wed, 2014-08-13 at 13:52 -0400, Paul Smith wrote:

The POSIX standard says that ${#var} gives the length of variable var
in characters.  I can't find, offhand, a definition of characters
in the standard


D'oh!  It was only in the most obvious place:

 3.87 Character

 A sequence of one or more bytes representing a single graphic
 symbol or control code.

So it seems like bash has this right.





Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi Paul !

 The POSIX standard says that ${#var} gives the length of variable var

in characters.  I can't find, offhand, a definition of characters
in the standard


D'oh!  It was only in the most obvious place:

 3.87 Character

 A sequence of one or more bytes representing a single graphic
 symbol or control code.

So it seems like bash has this right.


Oh, nice! If this is what the standard says, I like it. I just wanted to
be objective. Any change here will be a script breaker anyway.


--
Harald




Re: Possible Unicode Problems in Busybox - Collect and Discussion

2014-08-13 Thread Harald Becker

Hi Denys!

 2) shell substitution ${#var}

Shall this length operation give the number of bytes in var or the
number of characters (which may differ for multi byte characters, like
UTF-8)?


bash gives number of Unicode chars.
I just fixed both ash and hush to do the same.


You fixed this one, but there are two more related shell substitutions 
to modify (sorry didn't find them earlier):


${xxx:offset}
${xxx:offset:length}

offset and length are counted in characters in bash; BB ash uses
bytes.
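
A rough sketch of how character-based offset and length could be mapped
onto a byte range, assuming nominally UTF-8 data. The names skip_chars
and utf8_substr are invented for this example and this is not the ash or
hush code.

```c
#include <stddef.h>

/* Advance over up to n characters; stops early at the terminating '\0'.
 * Bytes in 0x80..0xBF are treated as continuation bytes. */
static const char *skip_chars(const char *s, size_t n)
{
	while (n && *s) {
		s++;
		while (((unsigned char)*s & 0xC0) == 0x80)
			s++;
		n--;
	}
	return s;
}

/* Hypothetical helper: translate a character offset and length into a
 * byte range of s. Stores the start of the range in *start and
 * returns its size in bytes. */
static size_t utf8_substr(const char *s, size_t offset, size_t length,
			const char **start)
{
	const char *b = skip_chars(s, offset);
	const char *e = skip_chars(b, length);

	*start = b;
	return (size_t)(e - b);
}
```

With the four bytes 61 c3 b6 7a (aöz), offset 1 and length 1 select the
two bytes c3 b6, where a byte-based substitution would return only c3.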



