Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Wolfgang Lux
Richard Frith-Macdonald wrote:

 We could probably adapt your patch to use precision as string lengh in those 
 cases where it will work, but you can't catch all cases that way ... so maybe 
 it's better if people find out as soon as possible that c-strings have to be 
 nul terminated.
 
 Sorry about this ... but it's a behavior inherited from the C stdio library 
 and posix etc standards.  My own feeling is that format strings *ought* to 
 provide some way of working with unterminated strings, but they just don't, 
 so you have to copy the data into a big enough buffer, add the nul 
 terminator, and use that buffer intead of the original data :-(

I don't think your description of the standards is correct. My copy of the ANSI 
C'99 standard has this to say on the %s format specifier:
If the precision is specified, no more than that many characters are written. 
If the precision is not specified or is greater than the size of the array, the 
array shall contain a null character.
With that specification, I'd say that Chris's code is correct. He uses an array 
containing 50 bytes and uses precision 50, so the array shouldn't require a 
NULL terminator.

Wolfgang


___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Richard Frith-Macdonald

On 3 Oct 2012, at 00:06, Stefan Bidi wrote:

 I just wanted to weight in real quick.  Chris proposed behavior is
 exactly how I wrote the CoreBase string formatting function.  I tested
 this how fprintf() works on Debian and SUSE, and came to the same
 conclusions as Chris.
 
 I believe, more to the point, is that the call to strlen isn't needed.
 A fixed precision is already given, so why waste time looking for
 NULL?


The thing is ... strlen() is supposed to get the number of chars (bytes) ... 
and that's not what the precision part of the forrmat string tells us.

For instance, if your locale is (as on most systems nowadays) a utf-8 based 
one, a character is usually one byte but it may be more than one byte.  If the 
code assumes that a precision of 5 means a 5 byte input string, but the string 
actually contains 5 2-byte characters, then your output gets 
truncated/corrupted.

Your code may be fine, depending on what it's using the precision field for, 
but in GSFormat the code is sorting out the amount of space needed ... so it 
needs the byte-length of the input string.

Anyway, because precision is in characters and length is in bytes we can't 
assume that precision tells us length.  I guess we could write code which would 
parse the string character by character (to no more than the precision 
specified number of characters) to determine the length in bytes.  The thing is 
... do we want to go to that extra effort to more tolerantly handle the case 
where the caller has supplied an illegal argument?

Also, strlen() is efficient ... I don't know if it's a consideration in 
practice (and suspect it isn't really an issue), but more complex and slower 
code might concievably slow some applications.



___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Richard Frith-Macdonald

On 3 Oct 2012, at 08:09, Wolfgang Lux wrote:

 Richard Frith-Macdonald wrote:
 
 We could probably adapt your patch to use precision as string lengh in those 
 cases where it will work, but you can't catch all cases that way ... so 
 maybe it's better if people find out as soon as possible that c-strings have 
 to be nul terminated.
 
 Sorry about this ... but it's a behavior inherited from the C stdio library 
 and posix etc standards.  My own feeling is that format strings *ought* to 
 provide some way of working with unterminated strings, but they just don't, 
 so you have to copy the data into a big enough buffer, add the nul 
 terminator, and use that buffer intead of the original data :-(
 
 I don't think your description of the standards is correct. My copy of the 
 ANSI C'99 standard has this to say on the %s format specifier:
 If the precision is specified, no more than that many characters are 
 written. If the precision is not specified or is greater than the size of the 
 array, the array shall contain a null character.
 With that specification, I'd say that Chris's code is correct. He uses an 
 array containing 50 bytes and uses precision 50, so the array shouldn't 
 require a NULL terminator.

Oh, that's a different section of the documentationm (I was reading the bit 
dealing with precision, and I just found the bit you quote under the 's' flag).
Which would mean there are apparent inconsistencies ... so I looked further 
(specifically at recent xopen documentation ... which really ought to be 
authoritative for modern software).

And ... that's different again ... the xopen docs make it clear that they are 
talking about *bytes* (so the current implementation is wrong) where other 
documentation talks about characters:

The argument shall be a pointer to an array of char. Bytes from the array shall 
be written up to (but not including) any terminating null byte. If the 
precision is specified, no more than that many bytes shall be written. If the 
precision is not specified or is greater than the size of the array, the 
application shall ensure that the array contains a null byte.
If an l (ell) qualifier is present, the argument shall be a pointer to an array 
of type wchar_t. Wide characters from the array shall be converted to 
characters (each as if by a call to the wcrtomb() function, with the conversion 
state described by an mbstate_t object initialized to zero before the first 
wide character is converted) up to and including a terminating null wide 
character. The resulting characters shall be written up to (but not including) 
the terminating null character (byte). If no precision is specified, the 
application shall ensure that the array contains a null wide character. If a 
precision is specified, no more than that many characters (bytes) shall be 
written (including shift sequences, if any), and the array shall contain a null 
wide character if, to equal the character sequence length given by the 
precision, the function would need to access a wide character one past the end 
of the array. In no case shall a partial character be written.

Interestingly, they are very specific about saying that the precision is a 
number of bytes rather than a number of characters (quite different from the 
older documentation I was looking at before) even in the case where the output 
is wide characters.  They even mention omitting the last character if it's a 
multibyte one and not all bytes would be permitted by the precision. 

Maybe we should update the code to try to match the modern standard, but ... in 
the context of GSFormat adopting a byte-based output precision would be very 
counter-intuitive since an NSString deals with UTF-16 and everyone expects the 
precision to give a number of 16bit characters in the resulting NSString object.

So I'm not sure what to do ... the C standards have changed from working with 
characters to working with bytes (which is good), but we can't simply adopt 
that because it would break OSX compatibility (and people's reasonable 
expectations).

Perhaps what we need is what I suggested (as a complex/inefficient option) in 
an earlier email ... to parse the input string character by character and treat 
the precision as a limit on the number of characters we read from it.
Perhaps tests on OSX to reverse-engineer Apple's behavior are our best bet.
___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Jens Ayton
On Oct 3, 2012, at 09:53, Richard Frith-Macdonald rich...@tiptree.demon.co.uk 
wrote:
 
 So I'm not sure what to do ... the C standards have changed from working with 
 characters to working with bytes (which is good),

Well, no. In the C standard, character generally means the same thing as 
byte (i.e., a value that can fit in a char). In point of fact, the standard 
provides two conflicting normative definitions of character (one marked 
abstract, the other C), but in the specification [f]printf() it seems 
character = byte is what is meant. Both the C99 and C11 final drafts have a 
footnote saying No special provisions are made for multibyte characters.

The sentence In no case is a partial multibyte character written. only 
applies to %ls format, i.e. when converting a wchar_t* string into a possibly 
multi-byte sequence for a char* string.

The closest analogue to NSString formatting is using %s in [f]wprintf(). In 
this case, characters (i.e., bytes) from the string are converted as if by 
repeated calls to the mbrtowc function (with sane initial state), and the 
precision limits the number of wide characters to be written. This is 
unproblematic because wchar_ts are required to be complete code units, but 
Foundation unichars can be UTF-16 surrogates, so this still doesn't resolve the 
issue.

In summary, figure out what Cocoa does. :-)


-- 
Jens Ayton


___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


NSString bug with test and really dodgy patch.

2012-10-02 Thread Chris Ball

I ran into some problems the other day with clang's address sanitizer getting
very cross with me and I eventually chased it down to [NSString initWithFormat:
arguments:]  specifically GSFormat.m blithely calls strlen on my input string
even if I have specified a precision, this is not good given that I'm not
dealing with null terminated strings for the most part.  I expect the same bug
affects stringWithFormat and probably other stuff as well but I haven't bothered
to look into what all uses the GSFormat stuff too closely.


Here is a little test driver showing the problem (you will need to be
running clang's address sanitizer or electric fence or something for an error
to trigger since the nature of the bug is that it is looking at memory that
doesn't belong to it. 

--SNIP SNIP---
#import Foundation/Foundation.h

#define BYTES 50
#define STRING
12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

void sub( const id arg, ...);

int main( void )
{
NSAutoreleasePool *pool  = [[NSAutoreleasePool alloc] init];
char *bob;

bob = malloc( BYTES );
memcpy( bob, STRING, BYTES );

sub( @%.*s\n, BYTES , bob );

[pool release];
}


void sub( const id arg, ...)
{
va_list ap;
NSString *str;

va_start(ap, arg);
str=[[NSString alloc] initWithFormat:arg arguments:ap];
va_end(ap);

NSLog(@%@, str );

[str release];
}
SNIP SNIP-


The fix I put together pretty much says that if a precision is specified, use
that for the length otherwise feel free to call strlen(), if I'm reading the
code right, I think this should work fine albeit with a slight potential
performance loss when there is a unicode string with a precision less than a
quarter of the string's length (and doubtless a trivial performance increase the
rest of the time since it isn't iterating a string for no reason).  Having said
all this I leave it to someone with a better familiarity with this stuff to
vet it because I only took a fairly shallow look at the code to get where I am.

---PATCHY PATCH---
--- gnustep-base-1.24.0/Source/GSFormat.m
2011-12-15 01:42:39.0 -0800 +++ GSFormat.m  2012-10-02
13:12:26.778053030 -0700 @@ -1759,59 +1759,35 @@
byteEncoding = GSPrivateIsByteEncoding(enc);
  }
 
-   len = strlen(str);  // Number of bytes to convert.
-   blen = len; // Size of unichar output buffer.
-
-   if (prec != -1)
+if ( prec == -1 )
+  {
+len = strlen(str); // Number of bytes to convert.
+blen = len;// Size of unichar output buffer.
+  } 
+else
  {
-   if (prec  len)
- {
-   /* We don't neeed an output buffer bigger than the
-* precision specifies.
-*/
-   blen = prec;
- }
-   if (byteEncoding == YES)
- {
-   /* Where the external encoding is one byte per character,
-* we know we don't need to convert more bytes than the
-* precision required for output.
-*/
-   if (prec  len)
- {
-   len = prec;
- }
- }
-   else if (prec * 4  len)
- {
-   /* We assume no multibyte encoding is going to use more
-* than the maximum four bytes used by utf-8 for any
-* unicode code point.  So we do not need to convert
-* more than four times the precision.
-*/
-   len = prec * 4;
- }
- }
+/* We don't need an output buffer bigger than the
+ * precision specifies.
+ */
+len = prec;
+blen = prec;
 
-   /* Allocate dynamically an array which definitely is long
-* enough for the unichar version.
-*/
-   if (blen  8192 || ((string = (unichar *)
- NSZoneMalloc(s-_zone, blen * sizeof(unichar))) == NULL))
- string = (unichar *) alloca (blen * sizeof(unichar));
-   else
- string_malloced = 1;
+/* Allocate dynamically an array which definitely is long
+ * enough for the unichar version.
+ */
+if (blen  8192 || ((string = (unichar *)
+  NSZoneMalloc(s-_zone, blen * sizeof(unichar))) == NULL))
+  string = (unichar *) alloca (blen * sizeof(unichar));
+else
+  string_malloced = 1;
 
-   

Re: NSString bug with test and really dodgy patch.

2012-10-02 Thread Richard Frith-Macdonald

On 2 Oct 2012, at 22:10, Chris Ball wrote:

 
 I ran into some problems the other day with clang's address sanitizer getting
 very cross with me and I eventually chased it down to [NSString 
 initWithFormat:
 arguments:]  specifically GSFormat.m blithely calls strlen on my input string
 even if I have specified a precision, this is not good given that I'm not
 dealing with null terminated strings for the most part.  I expect the same bug
 affects stringWithFormat and probably other stuff as well but I haven't 
 bothered
 to look into what all uses the GSFormat stuff too closely.

I kind of agree with your sentiments (having been bitten by that myself when 
logging stuff), but I investigated it at the time and GSFormat.m is right and 
your code is wrong.

The reason is that the %s format deal with a nul terminated c-string argument, 
and by definition that's *not* an array of char whose length is determined by 
the precision of the format string, the nul terminator is *mandatory*.  If you 
code passes an argument which is not nul terminated than your code is doing 
something illegal and you can't really complain about *anything* that happens.
Also, the precision in the format works in conjunction with field width and 
alignment ... the format code needs to determine the length of the string you 
passed (using strlen) when it decides which part of the string to use ... so in 
the case where the rightmost part of the string should be displayed, using the 
precision as the length would give the wrong result.

We could probably adapt your patch to use precision as string lengh in those 
cases where it will work, but you can't catch all cases that way ... so maybe 
it's better if people find out as soon as possible that c-strings have to be 
nul terminated.

Sorry about this ... but it's a behavior inherited from the C stdio library and 
posix etc standards.  My own feeling is that format strings *ought* to provide 
some way of working with unterminated strings, but they just don't, so you have 
to copy the data into a big enough buffer, add the nul terminator, and use that 
buffer intead of the original data :-(
___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


Re: NSString bug with test and really dodgy patch.

2012-10-02 Thread Chris Ball

 I kind of agree with your sentiments (having been bitten by that myself when 
 logging stuff), but I investigated it at the time and GSFormat.m is right and 
 your code is wrong.
 
 The reason is that the %s format deal with a nul terminated c-string 
 argument, and by definition that's *not* an array of char whose length is 
 determined by the precision of the format string, the nul terminator is 
 *mandatory*.  If you code passes an argument which is not nul terminated than 
 your code is doing something illegal and you can't really complain about 
 *anything* that happens.
 Also, the precision in the format works in conjunction with field width and 
 alignment ... the format code needs to determine the length of the string you 
 passed (using strlen) when it decides which part of the string to use ... so 
 in the case where the rightmost part of the string should be displayed, using 
 the precision as the length would give the wrong result.
 
 We could probably adapt your patch to use precision as string lengh in those 
 cases where it will work, but you can't catch all cases that way ... so maybe 
 it's better if people find out as soon as possible that c-strings have to be 
 nul terminated.
 
 Sorry about this ... but it's a behavior inherited from the C stdio library 
 and posix etc standards.  My own feeling is that format strings *ought* to 
 provide some way of working with unterminated strings, but they just don't, 
 so you have to copy the data into a big enough buffer, add the nul 
 terminator, and use that buffer intead of the original data :-(

Interesting, I've never read the actual standard, my copy of KR (2nd ed.) just
says (in table B-1);

's' char *; characters from the string are printed until a '\0' is reached or
until the number of characters indicated by the precision have been printed.

So by from the way KR reads it is a bug.  No idea about POSIX et. al. though.
I find it humorous that my book opened to exactly that page and I haven't looked
in there in quite a number of years.


Chris.

___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


Re: NSString bug with test and really dodgy patch.

2012-10-02 Thread Stefan Bidi
I just wanted to weight in real quick.  Chris proposed behavior is
exactly how I wrote the CoreBase string formatting function.  I tested
this how fprintf() works on Debian and SUSE, and came to the same
conclusions as Chris.

I believe, more to the point, is that the call to strlen isn't needed.
 A fixed precision is already given, so why waste time looking for
NULL?

Running the attached test program with valgrind does not produce an
error.  Changing the precision to 5 will, in fact, produce an invalid
read error.

Stef

On Tue, Oct 2, 2012 at 5:30 PM, Chris Ball li...@borderlinetech.ca wrote:

 I kind of agree with your sentiments (having been bitten by that myself when 
 logging stuff), but I investigated it at the time and GSFormat.m is right 
 and your code is wrong.

 The reason is that the %s format deal with a nul terminated c-string 
 argument, and by definition that's *not* an array of char whose length is 
 determined by the precision of the format string, the nul terminator is 
 *mandatory*.  If you code passes an argument which is not nul terminated 
 than your code is doing something illegal and you can't really complain 
 about *anything* that happens.
 Also, the precision in the format works in conjunction with field width and 
 alignment ... the format code needs to determine the length of the string 
 you passed (using strlen) when it decides which part of the string to use 
 ... so in the case where the rightmost part of the string should be 
 displayed, using the precision as the length would give the wrong result.

 We could probably adapt your patch to use precision as string lengh in those 
 cases where it will work, but you can't catch all cases that way ... so 
 maybe it's better if people find out as soon as possible that c-strings have 
 to be nul terminated.

 Sorry about this ... but it's a behavior inherited from the C stdio library 
 and posix etc standards.  My own feeling is that format strings *ought* to 
 provide some way of working with unterminated strings, but they just don't, 
 so you have to copy the data into a big enough buffer, add the nul 
 terminator, and use that buffer intead of the original data :-(

 Interesting, I've never read the actual standard, my copy of KR (2nd ed.) 
 just
 says (in table B-1);

 's' char *; characters from the string are printed until a '\0' is reached or
 until the number of characters indicated by the precision have been printed.

 So by from the way KR reads it is a bug.  No idea about POSIX et. al. though.
 I find it humorous that my book opened to exactly that page and I haven't 
 looked
 in there in quite a number of years.


 Chris.

 ___
 Gnustep-dev mailing list
 Gnustep-dev@gnu.org
 https://lists.gnu.org/mailman/listinfo/gnustep-dev
#include stdio.h
#include stdlib.h

char *test;

int main (void)
{
  test = malloc (4);
  
  test[0] = 'a';
  test[1] = 'b';
  test[2] = 'c';
  test[3] = 'd';
  fprintf (stdout, %.*s\n, 4, test);
  
  free (test);
  return 0;
}
___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev