Re: NSString bug with test and really dodgy patch.
Richard Frith-Macdonald wrote: We could probably adapt your patch to use precision as string lengh in those cases where it will work, but you can't catch all cases that way ... so maybe it's better if people find out as soon as possible that c-strings have to be nul terminated. Sorry about this ... but it's a behavior inherited from the C stdio library and posix etc standards. My own feeling is that format strings *ought* to provide some way of working with unterminated strings, but they just don't, so you have to copy the data into a big enough buffer, add the nul terminator, and use that buffer intead of the original data :-( I don't think your description of the standards is correct. My copy of the ANSI C'99 standard has this to say on the %s format specifier: If the precision is specified, no more than that many characters are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character. With that specification, I'd say that Chris's code is correct. He uses an array containing 50 bytes and uses precision 50, so the array shouldn't require a NULL terminator. Wolfgang ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev
Re: NSString bug with test and really dodgy patch.
On 3 Oct 2012, at 00:06, Stefan Bidi wrote: I just wanted to weight in real quick. Chris proposed behavior is exactly how I wrote the CoreBase string formatting function. I tested this how fprintf() works on Debian and SUSE, and came to the same conclusions as Chris. I believe, more to the point, is that the call to strlen isn't needed. A fixed precision is already given, so why waste time looking for NULL? The thing is ... strlen() is supposed to get the number of chars (bytes) ... and that's not what the precision part of the forrmat string tells us. For instance, if your locale is (as on most systems nowadays) a utf-8 based one, a character is usually one byte but it may be more than one byte. If the code assumes that a precision of 5 means a 5 byte input string, but the string actually contains 5 2-byte characters, then your output gets truncated/corrupted. Your code may be fine, depending on what it's using the precision field for, but in GSFormat the code is sorting out the amount of space needed ... so it needs the byte-length of the input string. Anyway, because precision is in characters and length is in bytes we can't assume that precision tells us length. I guess we could write code which would parse the string character by character (to no more than the precision specified number of characters) to determine the length in bytes. The thing is ... do we want to go to that extra effort to more tolerantly handle the case where the caller has supplied an illegal argument? Also, strlen() is efficient ... I don't know if it's a consideration in practice (and suspect it isn't really an issue), but more complex and slower code might concievably slow some applications. ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev
Re: NSString bug with test and really dodgy patch.
On 3 Oct 2012, at 08:09, Wolfgang Lux wrote: Richard Frith-Macdonald wrote: We could probably adapt your patch to use precision as string lengh in those cases where it will work, but you can't catch all cases that way ... so maybe it's better if people find out as soon as possible that c-strings have to be nul terminated. Sorry about this ... but it's a behavior inherited from the C stdio library and posix etc standards. My own feeling is that format strings *ought* to provide some way of working with unterminated strings, but they just don't, so you have to copy the data into a big enough buffer, add the nul terminator, and use that buffer intead of the original data :-( I don't think your description of the standards is correct. My copy of the ANSI C'99 standard has this to say on the %s format specifier: If the precision is specified, no more than that many characters are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character. With that specification, I'd say that Chris's code is correct. He uses an array containing 50 bytes and uses precision 50, so the array shouldn't require a NULL terminator. Oh, that's a different section of the documentationm (I was reading the bit dealing with precision, and I just found the bit you quote under the 's' flag). Which would mean there are apparent inconsistencies ... so I looked further (specifically at recent xopen documentation ... which really ought to be authoritative for modern software). And ... that's different again ... the xopen docs make it clear that they are talking about *bytes* (so the current implementation is wrong) where other documentation talks about characters: The argument shall be a pointer to an array of char. Bytes from the array shall be written up to (but not including) any terminating null byte. If the precision is specified, no more than that many bytes shall be written. If the precision is not specified or is greater than the size of the array, the application shall ensure that the array contains a null byte. If an l (ell) qualifier is present, the argument shall be a pointer to an array of type wchar_t. Wide characters from the array shall be converted to characters (each as if by a call to the wcrtomb() function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting characters shall be written up to (but not including) the terminating null character (byte). If no precision is specified, the application shall ensure that the array contains a null wide character. If a precision is specified, no more than that many characters (bytes) shall be written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case shall a partial character be written. Interestingly, they are very specific about saying that the precision is a number of bytes rather than a number of characters (quite different from the older documentation I was looking at before) even in the case where the output is wide characters. They even mention omitting the last character if it's a multibyte one and not all bytes would be permitted by the precision. Maybe we should update the code to try to match the modern standard, but ... in the context of GSFormat adopting a byte-based output precision would be very counter-intuitive since an NSString deals with UTF-16 and everyone expects the precision to give a number of 16bit characters in the resulting NSString object. So I'm not sure what to do ... the C standards have changed from working with characters to working with bytes (which is good), but we can't simply adopt that because it would break OSX compatibility (and people's reasonable expectations). Perhaps what we need is what I suggested (as a complex/inefficient option) in an earlier email ... to parse the input string character by character and treat the precision as a limit on the number of characters we read from it. Perhaps tests on OSX to reverse-engineer Apple's behavior are our best bet. ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev
Re: NSString bug with test and really dodgy patch.
On Oct 3, 2012, at 09:53, Richard Frith-Macdonald rich...@tiptree.demon.co.uk wrote: So I'm not sure what to do ... the C standards have changed from working with characters to working with bytes (which is good), Well, no. In the C standard, character generally means the same thing as byte (i.e., a value that can fit in a char). In point of fact, the standard provides two conflicting normative definitions of character (one marked abstract, the other C), but in the specification [f]printf() it seems character = byte is what is meant. Both the C99 and C11 final drafts have a footnote saying No special provisions are made for multibyte characters. The sentence In no case is a partial multibyte character written. only applies to %ls format, i.e. when converting a wchar_t* string into a possibly multi-byte sequence for a char* string. The closest analogue to NSString formatting is using %s in [f]wprintf(). In this case, characters (i.e., bytes) from the string are converted as if by repeated calls to the mbrtowc function (with sane initial state), and the precision limits the number of wide characters to be written. This is unproblematic because wchar_ts are required to be complete code units, but Foundation unichars can be UTF-16 surrogates, so this still doesn't resolve the issue. In summary, figure out what Cocoa does. :-) -- Jens Ayton ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev
NSString bug with test and really dodgy patch.
I ran into some problems the other day with clang's address sanitizer getting very cross with me and I eventually chased it down to [NSString initWithFormat: arguments:] specifically GSFormat.m blithely calls strlen on my input string even if I have specified a precision, this is not good given that I'm not dealing with null terminated strings for the most part. I expect the same bug affects stringWithFormat and probably other stuff as well but I haven't bothered to look into what all uses the GSFormat stuff too closely. Here is a little test driver showing the problem (you will need to be running clang's address sanitizer or electric fence or something for an error to trigger since the nature of the bug is that it is looking at memory that doesn't belong to it. --SNIP SNIP--- #import Foundation/Foundation.h #define BYTES 50 #define STRING 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 void sub( const id arg, ...); int main( void ) { NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; char *bob; bob = malloc( BYTES ); memcpy( bob, STRING, BYTES ); sub( @%.*s\n, BYTES , bob ); [pool release]; } void sub( const id arg, ...) { va_list ap; NSString *str; va_start(ap, arg); str=[[NSString alloc] initWithFormat:arg arguments:ap]; va_end(ap); NSLog(@%@, str ); [str release]; } SNIP SNIP- The fix I put together pretty much says that if a precision is specified, use that for the length otherwise feel free to call strlen(), if I'm reading the code right, I think this should work fine albeit with a slight potential performance loss when there is a unicode string with a precision less than a quarter of the string's length (and doubtless a trivial performance increase the rest of the time since it isn't iterating a string for no reason). Having said all this I leave it to someone with a better familiarity with this stuff to vet it because I only took a fairly shallow look at the code to get where I am. ---PATCHY PATCH--- --- gnustep-base-1.24.0/Source/GSFormat.m 2011-12-15 01:42:39.0 -0800 +++ GSFormat.m 2012-10-02 13:12:26.778053030 -0700 @@ -1759,59 +1759,35 @@ byteEncoding = GSPrivateIsByteEncoding(enc); } - len = strlen(str); // Number of bytes to convert. - blen = len; // Size of unichar output buffer. - - if (prec != -1) +if ( prec == -1 ) + { +len = strlen(str); // Number of bytes to convert. +blen = len;// Size of unichar output buffer. + } +else { - if (prec len) - { - /* We don't neeed an output buffer bigger than the -* precision specifies. -*/ - blen = prec; - } - if (byteEncoding == YES) - { - /* Where the external encoding is one byte per character, -* we know we don't need to convert more bytes than the -* precision required for output. -*/ - if (prec len) - { - len = prec; - } - } - else if (prec * 4 len) - { - /* We assume no multibyte encoding is going to use more -* than the maximum four bytes used by utf-8 for any -* unicode code point. So we do not need to convert -* more than four times the precision. -*/ - len = prec * 4; - } - } +/* We don't need an output buffer bigger than the + * precision specifies. + */ +len = prec; +blen = prec; - /* Allocate dynamically an array which definitely is long -* enough for the unichar version. -*/ - if (blen 8192 || ((string = (unichar *) - NSZoneMalloc(s-_zone, blen * sizeof(unichar))) == NULL)) - string = (unichar *) alloca (blen * sizeof(unichar)); - else - string_malloced = 1; +/* Allocate dynamically an array which definitely is long + * enough for the unichar version. + */ +if (blen 8192 || ((string = (unichar *) + NSZoneMalloc(s-_zone, blen * sizeof(unichar))) == NULL)) + string = (unichar *) alloca (blen * sizeof(unichar)); +else + string_malloced = 1; -
Re: NSString bug with test and really dodgy patch.
On 2 Oct 2012, at 22:10, Chris Ball wrote: I ran into some problems the other day with clang's address sanitizer getting very cross with me and I eventually chased it down to [NSString initWithFormat: arguments:] specifically GSFormat.m blithely calls strlen on my input string even if I have specified a precision, this is not good given that I'm not dealing with null terminated strings for the most part. I expect the same bug affects stringWithFormat and probably other stuff as well but I haven't bothered to look into what all uses the GSFormat stuff too closely. I kind of agree with your sentiments (having been bitten by that myself when logging stuff), but I investigated it at the time and GSFormat.m is right and your code is wrong. The reason is that the %s format deal with a nul terminated c-string argument, and by definition that's *not* an array of char whose length is determined by the precision of the format string, the nul terminator is *mandatory*. If you code passes an argument which is not nul terminated than your code is doing something illegal and you can't really complain about *anything* that happens. Also, the precision in the format works in conjunction with field width and alignment ... the format code needs to determine the length of the string you passed (using strlen) when it decides which part of the string to use ... so in the case where the rightmost part of the string should be displayed, using the precision as the length would give the wrong result. We could probably adapt your patch to use precision as string lengh in those cases where it will work, but you can't catch all cases that way ... so maybe it's better if people find out as soon as possible that c-strings have to be nul terminated. Sorry about this ... but it's a behavior inherited from the C stdio library and posix etc standards. My own feeling is that format strings *ought* to provide some way of working with unterminated strings, but they just don't, so you have to copy the data into a big enough buffer, add the nul terminator, and use that buffer intead of the original data :-( ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev
Re: NSString bug with test and really dodgy patch.
I kind of agree with your sentiments (having been bitten by that myself when logging stuff), but I investigated it at the time and GSFormat.m is right and your code is wrong. The reason is that the %s format deal with a nul terminated c-string argument, and by definition that's *not* an array of char whose length is determined by the precision of the format string, the nul terminator is *mandatory*. If you code passes an argument which is not nul terminated than your code is doing something illegal and you can't really complain about *anything* that happens. Also, the precision in the format works in conjunction with field width and alignment ... the format code needs to determine the length of the string you passed (using strlen) when it decides which part of the string to use ... so in the case where the rightmost part of the string should be displayed, using the precision as the length would give the wrong result. We could probably adapt your patch to use precision as string lengh in those cases where it will work, but you can't catch all cases that way ... so maybe it's better if people find out as soon as possible that c-strings have to be nul terminated. Sorry about this ... but it's a behavior inherited from the C stdio library and posix etc standards. My own feeling is that format strings *ought* to provide some way of working with unterminated strings, but they just don't, so you have to copy the data into a big enough buffer, add the nul terminator, and use that buffer intead of the original data :-( Interesting, I've never read the actual standard, my copy of KR (2nd ed.) just says (in table B-1); 's' char *; characters from the string are printed until a '\0' is reached or until the number of characters indicated by the precision have been printed. So by from the way KR reads it is a bug. No idea about POSIX et. al. though. I find it humorous that my book opened to exactly that page and I haven't looked in there in quite a number of years. Chris. ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev
Re: NSString bug with test and really dodgy patch.
I just wanted to weight in real quick. Chris proposed behavior is exactly how I wrote the CoreBase string formatting function. I tested this how fprintf() works on Debian and SUSE, and came to the same conclusions as Chris. I believe, more to the point, is that the call to strlen isn't needed. A fixed precision is already given, so why waste time looking for NULL? Running the attached test program with valgrind does not produce an error. Changing the precision to 5 will, in fact, produce an invalid read error. Stef On Tue, Oct 2, 2012 at 5:30 PM, Chris Ball li...@borderlinetech.ca wrote: I kind of agree with your sentiments (having been bitten by that myself when logging stuff), but I investigated it at the time and GSFormat.m is right and your code is wrong. The reason is that the %s format deal with a nul terminated c-string argument, and by definition that's *not* an array of char whose length is determined by the precision of the format string, the nul terminator is *mandatory*. If you code passes an argument which is not nul terminated than your code is doing something illegal and you can't really complain about *anything* that happens. Also, the precision in the format works in conjunction with field width and alignment ... the format code needs to determine the length of the string you passed (using strlen) when it decides which part of the string to use ... so in the case where the rightmost part of the string should be displayed, using the precision as the length would give the wrong result. We could probably adapt your patch to use precision as string lengh in those cases where it will work, but you can't catch all cases that way ... so maybe it's better if people find out as soon as possible that c-strings have to be nul terminated. Sorry about this ... but it's a behavior inherited from the C stdio library and posix etc standards. My own feeling is that format strings *ought* to provide some way of working with unterminated strings, but they just don't, so you have to copy the data into a big enough buffer, add the nul terminator, and use that buffer intead of the original data :-( Interesting, I've never read the actual standard, my copy of KR (2nd ed.) just says (in table B-1); 's' char *; characters from the string are printed until a '\0' is reached or until the number of characters indicated by the precision have been printed. So by from the way KR reads it is a bug. No idea about POSIX et. al. though. I find it humorous that my book opened to exactly that page and I haven't looked in there in quite a number of years. Chris. ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev #include stdio.h #include stdlib.h char *test; int main (void) { test = malloc (4); test[0] = 'a'; test[1] = 'b'; test[2] = 'c'; test[3] = 'd'; fprintf (stdout, %.*s\n, 4, test); free (test); return 0; } ___ Gnustep-dev mailing list Gnustep-dev@gnu.org https://lists.gnu.org/mailman/listinfo/gnustep-dev