Re: NSString bug with test and really dodgy patch.
Richard Frith-Macdonald wrote:

> We could probably adapt your patch to use precision as string length in those cases where it will work, but you can't catch all cases that way ... so maybe it's better if people find out as soon as possible that c-strings have to be nul terminated.
>
> Sorry about this ... but it's a behavior inherited from the C stdio library and posix etc standards. My own feeling is that format strings *ought* to provide some way of working with unterminated strings, but they just don't, so you have to copy the data into a big enough buffer, add the nul terminator, and use that buffer instead of the original data :-(

I don't think your description of the standards is correct. My copy of the ANSI C'99 standard has this to say on the %s format specifier:

  If the precision is specified, no more than that many characters are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character.

With that specification, I'd say that Chris's code is correct. He uses an array containing 50 bytes and uses precision 50, so the array shouldn't require a NULL terminator.

Wolfgang

___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev
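For illustration, here is a minimal C sketch of the reading of the standard quoted above: with an explicit precision, %s should not need a terminator as long as the array holds at least that many bytes. The helper name format_unterminated is hypothetical and is not taken from Chris's patch or from GSFormat; it only exercises the plain C library behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats exactly `len` bytes of a buffer that has
   no terminating null. Returns what snprintf returns, i.e. the number of
   bytes the full output needs, excluding the terminator. */
int format_unterminated(char *out, size_t out_size, const char *data, int len)
{
    assert(out != NULL && data != NULL && len >= 0);
    /* The precision caps how many bytes %s may read and write, so `data`
       needs no null byte provided it holds at least `len` bytes. */
    return snprintf(out, out_size, "[%.*s]", len, data);
}
```

Called with a 5-byte array holding the bytes of "hello" (no terminator) and len 5, this should produce "[hello]".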
Re: NSString bug with test and really dodgy patch.
On 3 Oct 2012, at 00:06, Stefan Bidi wrote:

> I just wanted to weigh in real quick. Chris's proposed behavior is exactly how I wrote the CoreBase string formatting function. I tested how fprintf() works on Debian and SUSE, and came to the same conclusions as Chris.
>
> I believe, more to the point, is that the call to strlen isn't needed. A fixed precision is already given, so why waste time looking for NULL?

The thing is ... strlen() is supposed to get the number of chars (bytes) ... and that's not what the precision part of the format string tells us.

For instance, if your locale is (as on most systems nowadays) a utf-8 based one, a character is usually one byte but it may be more than one byte. If the code assumes that a precision of 5 means a 5 byte input string, but the string actually contains 5 2-byte characters, then your output gets truncated/corrupted.

Your code may be fine, depending on what it's using the precision field for, but in GSFormat the code is sorting out the amount of space needed ... so it needs the byte-length of the input string.

Anyway, because precision is in characters and length is in bytes we can't assume that precision tells us length. I guess we could write code which would parse the string character by character (to no more than the precision specified number of characters) to determine the length in bytes. The thing is ... do we want to go to that extra effort to more tolerantly handle the case where the caller has supplied an illegal argument?

Also, strlen() is efficient ... I don't know if it's a consideration in practice (and suspect it isn't really an issue), but more complex and slower code might conceivably slow some applications.
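The character-by-character scan suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is taken to be UTF-8 (rather than whatever the current locale says), the function name utf8_prefix_bytes is hypothetical, and none of this is the actual GSFormat code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes occupied by at most
   `max_chars` UTF-8 characters of `s`, stopping early at a null byte.
   It never reads a byte beyond the last character it counts, so `s`
   need not be terminated if it holds `max_chars` complete characters. */
size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t bytes = 0;
    size_t chars = 0;

    while (chars < max_chars && p[bytes] != '\0') {
        unsigned char lead = p[bytes];
        size_t need = 1;                 /* ASCII or invalid lead byte */
        if ((lead & 0xE0) == 0xC0) need = 2;
        else if ((lead & 0xF0) == 0xE0) need = 3;
        else if ((lead & 0xF8) == 0xF0) need = 4;

        /* Accept continuation bytes only; a truncated sequence counts
           as one (short) character instead of skipping past a null. */
        size_t got = 1;
        while (got < need && (p[bytes + got] & 0xC0) == 0x80) {
            got++;
        }
        bytes += got;
        chars++;
    }
    return bytes;
}
```

With a precision of 5 and an input of five 2-byte characters, this reports 10 bytes, which is the byte length a space calculation would need in that case. It is clearly more work per call than strlen().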
Re: NSString bug with test and really dodgy patch.
On 3 Oct 2012, at 08:09, Wolfgang Lux wrote:

> Richard Frith-Macdonald wrote:
>
>> We could probably adapt your patch to use precision as string length in those cases where it will work, but you can't catch all cases that way ... so maybe it's better if people find out as soon as possible that c-strings have to be nul terminated.
>>
>> Sorry about this ... but it's a behavior inherited from the C stdio library and posix etc standards. My own feeling is that format strings *ought* to provide some way of working with unterminated strings, but they just don't, so you have to copy the data into a big enough buffer, add the nul terminator, and use that buffer instead of the original data :-(
>
> I don't think your description of the standards is correct. My copy of the ANSI C'99 standard has this to say on the %s format specifier:
>
>   If the precision is specified, no more than that many characters are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character.
>
> With that specification, I'd say that Chris's code is correct. He uses an array containing 50 bytes and uses precision 50, so the array shouldn't require a NULL terminator.

Oh, that's a different section of the documentation (I was reading the bit dealing with precision, and I just found the bit you quote under the 's' flag).

Which would mean there are apparent inconsistencies ... so I looked further (specifically at recent xopen documentation ... which really ought to be authoritative for modern software).

And ... that's different again ... the xopen docs make it clear that they are talking about *bytes* (so the current implementation is wrong) where other documentation talks about characters:

  The argument shall be a pointer to an array of char. Bytes from the array shall be written up to (but not including) any terminating null byte. If the precision is specified, no more than that many bytes shall be written. If the precision is not specified or is greater than the size of the array, the application shall ensure that the array contains a null byte.

  If an l (ell) qualifier is present, the argument shall be a pointer to an array of type wchar_t. Wide characters from the array shall be converted to characters (each as if by a call to the wcrtomb() function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting characters shall be written up to (but not including) the terminating null character (byte). If no precision is specified, the application shall ensure that the array contains a null wide character. If a precision is specified, no more than that many characters (bytes) shall be written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case shall a partial character be written.

Interestingly, they are very specific about saying that the precision is a number of bytes rather than a number of characters (quite different from the older documentation I was looking at before) even in the case where the output is wide characters. They even mention omitting the last character if it's a multibyte one and not all bytes would be permitted by the precision.

Maybe we should update the code to try to match the modern standard, but ... in the context of GSFormat adopting a byte-based output precision would be very counter-intuitive since an NSString deals with UTF-16 and everyone expects the precision to give a number of 16-bit characters in the resulting NSString object.

So I'm not sure what to do ... the C standards have changed from working with characters to working with bytes (which is good), but we can't simply adopt that because it would break OSX compatibility (and people's reasonable expectations).

Perhaps what we need is what I suggested (as a complex/inefficient option) in an earlier email ... to parse the input string character by character and treat the precision as a limit on the number of characters we read from it.

Perhaps tests on OSX to reverse-engineer Apple's behavior are our best bet.
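To make the "no partial character" rule from the quoted text concrete, here is a small C sketch of a byte-limited cut that backs off to a character boundary. It assumes well-formed UTF-8 input with a known byte length; the name utf8_truncate_bytes is hypothetical, and this is not a claim about how any libc or GSFormat implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the largest prefix length of `s` (which
   is `len` bytes of well-formed UTF-8) that is at most `max_bytes` long
   and does not end in the middle of a multibyte character. */
size_t utf8_truncate_bytes(const char *s, size_t len, size_t max_bytes)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t n = max_bytes < len ? max_bytes : len;

    if (n < len) {
        /* If the first excluded byte is a continuation byte, the cut
           splits a character: back up to that character's lead byte,
           which drops the whole character from the output. */
        while (n > 0 && (p[n] & 0xC0) == 0x80) {
            n--;
        }
    }
    return n;
}
```

For the 6-byte string "h\xC3\xA9llo" and a precision of 2 bytes, the result is 1: the 2-byte character is omitted entirely because only its first byte would fit.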
Re: NSString bug with test and really dodgy patch.
On Oct 3, 2012, at 09:53, Richard Frith-Macdonald rich...@tiptree.demon.co.uk wrote:

> So I'm not sure what to do ... the C standards have changed from working with characters to working with bytes (which is good),

Well, no. In the C standard, "character" generally means the same thing as "byte" (i.e., a value that can fit in a char). In point of fact, the standard provides two conflicting normative definitions of "character" (one marked "abstract", the other "C"), but in the specification of [f]printf() it seems character = byte is what is meant. Both the C99 and C11 final drafts have a footnote saying "No special provisions are made for multibyte characters."

The sentence "In no case is a partial multibyte character written." only applies to the %ls format, i.e. when converting a wchar_t* string into a possibly multi-byte sequence for a char* string.

The closest analogue to NSString formatting is using %s in [f]wprintf(). In this case, characters (i.e., bytes) from the string are converted as if by repeated calls to the mbrtowc function (with sane initial state), and the precision limits the number of wide characters to be written. This is unproblematic because wchar_ts are required to be complete code units, but Foundation unichars can be UTF-16 surrogates, so this still doesn't resolve the issue.

In summary, figure out what Cocoa does. :-)

--
Jens Ayton
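The surrogate problem mentioned above can be shown with a short C sketch: if the precision counts 16-bit units, a cut can land between the two halves of a surrogate pair. The function below applies one possible policy (drop the dangling high surrogate). Both the policy and the name utf16_safe_prefix are assumptions for illustration only; what Cocoa actually does would still have to be determined by testing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clamps `precision` (counted in UTF-16 code units)
   to `length`, then backs off by one unit if the cut would separate a
   high surrogate from the low surrogate that follows it. */
size_t utf16_safe_prefix(const uint16_t *units, size_t length, size_t precision)
{
    assert(units != NULL || length == 0);
    size_t n = precision < length ? precision : length;

    if (n > 0 && n < length
        && units[n - 1] >= 0xD800 && units[n - 1] <= 0xDBFF   /* high */
        && units[n] >= 0xDC00 && units[n] <= 0xDFFF) {        /* low  */
        n--;
    }
    return n;
}
```

For the four units 0x0041, 0xD83D, 0xDE00, 0x0042 (one character outside the BMP between two ASCII letters), a precision of 2 yields a prefix of 1 unit, while a precision of 3 keeps the complete pair.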