Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Wolfgang Lux
Richard Frith-Macdonald wrote:

 We could probably adapt your patch to use precision as string length in those 
 cases where it will work, but you can't catch all cases that way ... so maybe 
 it's better if people find out as soon as possible that c-strings have to be 
 nul terminated.
 
 Sorry about this ... but it's a behavior inherited from the C stdio library 
 and posix etc standards.  My own feeling is that format strings *ought* to 
 provide some way of working with unterminated strings, but they just don't, 
 so you have to copy the data into a big enough buffer, add the nul 
 terminator, and use that buffer instead of the original data :-(

I don't think your description of the standards is correct. My copy of the ANSI 
C'99 standard has this to say on the %s format specifier:
"If the precision is specified, no more than that many characters are written. 
If the precision is not specified or is greater than the size of the array, the 
array shall contain a null character."
With that specification, I'd say that Chris's code is correct. He uses an array 
containing 50 bytes and uses precision 50, so the array shouldn't require a 
NULL terminator.

Wolfgang


___
Gnustep-dev mailing list
Gnustep-dev@gnu.org
https://lists.gnu.org/mailman/listinfo/gnustep-dev


Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Richard Frith-Macdonald

On 3 Oct 2012, at 00:06, Stefan Bidi wrote:

 I just wanted to weigh in real quick.  Chris's proposed behavior is
 exactly how I wrote the CoreBase string formatting function.  I tested
 how fprintf() works on Debian and SUSE, and came to the same
 conclusions as Chris.
 
 More to the point, I believe the call to strlen isn't needed.
 A fixed precision is already given, so why waste time looking for
 the nul terminator?


The thing is ... strlen() is supposed to get the number of chars (bytes) ... 
and that's not what the precision part of the format string tells us.

For instance, if your locale is (as on most systems nowadays) a utf-8 based 
one, a character is usually one byte but it may be more than one byte.  If the 
code assumes that a precision of 5 means a 5 byte input string, but the string 
actually contains 5 2-byte characters, then your output gets 
truncated/corrupted.

Your code may be fine, depending on what it's using the precision field for, 
but in GSFormat the code is sorting out the amount of space needed ... so it 
needs the byte-length of the input string.

Anyway, because precision is in characters and length is in bytes we can't 
assume that precision tells us length.  I guess we could write code which would 
parse the string character by character (to no more than the precision 
specified number of characters) to determine the length in bytes.  The thing is 
... do we want to go to that extra effort to more tolerantly handle the case 
where the caller has supplied an illegal argument?

Also, strlen() is efficient ... I don't know if it's a consideration in 
practice (and suspect it isn't really an issue), but more complex and slower 
code might conceivably slow some applications.





Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Richard Frith-Macdonald

On 3 Oct 2012, at 08:09, Wolfgang Lux wrote:

 Richard Frith-Macdonald wrote:
 
 We could probably adapt your patch to use precision as string length in those 
 cases where it will work, but you can't catch all cases that way ... so 
 maybe it's better if people find out as soon as possible that c-strings have 
 to be nul terminated.
 
 Sorry about this ... but it's a behavior inherited from the C stdio library 
 and posix etc standards.  My own feeling is that format strings *ought* to 
 provide some way of working with unterminated strings, but they just don't, 
 so you have to copy the data into a big enough buffer, add the nul 
 terminator, and use that buffer instead of the original data :-(
 
 I don't think your description of the standards is correct. My copy of the 
 ANSI C'99 standard has this to say on the %s format specifier:
 If the precision is specified, no more than that many characters are 
 written. If the precision is not specified or is greater than the size of the 
 array, the array shall contain a null character.
 With that specification, I'd say that Chris's code is correct. He uses an 
 array containing 50 bytes and uses precision 50, so the array shouldn't 
 require a NULL terminator.

Oh, that's a different section of the documentation (I was reading the bit 
dealing with precision, and I just found the bit you quote under the 's' flag).
Which would mean there are apparent inconsistencies ... so I looked further 
(specifically at recent xopen documentation ... which really ought to be 
authoritative for modern software).

And ... that's different again ... the xopen docs make it clear that they are 
talking about *bytes* (so the current implementation is wrong) where other 
documentation talks about characters:

The argument shall be a pointer to an array of char. Bytes from the array shall 
be written up to (but not including) any terminating null byte. If the 
precision is specified, no more than that many bytes shall be written. If the 
precision is not specified or is greater than the size of the array, the 
application shall ensure that the array contains a null byte.
If an l (ell) qualifier is present, the argument shall be a pointer to an array 
of type wchar_t. Wide characters from the array shall be converted to 
characters (each as if by a call to the wcrtomb() function, with the conversion 
state described by an mbstate_t object initialized to zero before the first 
wide character is converted) up to and including a terminating null wide 
character. The resulting characters shall be written up to (but not including) 
the terminating null character (byte). If no precision is specified, the 
application shall ensure that the array contains a null wide character. If a 
precision is specified, no more than that many characters (bytes) shall be 
written (including shift sequences, if any), and the array shall contain a null 
wide character if, to equal the character sequence length given by the 
precision, the function would need to access a wide character one past the end 
of the array. In no case shall a partial character be written.

Interestingly, they are very specific about saying that the precision is a 
number of bytes rather than a number of characters (quite different from the 
older documentation I was looking at before) even in the case where the output 
is wide characters.  They even mention omitting the last character if it's a 
multibyte one and not all bytes would be permitted by the precision. 

Maybe we should update the code to try to match the modern standard, but ... in 
the context of GSFormat adopting a byte-based output precision would be very 
counter-intuitive since an NSString deals with UTF-16 and everyone expects the 
precision to give a number of 16-bit characters in the resulting NSString object.

So I'm not sure what to do ... the C standards have changed from working with 
characters to working with bytes (which is good), but we can't simply adopt 
that because it would break OSX compatibility (and people's reasonable 
expectations).

Perhaps what we need is what I suggested (as a complex/inefficient option) in 
an earlier email ... to parse the input string character by character and treat 
the precision as a limit on the number of characters we read from it.
Perhaps tests on OSX to reverse-engineer Apple's behavior are our best bet.


Re: NSString bug with test and really dodgy patch.

2012-10-03 Thread Jens Ayton
On Oct 3, 2012, at 09:53, Richard Frith-Macdonald rich...@tiptree.demon.co.uk 
wrote:
 
 So I'm not sure what to do ... the C standards have changed from working with 
 characters to working with bytes (which is good),

Well, no. In the C standard, "character" generally means the same thing as 
"byte" (i.e., a value that can fit in a char). In point of fact, the standard 
provides two conflicting normative definitions of "character" (one marked 
"abstract", the other "C"), but in the specification of [f]printf() it seems 
"character" = "byte" is what is meant. Both the C99 and C11 final drafts have a 
footnote saying "No special provisions are made for multibyte characters."

The sentence "In no case is a partial multibyte character written." only 
applies to %ls format, i.e. when converting a wchar_t* string into a possibly 
multi-byte sequence for a char* string.

The closest analogue to NSString formatting is using %s in [f]wprintf(). In 
this case, characters (i.e., bytes) from the string are "converted as if by 
repeated calls to the mbrtowc function" (with sane initial state), and the 
precision limits the number of wide characters to be written. This is 
unproblematic because wchar_t values are required to be complete code units, but 
Foundation unichars can be UTF-16 surrogates, so this still doesn't resolve the 
issue.

In summary, figure out what Cocoa does. :-)


-- 
Jens Ayton

