Re: If invalid string should crash(was:string need to be robust)

2011-03-14 Thread Jussi Jumppanen
%u Wrote:

> I agree with a), but not b). I can't find anything in the Unicode standard that says
> you can use the low surrogates like that.

According to: http://www.cl.cam.ac.uk/~mgk25/

According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a malformed sequence in the same way
that it interprets a character that is outside the adopted subset and
characters that are not within the adopted subset shall be indicated
to the user by a receiving device. A quite commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. 

Refer to this file for the above quote: 

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
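
For illustration only, here is a minimal sketch of the replacement-character approach described in the quote. It assumes a current Phobos, where std.utf.decode accepts the Yes.useReplacementDchar flag (as far as I know this postdates the std.utf being discussed in this thread). The helper name decodeLenient is made up for the example.

```d
import std.typecons : Yes;
import std.utf : decode;

/// Hypothetical helper: decodes UTF-8 leniently, turning every
/// malformed sequence into U+FFFD instead of throwing.
dstring decodeLenient(const(char)[] input)
{
    dstring result;
    size_t i = 0;
    while (i < input.length)
    {
        // With Yes.useReplacementDchar, decode returns U+FFFD for
        // malformed input and still advances i past the bad bytes.
        result ~= decode!(Yes.useReplacementDchar)(input, i);
    }
    return result;
}
```

With this, a lone 0xA0 byte between 'a' and 'b' comes out as 'a', U+FFFD, 'b' rather than as an exception.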



Re: If invalid string should crash(was:string need to be robust)

2011-03-14 Thread ZY Zhou
Thank you Jussi,

But this is still not part of the standard. U+FFFD is one commonly used approach,
while U+DC80..U+DCFF is another common solution
(http://en.wikipedia.org/wiki/Utf8#Invalid_byte_sequences). Different approaches
solve different problems.

I think the current problem in D is that the std.utf module is ill-defined: it is not
designed to make the developer's life easier. It just lets developers ignore
the case where a UTF-8 string is invalid.

--ZY Zhou




Re: If invalid string should crash(was:string need to be robust)

2011-03-14 Thread spir

On 03/14/2011 07:55 AM, ZY Zhou wrote:

> Thank you Jussi,
>
> But this is still not part of the standard. U+FFFD is one commonly used approach,
> while U+DC80..U+DCFF is another common solution
> (http://en.wikipedia.org/wiki/Utf8#Invalid_byte_sequences). Different approaches
> solve different problems.


I am surprised by some of your very assertive statements (throughout the
thread). None of the string processing libs I have come across uses the approach
you propose here, which replaces invalid input with other invalid data (surrogate
values). On the other hand, the replacement character (U+FFFD) mentioned by Jussi
(which I also proposed in a previous post) is a valid Unicode code point; the
same goes for the user-available (private use) areas.



> I think the current problem in D is that the std.utf module is ill-defined: it is not
> designed to make the developer's life easier. It just lets developers ignore
> the case where a UTF-8 string is invalid.


On the contrary, D deals with invalid input perfectly well by signalling it to
you, the programmer. It is not ignored, which would be the worst approach. What to
do with invalid input belongs to your application's logic (as pointed out by
Jonathan); you are demanding that the D standard libs do your job in your place,
exactly the way you want it, using an incorrect approach.
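
A minimal sketch of what "signalling it to the programmer" can look like in application code, assuming std.utf.validate and its UTFException. The function name isWellFormedUtf8 is invented for this example; what the application does with a false result is its own decision.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: reports whether `bytes` is well-formed UTF-8.
bool isWellFormedUtf8(const(char)[] bytes)
{
    try
    {
        validate(bytes); // throws UTFException on the first malformed sequence
        return true;
    }
    catch (UTFException)
    {
        // The library has signalled the problem; the caller chooses the policy
        // (reject the input, substitute, try another encoding, ...).
        return false;
    }
}
```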


Denis






--
_
vita es estrany
spir.wikidot.com



Re: If invalid string should crash(was:string need to be robust)

2011-03-14 Thread Kagamin
Jussi Jumppanen Wrote:

> %u Wrote:
>
> > I agree with a), but not b). I can't find anything in the Unicode standard that says
> > you can use the low surrogates like that.
>
> According to: http://www.cl.cam.ac.uk/~mgk25/
>
> According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
> receiving UTF-8 shall interpret a malformed sequence in the same way
> that it interprets a character that is outside the adopted subset and
> characters that are not within the adopted subset shall be indicated
> to the user by a receiving device. A quite commonly used approach in
> UTF-8 decoders is to replace any malformed UTF-8 sequence by a
> replacement character (U+FFFD), which looks a bit like an inverted
> question mark, or a similar symbol.
>
> Refer to this file for the above quote:
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Sounds like a text rendering guideline rather than a text processing guideline.


Re: If invalid string should crash(was:string need to be robust)

2011-03-14 Thread Simen kjaeraas

ZY Zhou rin...@gmail.com wrote:



> But for the following case, it is completely wrong if it crashes at line 3:


Why? That is the point where you are actually saying 'I care about
individual characters in this string'.



>  1:  char[] c = [0xA0];
>  2:  string s = c.idup;
>  3:  foreach(dchar d; s){}
>
> The expected result is either:
>  a) crash at line 2, c is not valid utf
>     and can't be converted to string


A char[] is just as bound by the rules as is string (which is simply
immutable(char)[]). Thus the program should feel free to expect it
to contain valid utf-8 data. Validating each string upon every single
copy operation is unacceptable overhead.
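
If a failure at the conversion point is what you want, one option is to validate explicitly, once, where untrusted bytes enter the program, and leave ordinary copies unchecked. A minimal sketch, assuming std.utf.validate; the helper name checkedIdup is made up here and is not part of Phobos.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: validates once, then makes an immutable copy.
string checkedIdup(const(char)[] c)
{
    validate(c);   // throws UTFException here if c is malformed UTF-8
    return c.idup; // plain copy, no further checking
}
```

Calling checkedIdup on the array [0xA0] throws at the call, while every later plain .idup stays as cheap as it is today.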



> or:
>  b) don't crash, and d = 0xDCA0;


b is unacceptable in the general case. It may be good for your specific
situation, but in general, it is simply ignoring an error.

--
Simen


If invalid string should crash(was:string need to be robust)

2011-03-13 Thread ZY Zhou
Hi,

Invalid UTF-8 always breaks my program, so I suggested that when invalid UTF-8
needs to be converted to dchar, the decoder should use the low surrogate code
points (DC80~DCFF) instead of crashing the program. But many people here don't
like this idea; you think an exception is the right thing. OK, let me ask you a
question:

Do you always try/catch for invalid UTF when reading a file?
I believe you don't; you simply don't care.

While the text file is invalid, the use case itself is valid. Should a
browser crash on a web page that declares charset=utf8 but has invalid UTF-8 in it?
Exceptions don't help either: using them in this case is almost like writing
a UTF-8 decoder yourself.

Anyway, since I'm already using my own UTF decoder, I don't care whether you agree
with me or not.

But for the following case, it is completely wrong if it crashes at line 3:

 1:  char[] c = [0xA0];
 2:  string s = c.idup;
 3:  foreach(dchar d; s){}

The expected result is either:
 a) crash at line 2, c is not valid utf
and can't be converted to string
or:
 b) don't crash, and d = 0xDCA0;
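
To make option b) concrete, here is a minimal sketch of a decoder that maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF. It is only an illustration built on std.utf.decode and UTFException, not the decoder mentioned in this post; the name decodeWithEscapes is invented for the example.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: decodes UTF-8, mapping each undecodable byte
/// 0x80..0xFF to the lone low surrogate U+DC80..U+DCFF.
dchar[] decodeWithEscapes(const(char)[] input)
{
    dchar[] result;
    size_t i = 0;
    while (i < input.length)
    {
        immutable start = i;
        try
        {
            result ~= decode(input, i); // advances i on success
        }
        catch (UTFException)
        {
            // decode only throws for lead bytes >= 0x80, so this
            // always lands in U+DC80..U+DCFF.
            result ~= cast(dchar)(0xDC00 + cast(ubyte) input[start]);
            i = start + 1; // resume at the next byte
        }
    }
    return result;
}
```

For the array [0xA0] this yields the single value 0xDCA0. Note that the result contains surrogate code points, so it is not itself valid UTF-32 and other std.utf functions may reject it.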


--ZY Zhou


Re: If invalid string should crash(was:string need to be robust)

2011-03-13 Thread %u
> But for the following case, it is completely wrong if it crashes at line 3:
>  1:  char[] c = [0xA0];
>  2:  string s = c.idup;
>  3:  foreach(dchar d; s){}
> The expected result is either:
>  a) crash at line 2, c is not valid utf and can't be converted to string
> or:
>  b) don't crash, and d = 0xDCA0;

I agree with a), but not b). I can't find anything in the Unicode standard that says
you can use the low surrogates like that.