Re: string need to be robust

2011-03-14 Thread Jesse Phillips
ZY Zhou Wrote: It doesn't make sense to add try/catch every time you use tolower/toupper/foreach on a string. No one will do that. You either throw an exception when converting invalid UTF-8 bytes to a string, or never throw and use an invalid UTF-32 code in dchar to represent invalid UTF-8

Re: string need to be robust

2011-03-14 Thread KennyTM~
On Mar 14, 11 13:53, Jesse Phillips wrote: KennyTM~ Wrote: It is already throwing an exception called core.exception.UnicodeException. This even provides you the index where decoding failed. (However Phobos is not using it, AFAIK.) --- import core.exception, std.stdio, std.conv;

Re: string need to be robust

2011-03-14 Thread Jonathan M Davis
On Sunday 13 March 2011 22:45:38 ZY Zhou wrote: It doesn't make sense to add try/catch every time you use tolower/toupper/foreach on a string. No one will do that. You either throw an exception when converting invalid UTF-8 bytes to a string, or never throw and use an invalid UTF-32 code in dchar

Re: string need to be robust

2011-03-14 Thread Jacob Carlborg
On 2011-03-13 23:36, KennyTM~ wrote: On Mar 14, 11 02:55, Jacob Carlborg wrote: I would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character. It is already throwing an exception called

Re: string need to be robust

2011-03-14 Thread Jacob Carlborg
On 2011-03-14 06:45, ZY Zhou wrote: It doesn't make sense to add try/catch every time you use tolower/toupper/foreach on a string. No one will do that. You either throw an exception when converting invalid UTF-8 bytes to a string, or never throw and use an invalid UTF-32 code in dchar to represent

string need to be robust

2011-03-13 Thread ZY Zhou
Hi, I wrote a small program to read and parse HTML (charset=UTF-8). It worked great until some invalid UTF-8 chars appeared in the page. When the string is invalid, things like foreach or std.string.tolower will just crash. This makes the string type totally unusable when processing files, since

Re: string need to be robust

2011-03-13 Thread Jonathan M Davis
On Sunday 13 March 2011 01:57:12 ZY Zhou wrote: Hi, I wrote a small program to read and parse HTML (charset=UTF-8). It worked great until some invalid UTF-8 chars appeared in the page. When the string is invalid, things like foreach or std.string.tolower will just crash. This makes the string

Re: string need to be robust

2011-03-13 Thread ZY Zhou
std.utf throws an exception instead of crashing the program, but you still need to add try/catch everywhere. My point is: this simple code should work. Instead of crashing, it is supposed to leave all invalid codes untouched and just process the valid parts. Stream file = new BufferedFile("sample.txt");

Re: string need to be robust

2011-03-13 Thread Jonathan M Davis
On Sunday 13 March 2011 04:34:24 ZY Zhou wrote: std.utf throws an exception instead of crashing the program, but you still need to add try/catch everywhere. My point is: this simple code should work. Instead of crashing, it is supposed to leave all invalid codes untouched and just process the valid

Re: string need to be robust

2011-03-13 Thread %u
== Quote from ZY Zhou (rin...@geemail.com)'s article std.utf throws an exception instead of crashing the program, but you still need to add try/catch everywhere. My point is: this simple code should work. Instead of crashing, it is supposed to leave all invalid codes untouched and just process the

Re: string need to be robust

2011-03-13 Thread spir
On 03/13/2011 10:57 AM, ZY Zhou wrote: Hi, I wrote a small program to read and parse HTML (charset=UTF-8). It worked great until some invalid UTF-8 chars appeared in the page. When the string is invalid, things like foreach or std.string.tolower will just crash. This makes the string type totally

Re: string need to be robust

2011-03-13 Thread ZY Zhou
but I think that it's completely unreasonable to expect all of the string-based and/or range-based functions to be able to handle invalid Unicode. As I explained in the first mail, if the UTF-8 parser converts all invalid UTF-8 chars to low surrogate code points (0x80~0xFF -> 0xDC80~0xDCFF), other
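A rough sketch of the mapping proposed in this message, for illustration only: well-formed UTF-8 sequences are decoded normally, and every byte that cannot be decoded is stored as a low surrogate (0xDC00 + byte), so nothing is lost and nothing is thrown to the caller. The name decodeLossless is hypothetical, and the sketch assumes a current Phobos in which std.utf.decode throws UTFException on malformed input. It is not how Phobos behaves by default.

```d
import std.stdio : writefln;
import std.utf : decode, UTFException;

// Hypothetical helper: decodes UTF-8 into dchars, mapping each
// undecodable byte 0x80..0xFF to the low surrogate 0xDC80..0xDCFF.
dchar[] decodeLossless(const(char)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t next = i; // scratch index, only trusted on success
        try
        {
            result ~= decode(s, next);
            i = next;
        }
        catch (UTFException e)
        {
            // ASCII never fails to decode, so s[i] >= 0x80 here.
            result ~= cast(dchar)(0xDC00 + s[i]);
            ++i;
        }
    }
    return result;
}

void main()
{
    // 'a', a stray 0xFF byte, 'b'
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    foreach (c; decodeLossless(cast(const(char)[]) raw))
        writefln("U+%04X", cast(uint) c);
}
```

Note that the resulting dchar values in the surrogate range are themselves not valid code points, which is the objection raised elsewhere in this thread; any function consuming such an array would have to be written to tolerate them.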

Re: string need to be robust

2011-03-13 Thread spir
On 03/13/2011 12:34 PM, ZY Zhou wrote: std.utf throws an exception instead of crashing the program, but you still need to add try/catch everywhere. My point is: this simple code should work. Instead of crashing, it is supposed to leave all invalid codes untouched and just process the valid parts. Stream

Re: string need to be robust

2011-03-13 Thread spir
On 03/13/2011 01:25 PM, ZY Zhou wrote: but I think that it's completely unreasonable to expect all of the string-based and/or range-based functions to be able to handle invalid Unicode. As I explained in the first mail, if the UTF-8 parser converts all invalid UTF-8 chars to low surrogate code

Re: string need to be robust

2011-03-13 Thread spir
On 03/13/2011 01:25 PM, ZY Zhou wrote: but I think that it's completely unreasonable to expect all of the string-based and/or range-based functions to be able to handle invalid Unicode. As I explained in the first mail, if the UTF-8 parser converts all invalid UTF-8 chars to low surrogate code

Re: string need to be robust

2011-03-13 Thread ZY Zhou
What if I'm making a text editor with D? I know the text has something wrong and I want to open it and fix it. The exception won't help: if the editor just refuses to open an invalid file, then the editor is useless. Try opening an invalid UTF file with a text editor like vim and you will understand what I

Re: string need to be robust

2011-03-13 Thread Michel Fortin
On 2011-03-13 10:18:24 -0400, ZY Zhou rin...@geeemail.com said: What if I'm making a text editor with D? I know the text has something wrong and I want to open it and fix it. The exception won't help: if the editor just refuses to open an invalid file, then the editor is useless. Try opening an invalid

Re: string need to be robust

2011-03-13 Thread ZY Zhou
If an invalid UTF-8 or UTF-16 code needs to be converted to UTF-32, then it should be converted to an invalid UTF-32 code. That's why D800~DFFF are marked as invalid code points in the Unicode standard. == Quote from spir (denis.s...@gmail.com)'s article This is not a good idea, imo. Surrogate values /are/ invalid

Re: string need to be robust

2011-03-13 Thread Jacob Carlborg
On 2011-03-13 13:22, spir wrote: On 03/13/2011 10:57 AM, ZY Zhou wrote: Hi, I wrote a small program to read and parse HTML (charset=UTF-8). It worked great until some invalid UTF-8 chars appeared in the page. When the string is invalid, things like foreach or std.string.tolower will just crash.

Re: string need to be robust

2011-03-13 Thread spir
On 03/13/2011 04:43 PM, ZY Zhou wrote: If an invalid UTF-8 or UTF-16 code needs to be converted to UTF-32, then it should be converted to an invalid UTF-32 code. That's why D800~DFFF are marked as invalid code points in the Unicode standard. You are wrong on both points. First, there is no definition of invalid

Re: string need to be robust

2011-03-13 Thread Andrei Alexandrescu
On 3/13/11 1:55 PM, Jacob Carlborg wrote: I would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character. Yah. In addition, the exception should provide index information such that an

Re: string need to be robust

2011-03-13 Thread Andrej Mitrovic
Crash - Have fun stepping through your code with a debugger, or worse, observe disassembly. Throw - (Hopefully) get an informative error message, which could mean you'll be able to fix the bug quickly.

Re: string need to be robust

2011-03-13 Thread ZY Zhou
It doesn't make sense to add try/catch every time you use tolower/toupper/foreach on a string. No one will do that. You either throw an exception when converting invalid UTF-8 bytes to a string, or never throw and use an invalid UTF-32 code in dchar to represent the invalid UTF-8 code. string s = \x0A;

Re: string need to be robust

2011-03-13 Thread Jesse Phillips
KennyTM~ Wrote: It is already throwing an exception called core.exception.UnicodeException. This even provides you the index where decoding failed. (However Phobos is not using it, AFAIK.) --- import core.exception, std.stdio, std.conv; void main() { char[] s = [0x0f,
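The excerpt above is cut off, so the following is not a reconstruction of that code. It is a separate minimal sketch of the "validate once at the boundary" approach discussed in this thread: check the bytes when they enter the program, so later calls do not each need their own try/catch. It assumes a current Phobos, where std.utf.validate throws std.utf.UTFException (rather than core.exception.UnicodeException) on malformed input; the helper name isValidUtf8 is hypothetical.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    auto s = cast(const(char)[]) raw;

    if (isValidUtf8(s))
        writeln("input is valid UTF-8");
    else
        writeln("input contains an invalid UTF-8 sequence");
}
```

With a check like this at the point where file data becomes a string, the rest of the program can treat strings as valid and let any later decoding error surface as a genuine bug.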