Re: string need to be robust

2011-03-14 Thread Jacob Carlborg

On 2011-03-14 06:45, ZY Zhou wrote:

it doesn't make sense to add try/catch every time you use tolower/toupper/foreach
on a string. No one will do that.
You either throw an exception when converting invalid UTF-8 bytes to a string, or
you never throw and instead use an invalid UTF-32 value in the dchar to represent
the invalid UTF-8 bytes.

   string s = "\x0A"; // this is the right place to throw the exception (or 
compile
error)
   s.tolower; // no one will add try/catch on this

--ZY Zhou

== Quote from Andrei Alexandrescu (seewebsiteforem...@erdani.org)'s article

On 3/13/11 1:55 PM, Jacob Carlborg wrote:

I would say that the functions should NOT crash but instead throw an
exception. Then the developer can choose what to do when there's an
invalid unicode character.

Yah. In addition, the exception should provide index information such
that an application can elect to continue processing.
Andrei


Depending on what kind of application you're writing, all you may need is to 
wrap the main method in a try/catch.
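
A minimal sketch of that approach (not from the thread; run and the use of 
program arguments are placeholders, and the exception is spelled 
std.utf.UtfException here, as elsewhere in this thread):

---
import std.stdio, std.string, std.utf;

// Hypothetical application logic; it may call tolower, foreach over dchar,
// etc. without any local try/catch.
void run(string[] args)
{
    foreach (a; args[1 .. $])
        writeln(tolower(a));
}

void main(string[] args)
{
    try {
        run(args);
    } catch (UtfException e) {
        // One catch-all spot for the whole program.
        stderr.writeln("invalid UTF-8 in input: ", e.msg);
    }
}
---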


--
/Jacob Carlborg


Re: string need to be robust

2011-03-14 Thread Jacob Carlborg

On 2011-03-13 23:36, KennyTM~ wrote:

On Mar 14, 11 02:55, Jacob Carlborg wrote:

I would say that the functions should NOT crash but instead throw an
exception. Then the developer can choose what to do when there's an
invalid unicode character.


It is already throwing an exception called
core.exception.UnicodeException. This even provides you the index where
decoding failed.

(However Phobos is not using it, AFAIK.)

---
import core.exception, std.stdio, std.conv;

void main() {
char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc];
try {
foreach (dchar d; s){}
} catch (UnicodeException e) {
writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
// error at index 5 (9a)
}
}
---


Good, then we can call it a day :) I assumed it crashed since he said it 
did.


--
/Jacob Carlborg


Re: string need to be robust

2011-03-13 Thread Jonathan M Davis
On Sunday 13 March 2011 22:45:38 ZY Zhou wrote:
> it doesn't make sense to add try/catch every time you use
> tolower/toupper/foreach on a string. No one will do that.
> You either throw an exception when converting invalid UTF-8 bytes to a string,
> or you never throw and instead use an invalid UTF-32 value in the dchar to
> represent the invalid UTF-8 bytes.
> 
>   string s = "\xA0"; // this is the right place to throw the exception (or
> compile error)
>   s.tolower; // no one will add try/catch on this

If you're going to worry about string validity, it should be checked when the 
string is initially created. If it's not valid, then fix it in whatever way you 
deem appropriate. After that, you shouldn't have to worry about string validity 
anymore. Honestly, invalid UTF-8 is pretty rare overall. You'll get it because 
of a bad file or somesuch, but once the string is valid, it stays valid. So, you 
really only have to worry about it when you read in files and such. Once the 
file has been correctly read in, you just use the strings without worrying about 
it. There shouldn't be a need to use try-catch blocks much of anywhere to worry 
about invalid unicode.
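
A minimal sketch of that workflow (the file name is a placeholder; 
std.file.readText validates the contents as it loads them, and UtfException is 
the exception name used elsewhere in this thread):

---
import std.file, std.stdio, std.string, std.utf;

void main()
{
    string text;
    try {
        // readText validates the bytes as UTF-8 while loading, so this is
        // the one place where invalid input can surface.
        text = readText("sample.txt");
    } catch (UtfException e) {
        // Fix or reject the input here, in whatever way you deem appropriate.
        stderr.writeln("sample.txt is not valid UTF-8: ", e.msg);
        return;
    }

    // From here on the string is known to be valid; no try/catch needed.
    string lower = tolower(text);
    foreach (dchar d; lower) { /* ... */ }
}
---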

- Jonathan M Davis


Re: string need to be robust

2011-03-13 Thread KennyTM~

On Mar 14, 11 13:53, Jesse Phillips wrote:

KennyTM~ Wrote:


It is already throwing an exception called
core.exception.UnicodeException. This even provides you the index where
decoding failed.

(However Phobos is not using it, AFAIK.)

---
import core.exception, std.stdio, std.conv;

void main() {
  char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc];
  try {
  foreach (dchar d; s){}
  } catch (UnicodeException e) {
  writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
  // error at index 5 (9a)
  }
}
---


foreach does not use the range interface provided by std.array (that I know 
of). So there should be a bug reported on the current behavior of Access 
Violation.


I haven't checked on Windows, but there is no access violation in 
foreach (http://ideone.com/WSBRm), and there is no access violation in 
the range interface either (as of 2.052 it throws a 
std.utf.UtfException), so I don't know what bug you are talking about.


Re: string need to be robust

2011-03-13 Thread Jesse Phillips
ZY Zhou Wrote:

> it doesn't make sense to add try/catch every time you use
> tolower/toupper/foreach on a string. No one will do that.
> You either throw an exception when converting invalid UTF-8 bytes to a string,
> or you never throw and instead use an invalid UTF-32 value in the dchar to
> represent the invalid UTF-8 bytes.
> 
>   string s = "\xA0"; // this is the right place to throw the exception (or
> compile error)

I don't think this is even a good place to have it. The problem is that most 
strings, valid or invalid, will come in during run-time, and at run-time we have 
std.file.readText, which will validate the string upon entry.

>   s.tolower; // no one will add try/catch on this

Right, that isn't the point of throwing an exception. In this case you have not 
written your program to handle invalid Unicode, which means your program should 
not continue to operate until you decide to fix it. (Exceptions are robust.)

However, an access violation like the one received from using foreach is not 
appropriate and should be filed as a bug. You should be programming under the 
assumption that everything is how you expect it, and verify and remedy external 
inputs at the point of input. If you don't, then the program shouldn't continue 
running.


Re: string need to be robust

2011-03-13 Thread Jesse Phillips
KennyTM~ Wrote:

> It is already throwing an exception called 
> core.exception.UnicodeException. This even provides you the index where 
> decoding failed.
> 
> (However Phobos is not using it, AFAIK.)
> 
> ---
> import core.exception, std.stdio, std.conv;
> 
> void main() {
>  char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc];
>  try {
>  foreach (dchar d; s){}
>  } catch (UnicodeException e) {
>  writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
>  // error at index 5 (9a)
>  }
> }
> ---

foreach does not use the range interface provided by std.array (that I know 
of). So there should be a bug reported on the current behavior of Access 
Violation.


Re: string need to be robust

2011-03-13 Thread ZY Zhou
it doesn't make sense to add try/catch every time you use tolower/toupper/foreach
on a string. No one will do that.
You either throw an exception when converting invalid UTF-8 bytes to a string, or
you never throw and instead use an invalid UTF-32 value in the dchar to represent
the invalid UTF-8 bytes.

  string s = "\x0A"; // this is the right place to throw the exception (or 
compile
error)
  s.tolower; // no one will add try/catch on this

--ZY Zhou

== Quote from Andrei Alexandrescu (seewebsiteforem...@erdani.org)'s article
> On 3/13/11 1:55 PM, Jacob Carlborg wrote:
> > I would say that the functions should NOT crash but instead throw an
> > exception. Then the developer can choose what to do when there's an
> > invalid unicode character.
> Yah. In addition, the exception should provide index information such
> that an application can elect to continue processing.
> Andrei



Re: string need to be robust

2011-03-13 Thread KennyTM~

On Mar 14, 11 02:55, Jacob Carlborg wrote:

I would say that the functions should NOT crash but instead throw an
exception. Then the developer can choose what to do when there's an
invalid unicode character.


It is already throwing an exception called 
core.exception.UnicodeException. This even provides you the index where 
decoding failed.


(However Phobos is not using it, AFAIK.)

---
import core.exception, std.stdio, std.conv;

void main() {
char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc];
try {
foreach (dchar d; s){}
} catch (UnicodeException e) {
writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx]));
// error at index 5 (9a)
}
}
---


Re: string need to be robust

2011-03-13 Thread Andrej Mitrovic
Crash -> Have fun stepping through your code with a debugger or, worse,
staring at disassembly.
Throw -> (Hopefully) get an informative error message, which could
mean you'll be able to fix the bug quickly.


Re: string need to be robust

2011-03-13 Thread Andrei Alexandrescu

On 3/13/11 1:55 PM, Jacob Carlborg wrote:

I would say that the functions should NOT crash but instead throw an
exception. Then the developer can choose what to do when there's an
invalid unicode character.


Yah. In addition, the exception should provide index information such 
that an application can elect to continue processing.


Andrei
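
For illustration, a hedged sketch of what "elect to continue processing" could 
look like, here using std.utf.decode and the UtfException it throws (rather 
than core.exception.UnicodeException's idx field); the substitution policy, 
skip one byte and emit U+FFFD, is just an example, not anything Phobos 
prescribes:

---
import std.stdio, std.utf;

void main()
{
    char[] s = [0x48, 0x69, 0x9a, 0x21];   // "Hi", one invalid byte, "!"
    size_t i = 0;
    while (i < s.length)
    {
        dchar d;
        auto start = i;
        try {
            d = decode(s, i);              // advances i past the code point
        } catch (UtfException e) {
            d = '\uFFFD';                  // substitute the replacement character
            i = start + 1;                 // skip the offending byte and go on
        }
        write(d);
    }
    writeln();                             // prints "Hi", U+FFFD, "!"
}
---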



Re: string need to be robust

2011-03-13 Thread spir

On 03/13/2011 04:43 PM, ZY Zhou wrote:

If an invalid UTF-8 or UTF-16 sequence needs to be converted to UTF-32, then it
should be converted to an invalid UTF-32 value; that's why D800~DFFF are marked
as invalid code points in the Unicode standard.


You are wrong on both points.
First, there is no defined conversion of invalid source data into another 
format/encoding; it should simply be treated as invalid, that's all. A language 
or string-processing library should certainly *not* provide any way to do that. 
Instead, it should just signal the invalidity by crashing or throwing.
Second, the range you mention is not intended for application use; instead it 
is reserved for special use by UTF-16 and is, as such, invalid.


Since the beginning of this thread, you have been demanding that D's standard 
features (the *string types and *char[] arrays) cope with your particular needs 
of the moment and do your job for you; at the price of all other use cases of 
those features potentially becoming insecure or incorrect, of breaking loads of 
existing code that relies on correct behaviour, and of breaking the standard.

Strange.

Denis


== Quote from spir (denis.s...@gmail.com)'s article

This is not a good idea, imo. Surrogate values /are/ invalid code points. (For
those who are wondering: there is a range of /code unit/ values used in UTF-16
to encode code points > 0xFFFF.) They should never appear in a string of
dchar[]; and a string of char[] code units should never encode a non-code point
in the surrogate range.





--
_
vita es estrany
spir.wikidot.com



Re: string need to be robust

2011-03-13 Thread Jacob Carlborg

On 2011-03-13 13:22, spir wrote:

On 03/13/2011 10:57 AM, ZY Zhou wrote:

Hi,

I wrote a small program to read and parse HTML (charset=UTF-8). It worked great
until some invalid UTF-8 bytes appeared in a page.
When the string is invalid, things like foreach or std.string.tolower will just
crash.
This makes the string type totally unusable when processing files, since there
is no guarantee that a UTF-8 file doesn't contain invalid UTF-8 bytes.

So I made a UTF-8 decoder myself to convert char[] to dchar[]. In my decoder, I
convert all invalid UTF-8 bytes to low surrogate code points (0x80~0xFF ->
0xDC80~0xDCFF); since low surrogates are invalid UTF-32 values, I'm still able
to know which part of the string is invalid. Besides, after processing the
dchar[] string, I can still convert it back to UTF-8 char[] without affecting
any of the invalid parts.

But it is still too easy to crash a program with an invalid string.
Is it possible to make this a native feature of string? Or is there any other
recommended method to solve this issue?


D native features *must* crash or throw when the source text is invalid.
What do you think?
What should a square root function do when you pass it negative input?
/You/ may have special requirements for those cases (ignore it, log it,
negate it, replace it with 0 or 1...), but the library must crash anyway.
Your requirements are application-specific needs that /you/ must define
yourself. Hope I'm clear.
D offers a UTF-8 checking function (checking UTF-8 being the same as
converting to UTF-32: it just tries to convert and throws when it fails). I
would use it before processing to do what /you/ expect.

Denis


I would say that the functions should NOT crash but instead throw an 
exception. Then the developer can choose what to do when there's an 
invalid unicode character.


--
/Jacob Carlborg


Re: string need to be robust

2011-03-13 Thread ZY Zhou
If an invalid UTF-8 or UTF-16 sequence needs to be converted to UTF-32, then it
should be converted to an invalid UTF-32 value; that's why D800~DFFF are marked
as invalid code points in the Unicode standard.

== Quote from spir (denis.s...@gmail.com)'s article
> This is not a good idea, imo. Surrogate values /are/ invalid code points. (For
> those who are wondering: there is a range of /code unit/ values used in UTF-16
> to encode code points > 0xFFFF.) They should never appear in a string of
> dchar[]; and a string of char[] code units should never encode a non-code point
> in the surrogate range.




Re: string need to be robust

2011-03-13 Thread Michel Fortin

On 2011-03-13 10:18:24 -0400, ZY Zhou  said:


What if I'm making a text editor with D?
I know the text has something wrong and I want to open it and fix it. The
exception won't help; if the editor just refuses to open an invalid file, then
the editor is useless.
Try opening an invalid UTF file with a text editor like vim and you will
understand what I mean.


But what is the best thing to do when you get an invalid UTF file in a 
text editor? Perhaps you should show a warning to the user, perhaps you 
should also ask the user to select the right text encoding (because it 
might simply not be UTF-8), or perhaps you want to silently ignore the 
error and show an invalid-character marker at the right point in the 
text. All of these options are valid, and the programming language 
shouldn't decide that for you.


So I'd point out that a text file editor is a special use case, most 
programs aren't text file editors and don't share this concern. In the 
same vein, HTML parsers are also a special case that should know how to 
handle encodings. In fact, HTML 5 defines explicitly how to deal with 
invalid UTF-8 sequences.

There are many good ways to deal with invalid UTF-8 sequences. Throwing an 
exception seems like the most robust one to me since it protects against 
invalid input. What to do with invalid input belongs in the application 
logic, not the language.
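
For completeness, a hedged sketch of the "silently substitute a marker" option: 
pre-process the raw bytes and replace each invalid byte with U+FFFD before 
treating the data as a string. (This is a simplification, not the exact HTML 5 
algorithm, which resynchronises differently; the file name is a placeholder and 
toValidUtf8 is not a Phobos facility.)

---
import std.file, std.utf;

// Lenient pass over raw bytes: valid sequences are kept, each invalid byte
// becomes U+FFFD.
string toValidUtf8(const(ubyte)[] raw)
{
    auto src = cast(const(char)[]) raw;
    dchar[] result;
    size_t i = 0;
    while (i < src.length)
    {
        auto start = i;
        try {
            result ~= decode(src, i);
        } catch (UtfException e) {
            result ~= '\uFFFD';
            i = start + 1;
        }
    }
    return toUTF8(result);
}

void main()
{
    auto page = toValidUtf8(cast(ubyte[]) std.file.read("page.html"));
    // page is now valid UTF-8 and can be parsed without further UTF concerns.
}
---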


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/



Re: string need to be robust

2011-03-13 Thread ZY Zhou
What if I'm making a text editor with D?
I know the text has something wrong and I want to open it and fix it. The exception
won't help; if the editor just refuses to open an invalid file, then the editor is
useless.
Try opening an invalid UTF file with a text editor like vim and you will understand
what I mean.

== Quote from spir (denis.s...@gmail.com)'s article
> D offers a UTF-8 checking function (checking UTF-8 being the same as
> converting to UTF-32: it just tries to convert and throws when it fails). I
> would use it before processing to do what /you/ expect.
> Denis



Re: string need to be robust

2011-03-13 Thread spir

On 03/13/2011 01:25 PM, ZY Zhou wrote:

but I think that it's completely unreasonable to expect
>  all of the string-based and/or range-based functions to be able to handle
>  invalid unicode.

As I explained in the first mail, if the UTF-8 parser converts all invalid UTF-8
bytes to low surrogate code points (0x80~0xFF -> 0xDC80~0xDCFF), other
string-related functions will still work fine, and you can also handle these
errors if you want:

string s = "\xa0";
foreach(dchar d; s) {
   if (isValidUnicode(d)) {
 process(d);
   } else {
 handleError(d);
   }
}


PS: You are free to preprocess the source if you like, and convert invalid 
parts into whatever you like. But instead of surrogates, you would do better to 
use one of the freely usable ranges of values; or maybe 0 (so that output won't 
be disturbed); or, better, the replacement character U+FFFD, the code point 
intended for "un-representable" characters, which all fonts correctly interpret 
(and usually display as an inverse-video '?').


Denis
--
_
vita es estrany
spir.wikidot.com



Re: string need to be robust

2011-03-13 Thread spir

On 03/13/2011 01:25 PM, ZY Zhou wrote:

but I think that it's completely unreasonable to expect
>  all of the string-based and/or range-based functions to be able to handle
>  invalid unicode.

As I explained in the first mail, if the UTF-8 parser converts all invalid UTF-8
bytes to low surrogate code points (0x80~0xFF -> 0xDC80~0xDCFF), other
string-related functions will still work fine, and you can also handle these
errors if you want:

string s = "\xa0";
foreach(dchar d; s) {
   if (isValidUnicode(d)) {
 process(d);
   } else {
 handleError(d);
   }
}


This is not a good idea, imo. Surrogate values /are/ invalid code points. (For 
those who are wondering: there is a range of /code unit/ values used in UTF-16 
to encode code points > 0xFFFF.) They should never appear in a string of 
dchar[]; and a string of char[] code units should never encode a non-code point 
in the surrogate range.
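
As a quick worked illustration of that parenthetical (not from the thread): 
this is how UTF-16 spends the reserved D800-DFFF code units to represent a 
code point above 0xFFFF.

---
import std.stdio;

void main()
{
    uint cp = 0x1F600;                 // a code point above 0xFFFF
    uint v  = cp - 0x10000;            // 20 bits, split over two code units
    uint hi = 0xD800 + (v >> 10);      // lead (high) surrogate
    uint lo = 0xDC00 + (v & 0x3FF);    // trail (low) surrogate
    writefln("U+%05X -> %04X %04X", cp, hi, lo);   // U+1F600 -> D83D DE00
}
---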


Denis
--
_
vita es estrany
spir.wikidot.com



Re: string need to be robust

2011-03-13 Thread spir

On 03/13/2011 12:34 PM, ZY Zhou wrote:

std.utf throws an exception instead of crashing the program, but you still need
to add try/catch everywhere.

My point is: this simple code should work; instead of crashing, it is supposed to
leave all invalid codes untouched and just process the valid parts.

Stream file = new BufferedFile("sample.txt");
foreach(char[] line; file) {
    string s = line.idup.tolower;
}


The line reader /must/ crash or throw if the source is invalid. If you do not 
want to pre-process the source (and convert to surrogates as you do, which IMO 
is not a very good idea) /before/ running the processing loop, then read the 
source as plain byte[], check and pre-process invalid text in that form, and 
only then convert to string so you can use the string functions.

There is no builtin type in D that takes random input as a supposed string 
(text) /without/ checking it and still allows string operations on it as if it 
had been checked to be valid. That would be foolish, don't you think?
The D way of checking the source of strings at construction time allows 
checking only once, and then safely applying any number of operations. The 
opposite would require a check at each operation. And it would still throw or 
crash, which is not what you expect, I guess: IIUC, you would like invalid 
input to be ignored.
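
A minimal sketch of that "check once at construction, then operate freely" 
order of work (the file name is a placeholder; std.utf.validate throws on the 
first malformed sequence it finds):

---
import std.file, std.string, std.utf;

void main()
{
    // Read the source as plain bytes; nothing here assumes valid UTF-8 yet.
    auto raw = cast(char[]) std.file.read("sample.txt");

    // Check (and, if you want, repair) exactly once, up front.
    validate(raw);                      // throws if a malformed sequence is found

    // From here on the data is known valid, so string operations can be
    // applied any number of times without further checks.
    string s = raw.idup;
    auto lower = tolower(s);
    foreach (dchar d; lower) { /* ... */ }
}
---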


Denis
--
_
vita es estrany
spir.wikidot.com



Re: string need to be robust

2011-03-13 Thread ZY Zhou
> but I think that it's completely unreasonable to expect
> all of the string-based and/or range-based functions to be able to handle
> invalid unicode.

As I explained in the first mail, if the UTF-8 parser converts all invalid UTF-8
bytes to low surrogate code points (0x80~0xFF -> 0xDC80~0xDCFF), other
string-related functions will still work fine, and you can also handle these
errors if you want:

string s = "\xa0";
foreach(dchar d; s) {
  if (isValidUnicode(d)) {
process(d);
  } else {
handleError(d);
  }
}
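
To make the scheme concrete, here is a hedged sketch of such a decoder and its 
inverse (isValidUnicode, process and handleError above are placeholders; the 
mapping is the poster's own convention, not anything in Phobos, and UtfException 
is the exception name used elsewhere in this thread):

---
import std.utf;

// Lenient decode: bytes that do not form a valid UTF-8 sequence are mapped
// into the low-surrogate range (0x80~0xFF -> 0xDC80~0xDCFF) so they survive
// a round trip through dchar[].
dchar[] decodeLossless(const(char)[] src)
{
    dchar[] result;
    size_t i = 0;
    while (i < src.length)
    {
        auto start = i;
        try {
            result ~= decode(src, i);
        } catch (UtfException e) {
            // The offending byte is always >= 0x80 (ASCII never fails).
            result ~= cast(dchar)(0xDC00 + cast(ubyte) src[start]);
            i = start + 1;
        }
    }
    return result;
}

// Round trip back to UTF-8: escaped bytes are emitted unchanged, everything
// else is re-encoded normally.
char[] encodeLossless(const(dchar)[] src)
{
    char[] result;
    foreach (d; src)
    {
        if (d >= 0xDC80 && d <= 0xDCFF)
            result ~= cast(char)(d - 0xDC00);
        else
            encode(result, d);
    }
    return result;
}
---

Getting encodeLossless(decodeLossless(x)) back byte-for-byte equal to x is the 
round-trip property being described above.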


== Quote from Jonathan M Davis (jmdavisp...@gmx.com)'s article
> On Sunday 13 March 2011 04:34:24 ZY Zhou wrote:
> > std.utf throw exception instead of crash the program. but you still need to
> > add try/catch everywhere.
> >
> > My point is: this simple code should work, instead of crash, it is supposed
> > to leave all invalid codes untouched and just process the valid parts.
> >
> > Stream file = new BufferedFile("sample.txt");
> > foreach(char[] line; file) {
> >string s = line.idup.tolower;
> > }
> I think that it's completely unreasonable to expect all string functions to
> worry about whether they're dealing with valid unicode or not. And a lot of
> string stuff would involve ranges which would require converting each code 
> point
> to UTF-32. And how is it supposed to do _that_ with invalid UTF-8?
> I don't know how you expect to really be able to do anything with invalid 
> UTF-8
> anyway. There may be something that could be added to std.utf to help better
> handle the situation, but I think that it's completely unreasonable to expect
> all of the string-based and/or range-based functions to be able to handle
> invalid unicode.
> - Jonathan M Davis



Re: string need to be robust

2011-03-13 Thread spir

On 03/13/2011 10:57 AM, ZY Zhou wrote:

Hi,

I wrote a small program to read and parse HTML (charset=UTF-8). It worked great
until some invalid UTF-8 bytes appeared in a page.
When the string is invalid, things like foreach or std.string.tolower will
just crash.
This makes the string type totally unusable when processing files, since there
is no guarantee that a UTF-8 file doesn't contain invalid UTF-8 bytes.

So I made a UTF-8 decoder myself to convert char[] to dchar[]. In my decoder, I
convert all invalid UTF-8 bytes to low surrogate code points (0x80~0xFF ->
0xDC80~0xDCFF); since low surrogates are invalid UTF-32 values, I'm still able to
know which part of the string is invalid. Besides, after processing the
dchar[] string, I can still convert it back to UTF-8 char[] without affecting
any of the invalid parts.

But it is still too easy to crash a program with an invalid string.
Is it possible to make this a native feature of string? Or is there any other
recommended method to solve this issue?


D native features *must* crash or throw when the source text is invalid. What 
do you think?

What should a square root function do when you pass it negative input?
/You/ may have special requirements for those cases (ignore it, log it, negate 
it, replace it with 0 or 1...), but the library must crash anyway. Your 
requirements are application-specific needs that /you/ must define yourself. 
Hope I'm clear.
D offers a UTF-8 checking function (checking UTF-8 being the same as 
converting to UTF-32: it just tries to convert and throws when it fails). I 
would use it before processing to do what /you/ expect.


Denis
--
_
vita es estrany
spir.wikidot.com



Re: string need to be robust

2011-03-13 Thread %u
== Quote from ZY Zhou (rin...@geemail.com)'s article
> std.utf throws an exception instead of crashing the program, but you still
> need to add try/catch everywhere.
> My point is: this simple code should work; instead of crashing, it is supposed
> to leave all invalid codes untouched and just process the valid parts.
> Stream file = new BufferedFile("sample.txt");
> foreach(char[] line; file) {
>string s = line.idup.tolower;
> }

At first glance I would like to overload the opInvalid? function in the stream
such that, instead of a throw, the char would be ignored.
Disclaimer: I don't know D2 nor streams :)


Re: string need to be robust

2011-03-13 Thread Jonathan M Davis
On Sunday 13 March 2011 04:34:24 ZY Zhou wrote:
> std.utf throws an exception instead of crashing the program, but you still
> need to add try/catch everywhere.
> 
> My point is: this simple code should work; instead of crashing, it is supposed
> to leave all invalid codes untouched and just process the valid parts.
> 
> Stream file = new BufferedFile("sample.txt");
> foreach(char[] line; file) {
>string s = line.idup.tolower;
> }

I think that it's completely unreasonable to expect all string functions to 
worry about whether they're dealing with valid unicode or not. And a lot of 
string stuff would involve ranges which would require converting each code 
point to UTF-32. And how is it supposed to do _that_ with invalid UTF-8?

I don't know how you expect to really be able to do anything with invalid UTF-8 
anyway. There may be something that could be added to std.utf to help better 
handle the situation, but I think that it's completely unreasonable to expect 
all of the string-based and/or range-based functions to be able to handle 
invalid unicode.

- Jonathan M Davis


Re: string need to be robust

2011-03-13 Thread ZY Zhou
std.utf throws an exception instead of crashing the program, but you still need
to add try/catch everywhere.

My point is: this simple code should work; instead of crashing, it is supposed to
leave all invalid codes untouched and just process the valid parts.

Stream file = new BufferedFile("sample.txt");
foreach(char[] line; file) {
   string s = line.idup.tolower;
}



Re: string need to be robust

2011-03-13 Thread Jonathan M Davis
On Sunday 13 March 2011 01:57:12 ZY Zhou wrote:
> Hi,
> 
> I wrote a small program to read and parse HTML (charset=UTF-8). It worked
> great until some invalid UTF-8 bytes appeared in a page.
> When the string is invalid, things like foreach or std.string.tolower will
> just crash.
> This makes the string type totally unusable when processing files, since
> there is no guarantee that a UTF-8 file doesn't contain invalid UTF-8 bytes.
> 
> So I made a UTF-8 decoder myself to convert char[] to dchar[]. In my
> decoder, I convert all invalid UTF-8 bytes to low surrogate code
> points (0x80~0xFF -> 0xDC80~0xDCFF); since low surrogates are invalid UTF-32
> values, I'm still able to know which part of the string is invalid.
> Besides, after processing the dchar[] string, I can still convert it back
> to UTF-8 char[] without affecting any of the invalid parts.
> 
> But it is still too easy to crash a program with an invalid string.
> Is it possible to make this a native feature of string? Or is there any
> other recommended method to solve this issue?

Check out std.utf. It has the functions for dealing with unicode stuff.

- Jonathan M Davis