On 03/13/2011 10:57 AM, ZY Zhou wrote:
Hi,

I wrote a small program to read and parse HTML (charset=UTF-8). It worked great
until some invalid UTF-8 characters appeared in a page.
When a string is invalid, things like foreach or std.string.tolower
just crash.
This makes the string type totally unusable for processing files, since there
is no guarantee that a UTF-8 file doesn't contain invalid UTF-8 sequences.

So I wrote a UTF-8 decoder myself to convert char[] to dchar[]. In my decoder, I
convert every invalid UTF-8 byte to a low surrogate code point (0x80~0xFF ->
0xDC80~0xDCFF). Since low surrogates are invalid UTF-32 code points, I'm still
able to tell which parts of the string are invalid. Besides, after processing
the dchar[] string, I can still convert it back to a UTF-8 char[] without
altering any of the invalid parts.
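
A minimal sketch of such a round-trip decoder, for illustration only. This is not the poster's actual code: the names decodeLossless and encodeLossless are hypothetical, and it assumes a current Phobos in which std.utf.decode throws UTFException on malformed input.

```d
import std.utf : decode, encode, UTFException;

// Decode UTF-8 into UTF-32, mapping each undecodable byte
// 0x80..0xFF to the low surrogate 0xDC80..0xDCFF.
dchar[] decodeLossless(const(char)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        try
        {
            result ~= decode(s, i); // advances i past one code point
        }
        catch (UTFException)
        {
            // Escape only the offending byte, then resume right after it.
            result ~= cast(dchar)(0xDC00 + cast(ubyte) s[start]);
            i = start + 1;
        }
    }
    return result;
}

// Inverse mapping: escaped surrogates become the original raw bytes,
// everything else is encoded as ordinary UTF-8.
char[] encodeLossless(const(dchar)[] s)
{
    char[] result;
    foreach (dchar c; s)
    {
        if (c >= 0xDC80 && c <= 0xDCFF)
        {
            result ~= cast(char)(c - 0xDC00);
        }
        else
        {
            char[4] buf;
            immutable n = encode(buf, c);
            result ~= buf[0 .. n];
        }
    }
    return result;
}

void main()
{
    // 'a', a stray 0xFF byte, then 'b'
    char[] raw = ['a', cast(char) 0xFF, 'b'];
    auto decoded = decodeLossless(raw);
    assert(decoded[1] == 0xDCFF);           // the bad byte is marked
    assert(encodeLossless(decoded) == raw); // and survives the round trip
}
```

Any code that walks the dchar[] can recognise the damaged positions by testing for the 0xDC80..0xDCFF range.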

But it is still too easy to crash a program with an invalid string.
Is it possible to make this a native feature of string? Or is there another
recommended way to solve this issue?

D native features *must* crash or throw when the source text is invalid. What do you think?
What should a square root function do when you pass it negative input?
/You/ may have special requirements for those cases (ignore it, log it, negate it, replace it with 0 or 1...), but the library must crash anyway. Your requirements are application-specific needs that /you/ must define yourself. Hope I'm clear.

D offers a UTF-8 checking function, std.utf.validate (checking UTF-8 being the same as converting to UTF-32, it just tries to convert and throws when that fails). I would call it before processing, and then do what /you/ expect.
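
A small sketch of that check-first approach, assuming a current Phobos where std.utf.validate throws UTFException on malformed input. The helper name isValidUtf8 is made up for this example; what to do with a rejected string remains the application's decision.

```d
import std.utf : validate, UTFException;

// Returns true if text is well-formed UTF-8, false otherwise.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text); // throws UTFException on the first bad sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    char[] good = "h\u00E9llo".dup;           // contains a two-byte sequence
    char[] bad = ['a', cast(char) 0xFF, 'b']; // 0xFF never occurs in UTF-8
    assert(isValidUtf8(good));
    assert(!isValidUtf8(bad));
}
```

Once a string has passed this check, foreach over dchar and the std.string functions no longer have malformed input to throw on.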

Denis
--
_________________
vita es estrany
spir.wikidot.com
