On Thursday, March 20, 2014 15:39:50 Walter Bright wrote:
> Currently we do it by throwing a UTFException. This has problems:
>
> 1. about anything that deals with UTF cannot be made nothrow
>
> 2. turns innocuous errors into major problems, such as DOS attack vectors
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
>
> One option to fix this is to treat invalid sequences as:
>
> 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
>
> 2. U+FFFD
>
> I kinda like option 1.
>
> What do you think?

After a discussion on this a few weeks back (where I was in favor of the current behavior when the discussion started), I'm now completely in favor of making it so that std.utf.decode simply replaces invalid code points with U+FFFD per the standard.

Most code won't care and will continue to work as before. The main difference is that invalid Unicode would then fall in the same category as when a program is given a string with characters that it's not supposed to be given. Any code that checks for that sort of thing will then treat invalid Unicode as it would have treated other invalid strings, and code that doesn't care will continue to not care except that now it will work with invalid Unicode instead of throwing.

A prime example is something like find. What does it care if it's given invalid Unicode? It will simply look for what you tell it to look for, and if it's not there, it won't find it. U+FFFD will just be one more character that doesn't match what it's looking for.

The few programs that really care about whether a string that they're given contains any invalid Unicode can simply validate the string ahead of time. The main problem there is that we need to replace std.utf.validate with something like std.utf.isValidUnicode, because validate makes the horrendous decision of throwing rather than returning a bool (which is what triggered the previous discussion on the topic IIRC).

There may be some concern about this change silently changing behavior, but I think that the reality is that the vast majority of programs will continue to work just fine, and our string processing code will be that much cleaner and faster as a result. So, I'm very much inclined to take the path of making this change and putting a warning about it in the changelog rather than not making the change or trying to do this alongside what we currently have.

- Jonathan M Davis
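
As a rough illustration of the bool-returning check mentioned above, here is a minimal sketch that wraps the existing std.utf.validate. The name isValidUnicode is only the one floated in this post, not an existing Phobos function, and a real implementation would presumably scan the string directly rather than catch an exception:

```d
import std.utf : validate, UTFException;

// Hypothetical helper: the name is taken from the post, it is not part of Phobos.
// Returns true if str is well-formed UTF-8, false otherwise.
bool isValidUnicode(const(char)[] str)
{
    try
    {
        // std.utf.validate throws UTFException on the first invalid sequence.
        validate(str);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

unittest
{
    // 0xFF can never appear in well-formed UTF-8.
    const(char)[] bad = ['a', cast(char) 0xFF, 'b'];

    assert(isValidUnicode("hello"));
    assert(!isValidUnicode(bad));
}
```

Code that cares about invalid input would call this once up front and branch on the result; everything else would just process the string and see U+FFFD where the bad sequences were.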