https://issues.dlang.org/show_bug.cgi?id=14519
--- Comment #10 from Vladimir Panteleev <thecybersha...@gmail.com> ---
OK, I see from your post that you don't see many of the problems with the replacement character. Let me show you some example problematic situations:

1. Bob wants to update his company's documents to use the new name for his product. He writes a program that does a recursive pattern search & replace in a directory. After testing the program on a few sample files, he is satisfied with the results, and runs the program on his company's document store. Six months later, long after the documents went out of backup rotation, Sue finds that some important historical documents have been irreversibly corrupted and are full of Unicode replacement characters encoded as UTF-8. Why? Because these old documents did not use UTF-8, and Bob used D.

2. Bob is writing a secure server-side software package (let's say, a confidential document store). He is using a std.algorithm-based hashing algorithm to store the passwords securely. At some point, Mary signs up and creates a secure password, which consists entirely of Cyrillic letters (let's say, "ЭтоМойПароль"). Not long after, Eve successfully logs into Mary's account with the password "ЯЯЯЯЯЯЯЯЯЯЯЯ". Why? Because the passwords just happened to be sent in some non-UTF-8 encoding, and, since Bob used D, when "normalized" through std.algorithm's replacement character substitution, all passwords of the same length that consist entirely of non-ASCII characters end up with the same hash.

Automatic use of the replacement character will come as a surprise to many people who come from other languages. For example, in Delphi, strings are also the de-facto ubyte[] / void[] type - you can safely read a binary file into a string, perform search and replace, and write it back, knowing that the result will be exactly what you expected.
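To make scenario 2 concrete, here is a minimal sketch of the collision. It is written in Python rather than D and does not use std.algorithm; it only mimics the policy under discussion (invalid UTF-8 silently decoded to U+FFFD). The function names are hypothetical, Windows-1251 is an assumed stand-in for "some non-UTF-8 encoding", and plain SHA-256 is an assumed stand-in for whatever hashing Bob uses.

```python
import hashlib

def lossy_hash(raw: bytes) -> str:
    """Hypothetical: decode with silent U+FFFD substitution, then hash."""
    text = raw.decode("utf-8", errors="replace")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def strict_hash(raw: bytes) -> str:
    """Hypothetical: refuse invalid UTF-8 instead of substituting."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on bad input
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Assumption: the client sent the passwords as Windows-1251, not UTF-8.
mary_raw = "ЭтоМойПароль".encode("cp1251")
eve_raw = "ЯЯЯЯЯЯЯЯЯЯЯЯ".encode("cp1251")

# The raw bytes differ, but every byte is invalid as UTF-8, so both
# inputs decode to the same run of replacement characters.
print(mary_raw != eve_raw)
print(lossy_hash(mary_raw) == lossy_hash(eve_raw))

try:
    strict_hash(mary_raw)
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

With the strict variant the bad input is rejected at the point of decoding, so the encoding mismatch is noticed instead of being turned into a shared hash.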

Furthermore, from your message it appears to me that you've missed the point of my argument:

> What do you do if you read in an XML file and process half of it before you
> hit invalid Unicode?

You abort! This should not happen. Either the XML file is in an incorrect encoding (which calls into question the integrity of all the data parsed so far - what if it was some 8-bit encoding that only LOOKED like valid UTF-8?) or the program should've sanitized the input first if it really didn't care about data correctness. But this is an XML file, meaning it's very likely to be machine generated - if it contains errors, it might indicate a problem somewhere else in the system, which is why it's all the more important to abort and get the user to figure out the true source of the problem. Ignoring the error here reminds me of how PHP never stops on errors by default, or Basic's "ON ERROR RESUME NEXT".

> So, throwing an Error is forcing everyone to validate the Unicode in their
> strings whether they care or not, and using the replacement character will
> work, whereas the programs that do care about validating their strings should
> be doing the validation up front anyway.

Yes, but then there is no way to make sure you're not accidentally corrupting data! Whereas currently we at least have a runtime check against invalid UTF-8, with this change we will have no check at all. With no automatic mechanism to ensure that all text is sanitized before it gets into std.algorithm, it becomes impossible to be sure that you're not accidentally corrupting data along the way.
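
The difference between aborting and silently substituting can be sketched with scenario 1's search & replace. Again this is Python, not D, and only illustrates the two policies; the function names, the sample text, and the choice of Latin-1 as the legacy encoding are all assumptions made for the example.

```python
def replace_lossy(raw: bytes, old: str, new: str) -> bytes:
    """Hypothetical: invalid UTF-8 silently becomes U+FFFD."""
    text = raw.decode("utf-8", errors="replace")
    return text.replace(old, new).encode("utf-8")

def replace_strict(raw: bytes, old: str, new: str) -> bytes:
    """Hypothetical: invalid UTF-8 aborts before anything is written."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on bad input
    return text.replace(old, new).encode("utf-8")

# Assumption: an old document stored as Latin-1, where "é" is byte 0xE9.
legacy = "Café menu for OldName".encode("latin-1")

corrupted = replace_lossy(legacy, "OldName", "NewName")
# The rename succeeded, which is why the sample run looked fine, but the
# original 0xE9 byte is gone and cannot be recovered from the output.
print(b"NewName" in corrupted)
print(b"\xe9" in corrupted)

try:
    replace_strict(legacy, "OldName", "NewName")
except UnicodeDecodeError as err:
    print("aborted at byte offset", err.start)
```

In the lossy variant nothing signals that data was destroyed, whereas the strict variant stops at the first invalid byte and leaves the file as it was.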