David Hopwood <[EMAIL PROTECTED]> wrote:

> [I've thought about this a bit more, and I'm now convinced that it's
> useful to have a separate, standardised code for this - say
> U+FDEF ILL-FORMED INPUT MARKER. (Can noncharacters have names?)

Nope. They're noncharacters. They do not exist; they never existed.

Why would anyone, faced with a UTF-8 file that contains invalid sequences,
want to retain the invalid sequences, much less convert the file to another
encoding form that either (a) preserves the invalid sequences or (b) leaves
a marker showing where they were? Invalid sequences are garbage. They don't
represent anything, and you can't always even tell what they were supposed
to represent.

>> That's why U+2060 WORD JOINER is being introduced in Unicode 3.2.
>> Hopefully it will take over the ZWNBSP semantics from U+FEFF, which can
>> then be used *solely* as a BOM. Eventually, if this happens, it will
>> become safe to strip BOM's as they appear.
>
> No it won't: silently stripping characters without considering that to be
> a change to the string is a potential security problem. It's unlikely
> that this would be a problem at the start of a *file*, but "UTF-16" in
> the sense of the IANA-registered charset of that name (i.e. swap byte
> order every time you see "U+FFFE", and strip U+FEFF anywhere it appears),
> is simply a bad idea IMHO.

You can never strip or convert anything in complete blindness, of course;
even converting LF to CRLF when moving a file from a Unix system to a
Windows system would affect the CRC, which might cause some alarms to go
off. This is where I agree with you about silently converting non-initial
ZWNBSP to WORD JOINER, as strongly as I support removing the ZWNBSP
semantics from U+FEFF.

If we are talking about a system or application that only needs to preserve
certain semantics for the human reader, it's fine (as are LF->CRLF, stripped
BOM's, and maybe even some edge cases like converting between tabs and
spaces). If there are any security or spoofing concerns, it's best to leave
everything completely untouched.

-Doug Ewell
 Fullerton, California