Unicode noncharacters are perfectly valid for the purpose of interchange 
(though as Robert points out, XML has its own ideas about this, separately from 
the Unicode standard).

From <http://www.unicode.org/faq/private_user.html>:

        Q: Are noncharacters invalid in Unicode strings and UTFs?

        A: Absolutely not. Noncharacters do not cause a Unicode string
        to be ill-formed in any UTF. This can be seen explicitly in the
        table above, where every noncharacter code point has a well-
        formed representation in UTF-32, in UTF-16, and in UTF-8. An
        implementation which converts noncharacter code points between
        one UTF representation and another must preserve these values
        correctly. The fact that they are called "noncharacters" and
        are not intended for open interchange does not mean that they
        are somehow illegal or invalid code points which make strings
        containing them invalid.
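
As a quick illustration of the FAQ's point, here is a minimal Python sketch (not taken from the FAQ; the names NONCHAR, ENCODINGS and round_trips are made up for this example) that checks whether U+FFFE survives an encode/decode round trip in each UTF. It assumes Python 3 and uses the explicit-endian codecs so that no BOM is written or consumed along the way.

```python
# U+FFFE is a noncharacter; check that it round-trips through each UTF.
NONCHAR = "\ufffe"

# Explicit-endian codecs are used so Python neither emits nor strips a BOM.
ENCODINGS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def round_trips(text: str, encoding: str) -> bool:
    """Return True if text encodes and decodes back to the same string."""
    return text.encode(encoding).decode(encoding) == text


# Map each encoding name to whether the noncharacter survived the round trip.
results = {enc: round_trips(NONCHAR, enc) for enc in ENCODINGS}

for enc, ok in results.items():
    print(enc, "ok" if ok else "FAILED")
```

If the codecs treated noncharacters as ill-formed, the encode or decode call would raise an error instead of returning the original string.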

Also, from <http://www.unicode.org/versions/corrigendum9.html>:

        Noncharacters […] are not illegal in interchange nor do they
        cause ill-formed Unicode text. […] The real intent of non-
        characters is that they are permanently prohibited from being
        assigned standard, interchangeable meanings, rather than that
        they are prohibited from occurring in Unicode strings which
        happen to be interchanged.

Steve

On Aug 5, 2013, at 3:03 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> 
> : > 0xfffe is not a special character -- it is explicitly *not* a character in
> : > Unicode at all; it is set aside as "not a character" specifically so
> : > that the character 0xfeff can be used as a BOM, and if the BOM is read
> : > incorrectly, it will cause an error.
> : 
> : XML doesn't allow control characters like this; it defines a character as:
> 
> But is that even relevant?  I thought FFFE was *not* a control character? 
> I thought it was completely invalid in Unicode.
> 
> I get that the specific error here is from the XML parser -- but my 
> question is whether U+FFFE is actually valid (in which case perhaps there 
> is something Solr can/should be doing here when serializing/deserializing 
> to "escape" (or maybe just strip) the character), or is this just completely 
> 100% not valid in Unicode at all? (That was my understanding, in which 
> case I don't get why the DB or JDBC driver or JVM didn't complain before 
> Solr ever got it as a String.)
> 
> : Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> : [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
> : blocks, FFFE, and FFFF. */
> 
> 
> -Hoss
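
On the "escape or maybe just strip" idea in the quoted message: a minimal Python sketch of filtering a string against the XML 1.0 Char production quoted there. This is only an illustration, not what Solr or any XML library actually does; the names is_xml_safe and strip_xml_illegal are hypothetical, and the regular expression is simply the complement of the ranges in that production.

```python
import re

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything matched here (C0 controls, surrogates, U+FFFE, U+FFFF) is not
# allowed in an XML 1.0 document.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_safe(text: str) -> bool:
    """Return True if every code point in text matches the Char production."""
    return _XML_ILLEGAL.search(text) is None


def strip_xml_illegal(text: str, replacement: str = "") -> str:
    """Replace every code point outside the Char production (default: drop it)."""
    return _XML_ILLEGAL.sub(replacement, text)


sample = "before\ufffeafter"
print(is_xml_safe(sample))
print(strip_xml_illegal(sample))
```

Whether to strip, replace, or reject such input is a policy choice for the application. The sketch only shows that the check itself is a simple range test, and that it is stricter than Unicode well-formedness.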
