RE: What are the Unicode encodings supported by Solr?

Steven A Rowe Thu, 07 May 2009 11:15:13 -0700

Hi KK,

On 5/7/2009 at 2:55 AM, KK wrote:
> In some of the pages I'm getting some \ufffd chars, which I think is
> some sort of unmappable [by Java?] character, right? Any idea on how
> to handle this? Just replacing them with a blank char will not do [this
> depends on the requirement, though].


From <http://www.unicode.org/charts/PDF/UFFF0.pdf>:

        FFFD: REPLACEMENT CHARACTER: used to replace an
        incoming character whose value is unknown or
        unrepresentable in Unicode.

Also, from <http://www.unicode.org/versions/Unicode5.1.0/>:

        Applications are free to use any of these noncharacter
        code points internally but should never attempt to
        exchange them. If a noncharacter is received in open
        interchange, an application is not required to
        interpret it in any way. It is good practice, however,
        to recognize it as a noncharacter and to take
        appropriate action, such as replacing it with U+FFFD
        REPLACEMENT CHARACTER, to indicate the problem in the
        text. It is not recommended to simply delete
        noncharacter code points from such text, because of
        the potential security issues caused by deleting
        uninterpreted characters. (See conformance clause C7
        in Section 3.2, Conformance Requirements, and Unicode
        Technical Report #36, "Unicode Security
        Considerations.")

So if you're seeing \ufffd in text, you (or someone before you in the
processing chain) attempted to convert the text from some other encoding into
Unicode, but the conversion failed for that character: either the source bytes
were not valid in the assumed encoding, or the source character has no
corresponding Unicode character.  This can happen when attempting to convert
from an incorrectly identified source encoding.
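
To make that failure mode concrete, here is a minimal Java sketch (not
anything from Solr itself; the class and method names are made up for
illustration).  It decodes Latin-1 bytes as if they were UTF-8, which
silently yields U+FFFD, and then shows a strict decode that reports the
problem instead, so you can catch a misidentified encoding before the text
gets indexed:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Solr or Lucene.
final class DecodeCheck {

    // Lenient decode: malformed or unmappable input is silently replaced
    // with the charset's replacement string (U+FFFD for UTF-8).
    static String decodeLenient(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict decode: throws CharacterCodingException instead of
    // substituting U+FFFD, so a wrong encoding guess is detected early.
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        // "cafe" with an acute accent on the e, encoded as ISO-8859-1.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong guess (UTF-8): the byte 0xE9 is not valid UTF-8 here.
        String wrong = decodeLenient(latin1, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + (wrong.indexOf('\ufffd') >= 0));

        try {
            decodeStrict(latin1, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected the input: " + e);
        }

        // Right guess (ISO-8859-1): round-trips cleanly.
        String right = decodeStrict(latin1, StandardCharsets.ISO_8859_1);
        System.out.println("round trip ok: " + right.equals("caf\u00e9"));
    }
}
```

Note that once \ufffd is in the text, the original character is gone; the
strict variant only helps if you still have the original bytes and can retry
with a different encoding.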

Steve
