On 8/5/2013 12:12 PM, Federico Chiacchiaretta wrote:
Hi Raymond,
I agree with you, 0xfffe is a special character, that is why I was asking
how it's handled in solr.
In my document, 0xfffe does not appear at the beginning, it's in the
content.

I believe that 0xfffe is not a valid UTF-8 character, and its presence indicates something is wrong with your Postgres driver, server, or the data in the database. I use a UTF-8 encoded MySQL database with Solr and have no problems. I've used most versions between 1.4.0 and 4.4.0.
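
One way to narrow down where the bad value comes from is to scan the text for Unicode noncharacters before it is sent to Solr. Here is a minimal Python sketch, assuming the field value has already been decoded into a string; the function name find_noncharacters is made up for illustration and is not part of Solr or any driver:

```python
def find_noncharacters(text):
    """Return (index, 'U+XXXX') pairs for Unicode noncharacters in text."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # U+FDD0..U+FDEF, plus the last two code points of every plane
        # (U+FFFE, U+FFFF, U+1FFFE, ...), are noncharacters.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            hits.append((i, "U+%04X" % cp))
    return hits


print(find_noncharacters("some \ufffe content"))  # [(5, 'U+FFFE')]
```

Running something like this against the raw database values and against what the driver returns should show which layer introduces the character.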

Although I'm sure that UTF-8 and Unicode are not exactly the same thing for all characters, I think that for this particular case we can treat them the same:

en.wikipedia.org/wiki/Specials_(Unicode_block)

Relevant excerpt: "FFFE and FFFF are not unassigned in the usual sense, but guaranteed not to be a Unicode character at all. They can be used to guess a text's encoding scheme, since any text containing these is by definition not a correctly encoded Unicode text. The U+FEFF is Unicode's byte-order mark, named "zero-width no-break space" (as inclusion of it in text shall not be noticed). If this character is read in the wrong byte order (for example, due to an endianness bug), it will read 0xFFFE, which is illegal Unicode."
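
The byte-order mixup described in that excerpt is easy to reproduce. A small Python sketch, offered only as an illustration and not as what your driver actually does, that encodes the byte-order mark as big-endian UTF-16 and then decodes the same two bytes as little-endian:

```python
# U+FEFF (the byte-order mark) encoded as big-endian UTF-16 is the bytes FE FF.
bom_be = "\ufeff".encode("utf-16-be")

# Decoding the same bytes with the wrong byte order yields U+FFFE.
swapped = bom_be.decode("utf-16-le")

print(bom_be.hex())        # feff
print(hex(ord(swapped)))   # 0xfffe
```

If 0xfffe shows up in the middle of your content, a mechanism like this (UTF-16 data concatenated or converted with the wrong endianness somewhere upstream) is one possible cause worth ruling out.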

See also the error table at the end of this Amazon documentation page, which DOES talk about UTF-8 rather than Unicode:

http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html

Thanks,
Shawn
