On 8/5/2013 12:12 PM, Federico Chiacchiaretta wrote:
Hi Raymond,
I agree with you, 0xfffe is a special character, that is why I was asking
how it's handled in Solr.
In my document, 0xfffe does not appear at the beginning, it's in the
content.
I believe that 0xfffe is not a valid UTF-8 character, and its presence
indicates something is wrong with your postgres driver, server, or the
data in the database. I use a UTF-8 encoded MySQL database with Solr
and have no problems. I've used most versions between 1.4.0 and 4.4.0.
Although I'm sure that UTF-8 and Unicode are not exactly the same thing
for all characters, I think that for this particular case we can treat
them the same:
en.wikipedia.org/wiki/Specials_(Unicode_block)
Relevant excerpt: "FFFE and FFFF are not unassigned in the usual sense,
but guaranteed not to be a Unicode character at all. They can be used to
guess a text's encoding scheme, since any text containing these is by
definition not a correctly encoded Unicode text. The U+FEFF is Unicode's
byte-order mark, named "zero-width no-break space" (as inclusion of it
in text shall not be noticed). If this character is read in the wrong
byte order (for example, due to an endianness bug), it will read 0xFFFE,
which is illegal Unicode."
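To make the excerpt concrete, here is a rough Python sketch of the
byte-order mix-up it describes, plus one possible way to drop such code
points from text before it is sent for indexing. The helper names
(is_noncharacter, strip_noncharacters) are made up for illustration and
are not part of Solr or any database driver, and I'm assuming that
stripping is acceptable for your data; finding out where the character
comes from is probably the better fix.

```python
# U+FEFF (the byte-order mark) encoded as UTF-16 big-endian is the bytes FE FF.
bom_bytes = "\ufeff".encode("utf-16-be")

# Decoding the same two bytes with the wrong byte order yields U+FFFE.
swapped = bom_bytes.decode("utf-16-le")


def is_noncharacter(ch):
    """Return True for Unicode noncharacters.

    Noncharacters are U+FDD0..U+FDEF plus the last two code points of
    every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on).
    """
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strip_noncharacters(text):
    """Hypothetical cleanup step: remove noncharacters from a string."""
    return "".join(ch for ch in text if not is_noncharacter(ch))


if __name__ == "__main__":
    print(hex(ord(swapped)))
    print(strip_noncharacters("some\ufffecontent"))
```

Something along these lines could be run over a field's text before the
document is posted to Solr, assuming the stray character turns out to be
junk rather than a symptom of a wider encoding problem.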
See also the error table at the end of this Amazon documentation page,
which DOES talk about UTF-8 rather than Unicode:
http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html
Thanks,
Shawn