Re: Invalid UTF-8 character 0xfffe during shard update

Shawn Heisey Mon, 05 Aug 2013 11:56:14 -0700

On 8/5/2013 12:12 PM, Federico Chiacchiaretta wrote:

Hi Raymond,
I agree with you, 0xfffe is a special character, that is why I was asking
how it's handled in solr.
In my document, 0xfffe does not appear at the beginning, it's in the
content.

I believe that 0xfffe is not a valid UTF-8 character, and its presence indicates something is wrong with your postgres driver, server, or the data in the database. I use a UTF-8 encoded mysql database with Solr and have no problems. I've used most versions between 1.4.0 and 4.4.0.
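One way to narrow down where the character comes from is to scan the field values before they are sent to Solr. The following is only a minimal sketch in Python, not anything Solr or the database driver provides: the function names find_noncharacters and strip_noncharacters are hypothetical, and it assumes the content is already available as a decoded string. Python will encode U+FFFE to UTF-8 without complaint, so the check has to be explicit.

```python
# Hypothetical helpers (not part of Solr or any driver): locate or drop
# the code points U+FFFE and U+FFFF in already-decoded text.
NONCHARACTERS = (0xFFFE, 0xFFFF)


def find_noncharacters(text):
    """Return a list of (index, code point) pairs for U+FFFE / U+FFFF."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in NONCHARACTERS]


def strip_noncharacters(text):
    """Return text with every U+FFFE / U+FFFF removed."""
    return "".join(ch for ch in text if ord(ch) not in NONCHARACTERS)


if __name__ == "__main__":
    sample = "ab\ufffecd"  # made-up content with U+FFFE in the middle
    for index, code_point in find_noncharacters(sample):
        print("offset %d: U+%04X" % (index, code_point))
    print(strip_noncharacters(sample))
```

Stripping the character only hides the symptom. The offsets reported by the scan are more useful for working out which rows are affected and how the data got that way.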

Although I'm sure that UTF-8 and UNICODE are not exactly the same thing for all characters, I think that for this particular case we can treat them the same:


en.wikipedia.org/wiki/Specials_(Unicode_block)

Relevant excerpt: "FFFE and FFFF are not unassigned in the usual sense, but guaranteed not to be a Unicode character at all. They can be used to guess a text's encoding scheme, since any text containing these is by definition not a correctly encoded Unicode text. The U+FEFF is Unicode's byte-order mark, named "zero-width no-break space" (as inclusion of it in text shall not be noticed). If this character is read in the wrong byte order (for example, due to an endianness bug), it will read 0xFFFE, which is illegal Unicode."
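The byte-order mix-up described in that excerpt can be reproduced in a few lines. This is a small Python sketch for illustration only; the helper name misread_bom is made up, and nothing here says that this is what actually happened to the data in question.

```python
# Illustration of the endianness mix-up: a byte-order mark written as
# big-endian UTF-16 and then read back as little-endian UTF-16.
BOM = "\ufeff"


def misread_bom():
    """Encode U+FEFF as UTF-16 big-endian, then decode with the wrong byte order."""
    big_endian_bytes = BOM.encode("utf-16-be")  # the two bytes FE FF
    return big_endian_bytes.decode("utf-16-le")  # same bytes read as 0xFFFE


if __name__ == "__main__":
    swapped = misread_bom()
    print("U+%04X" % ord(swapped))
```

A U+FFFE in the middle of the content, rather than at the start, would be consistent with text from several sources being concatenated somewhere along the way, one of them carrying a misread byte-order mark, but that is a guess and not something the error message proves.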

See also the error table at the end of this amazon documentation page, which DOES talk about UTF-8 rather than Unicode:


http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html

Thanks,
Shawn
