char sets accepted via xml

Mark Cunningham Wed, 15 Jun 2011 05:10:53 -0700

Hi,

If you submit information to Solr using XML, does the server assume you're
using Unicode encoded as UTF-8? And does it accept the whole range of
possible Unicode characters (for example, characters that require
multiple bytes when encoded in UTF-8)?


I'm getting quite a few "Invalid UTF-8 middle byte 0x20 (at char #408, byte
#-1)" errors (with different bytes/characters each time). They seem to be
coming from characters such as the trademark symbol, the registered symbol,
or some characters that look like normal characters (such as a dash). The
dash, for example, comes out as the UTF-8 code units E2 80 93 according to
this very handy website:
http://rishida.net/tools/conversion/
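
To make the byte values above concrete, here is a minimal Python sketch
(not Solr code) that prints the UTF-8 bytes of the characters mentioned and
finds the first offset at which a byte string stops being valid UTF-8. The
helper names utf8_hex and first_invalid_utf8_offset are made up for this
illustration, and the "broken" input is a hand-built example of a lead byte
followed by a space, not the actual data that triggered the error.

```python
def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Characters mentioned in the post, written as escapes so the source
# file's own encoding cannot affect the result.
samples = {"en dash": "\u2013", "trademark": "\u2122", "registered": "\u00ae"}
for name, ch in samples.items():
    print(name, utf8_hex(ch))

# A correctly encoded string passes the check.
print(first_invalid_utf8_offset("a \u2013 b".encode("utf-8")))

# Hypothetical broken input: the lead byte E2 followed directly by a space
# (0x20), which is not a legal continuation ("middle") byte.
print(first_invalid_utf8_offset(b"a \xe2 b"))
```

Running a check like this over the exact bytes being posted, rather than
over the text as it looks in an editor, should show whether they are valid
UTF-8 at all.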

I tried inserting <?xml version="1.0" encoding="utf-8"?> at the start of the
XML; however, this didn't seem to make much difference.
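
One possible reason the declaration alone changes nothing: it only tells
the parser how to interpret the bytes, it does not re-encode them. The
sketch below illustrates this with Python's standard-library XML parser
standing in for Solr's (an assumption, since Solr uses its own Java
parser). The document content and the field name are invented for the
example, and Windows-1252 is only a guess at what a wrongly encoded
source might look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical update message; the field name "name" is an assumption.
DECL = '<?xml version="1.0" encoding="utf-8"?>'
DOC = DECL + '<add><doc><field name="name">Caf\u00e9 \u2013 Acme\u2122</field></doc></add>'


def parses(data):
    """Return True if the bytes parse as well-formed XML, else False."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# Same text, same declaration, two different byte encodings.
good = DOC.encode("utf-8")    # bytes match the declaration
bad = DOC.encode("cp1252")    # bytes contradict the declaration

print(parses(good))
print(parses(bad))
```

If the second case matches what is happening here, the fix would be in
whatever produces the bytes (file encoding, database export, HTTP client),
not in the XML header.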

Anyone else have these issues or know what they might be coming from?

Mark
