Per Jack's suggestion, I changed the XML declaration at the top of the .xml file to <?xml version="1.0" encoding="LATIN1"?> and it worked. Thanks so much, guys!
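For anyone who would rather keep the declaration as UTF-8, an alternative is to re-encode the file itself. Below is a minimal Python sketch, assuming the file's bytes really are Latin1; the function name latin1_xml_to_utf8 is made up for illustration and is not part of Solr or post.jar.

```python
def latin1_xml_to_utf8(raw: bytes) -> bytes:
    """Re-encode Latin1 XML bytes as UTF-8 (hypothetical helper).

    Assumption: `raw` really is Latin1. Running this on bytes that are
    already UTF-8 would double-encode the non-ASCII characters.
    """
    # Latin1 maps every byte 0x00-0xFF to a code point, so this never raises.
    text = raw.decode("latin-1")
    # If the declaration was switched to LATIN1, point it back at UTF-8
    # so it matches the bytes we are about to write.
    text = text.replace('encoding="LATIN1"', 'encoding="UTF-8"', 1)
    return text.encode("utf-8")


if __name__ == "__main__":
    # "\u00d1" is N with tilde; written as an escape to keep this source ASCII.
    sample = '<?xml version="1.0" encoding="LATIN1"?><f>NI\u00d1O</f>'
    converted = latin1_xml_to_utf8(sample.encode("latin-1"))
    print(converted)
```

To use it on a real file, read the bytes with open(path, "rb"), pass them through the function, and write the result back with open(path, "wb").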
________________________________
From: Shawn Heisey <s...@elyograg.org>
To: solr-user@lucene.apache.org
Sent: Monday, July 8, 2013 7:43 PM
Subject: Re: Indexing fails for docs with high Latin1 chars

On 7/8/2013 4:43 PM, John Randall wrote:
> I'm new to Solr, so I'm probably missing something. So far I've successfully
> indexed .xml docs with low Ascii chars. However when I try to add a doc that
> has Latin1 chars with diacritics, it fails. I've tried using the Jetty
> exampledocs post.jar, as well as using curl and directly from a browser. All
> three of the following methods work fine when the docs contain Ascii 32-126:
>
> From a browser:
> http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml
>
> Using cURL:
> curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"
>
> Using post.jar from exampledocs directory
> java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486
>
> java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml
>
> I've tried other things: e.g., I've added the following line to the Tomcat
> server.xml file, <Connector .../> section.
> URIEncoding="UTF-8"
>
> I've also copied some characters out of the utf8-example.xml file that came
> with the Jetty app. It still fails. I also changed the offending characters
> to their unicode equivalent: e.g., N with tilde to Ñ and Ñ without
> success. For N with tilde and e with acute I get the following message:
>
> HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
>
> ________________________________
>
> type Status report
> message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
> description The request sent by the client was syntactically incorrect.
>
> ________________________________
>
> Apache Tomcat/7.0.40
> The file I am trying to add is as follows:
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
> <doc>
> <field name="id">57917486</field>
> <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
> </doc>
> </add>

If I use your xml file (copy/paste from your email), changing the field names so it's compatible with my index, it works with Solr 4.4-SNAPSHOT:

[root@bigindy5 ~]# java "-Durl=http://localhost:8982/solr/s0live/update/" -jar /index/src/branch_4x/solr/example/exampledocs/post.jar input.xml
SimplePostTool version 1.5
Posting files to base url http://localhost:8982/solr/s0live/update/ using content-type application/xml..
POSTing file input.xml
1 files indexed.
COMMITting Solr index changes to http://localhost:8982/solr/s0live/update/..
Time spent: 0:00:01.385

One thing to note: Solr requires UTF-8 for its input. If anything in the chain (text editor, software outputting the XML, etc.) is using Latin1 rather than UTF-8, that could explain the problem.

The hex representation of the UTF-8 character for the N with a tilde accent is C3 91 -- two bytes. I have verified that this is what is in my XML file. I am betting that your file actually contains the Latin1 representation, which is a single byte. When interpreting that as UTF-8, the byte has the high bit set, so Java is expecting the next byte to finish out the character. The next byte is a capital O, or hex 4F, which matches your error message.

The entities that you are trying, like "&Ntilde;", are HTML entities. Those entities do not work in XML. XML has a very restricted list of valid entities, including &lt; which is the < character.

Perhaps if you used Jack's advice, but told it that it was Latin1 instead of UTF-8, it would convert the character to UTF-8 for you.

Thanks,
Shawn
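
The byte-level explanation above can be checked with a few lines of Python. This is only an illustrative sketch, not what Solr or Tomcat runs internally: it encodes the word "NIÑO" both ways, prints the bytes, and then decodes the Latin1 bytes as UTF-8 to show where decoding breaks. The helper name hex_bytes is made up for the example.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in data)


# "\u00d1" is N with tilde; written as an escape to keep this source ASCII.
text = "NI\u00d1O"
latin1_bytes = text.encode("latin-1")  # N with tilde is the single byte D1
utf8_bytes = text.encode("utf-8")      # N with tilde is the two bytes C3 91

print("Latin1:", hex_bytes(latin1_bytes))
print("UTF-8: ", hex_bytes(utf8_bytes))

# Decoding the Latin1 bytes as UTF-8 fails: D1 announces a two-byte
# sequence, but the byte after it is 4F (capital O), which is not a
# valid continuation byte.
decode_error = None
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    decode_error = err
    print(err.reason, "at byte offset", err.start)
    print("following byte:", hex(latin1_bytes[err.start + 1]))
```

The byte following the failed lead byte is 0x4f, the same value reported in the Tomcat "Invalid UTF-8 middle byte 0x4f" message.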