Re: Indexing fails for docs with high Latin1 chars

John Randall Mon, 08 Jul 2013 17:45:05 -0700

Per Jack's suggestion, I changed the declaration at the top of the .xml file to 
<?xml version="1.0" encoding="LATIN1"?> and it worked. Thanks so much, guys!

________________________________
From: Shawn Heisey <s...@elyograg.org>
To: solr-user@lucene.apache.org 
Sent: Monday, July 8, 2013 7:43 PM
Subject: Re: Indexing fails for docs with high Latin1 chars

On 7/8/2013 4:43 PM, John Randall wrote:
> I'm new to Solr, so I'm probably missing something. So far I've successfully 
> indexed .xml docs with low Ascii chars. However when I try to add a doc that 
> has Latin1 chars with diacritics, it fails. I've tried using the Jetty 
> exampledocs post.jar, as well as using curl and directly from a browser. All 
> three of the following methods work fine when the docs contain Ascii 32-126:
> 
>  From a browser:
> http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml
> 
> 
> Using cURL:
> curl 
> "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"
> 
> Using post.jar from exampledocs directory
> java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486
> 
> java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml
> 
> 
> I've tried other things: e.g., I've added the following line to the Tomcat 
> server.xml file, <Connector .../> section.
> URIEncoding="UTF-8"
> 
> I've also copied some characters out of the utf8-example.xml file that came 
> with the Jetty app. It still fails. I also changed the offending characters 
> to their unicode equivalent: e.g., N with tilde to Ñ and &Ntilde; without 
> success. For N with tilde and e with acute I get the following message:
> 
> HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
> 
> ________________________________
> 
> type Status report
> message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
> description The request sent by the client was syntactically incorrect.
> 
> ________________________________
> 
> Apache Tomcat/7.0.40
> The file I am trying to add is as follows:
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
> <doc>
>    <field name="id">57917486</field>
>    <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
>    </doc>
> </add>

If I use your xml file (copy/paste from your email), changing the field names 
so it's compatible with my index, it works with Solr 4.4-SNAPSHOT:

[root@bigindy5 ~]# java "-Durl=http://localhost:8982/solr/s0live/update/" -jar 
/index/src/branch_4x/solr/example/exampledocs/post.jar input.xml
SimplePostTool version 1.5
Posting files to base url http://localhost:8982/solr/s0live/update/ using 
content-type application/xml..
POSTing file input.xml
1 files indexed.
COMMITting Solr index changes to http://localhost:8982/solr/s0live/update/..
Time spent: 0:00:01.385

One thing to note: Solr requires UTF-8 for its input.  If anything in the chain 
(text editor, software outputting the XML, etc.) is using Latin1 rather than 
UTF-8, that could explain the problem.

The hex representation of the UTF-8 character for the N with a tilde accent is 
C3 91 -- two bytes.  I have verified that this is what is in my XML file.

I am betting that your file actually contains the Latin1 representation, which 
is the single byte D1.  In UTF-8, D1 is the lead byte of a two-byte sequence, so 
Java expects the next byte to be a continuation byte (hex 80 through BF).  The 
next byte is a capital O, or hex 4F, which matches your error message.
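
To make the byte-level difference concrete, here is a minimal Python sketch 
(Python is my choice for illustration, it is not part of Solr or post.jar). It 
encodes "NIÑO" both ways and then tries to read the Latin1 bytes as UTF-8. The 
helper names hex_bytes and first_utf8_error are made up for this example.

```python
# Compare how "NIÑO" is stored in UTF-8 and in Latin1 (ISO-8859-1).
text = "NIÑO"


def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def first_utf8_error(data: bytes):
    """Return (offset, bad byte, following byte) for the first UTF-8 decoding
    error, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        following = data[err.start + 1] if err.start + 1 < len(data) else None
        return err.start, data[err.start], following
    return None


utf8_bytes = text.encode("utf-8")      # Ñ becomes two bytes
latin1_bytes = text.encode("latin-1")  # Ñ becomes one byte

print(hex_bytes(utf8_bytes))
print(hex_bytes(latin1_bytes))
print(first_utf8_error(latin1_bytes))
```

If the file on disk looks like the second line in a hex editor, it is Latin1, 
and a strict UTF-8 decoder will stop at the D1 byte because the 4F after it is 
not a continuation byte.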

The entities that you are trying, like "&Ntilde;", are HTML entities. They do 
not work in XML unless a DTD declares them. XML predefines only five entities: 
&lt; &gt; &amp; &apos; and &quot;. Numeric character references such as &#209; 
or &#xD1; are valid in XML, though.
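
A quick way to see the difference between HTML entities and what XML accepts is 
to feed a few snippets to an XML parser. This is a sketch using the parser in 
Python's standard library as a stand-in (an assumption on my part, Solr uses a 
Java parser). The <field> wrapper and the helper name field_text are 
hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import ExpatError


def field_text(xml_text: str):
    """Parse a one-element XML snippet and return its text, or None if the
    parser rejects it."""
    try:
        return ET.fromstring(xml_text).text
    # ExpatError is caught as well, in case a parser build reports the
    # undefined entity at that level instead of as ParseError.
    except (ET.ParseError, ExpatError):
        return None


# Numeric character references (decimal or hex) need no DTD.
print(field_text("<field>NI&#209;O</field>"))
print(field_text("<field>NI&#xD1;O</field>"))

# One of the five predefined XML entities.
print(field_text("<field>a &lt; b</field>"))

# An HTML entity with no DTD declaring it: the parser rejects the document.
print(field_text("<field>NI&Ntilde;O</field>"))
```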

Perhaps if you follow Jack's advice, but declare the file's encoding as Latin1 
instead of UTF-8, the XML parser will convert the character to UTF-8 for you.
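
Here is a small sketch of why the declaration matters, again using Python's 
standard-library XML parser as a stand-in rather than Solr itself (so treat it 
as an illustration of the idea, not of Solr's exact behavior). The same Latin1 
bytes are parsed twice, once declared as ISO-8859-1 (the standard name for 
Latin1) and once declared as UTF-8. The helper names latin1_doc and 
read_descrip are made up for this example.

```python
import xml.etree.ElementTree as ET

BODY = ('<add><doc><field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER'
        '</field></doc></add>')


def latin1_doc(declared_encoding: str) -> bytes:
    """Build the document as Latin1 bytes, with whatever encoding name the
    XML declaration claims."""
    declaration = f'<?xml version="1.0" encoding="{declared_encoding}"?>\n'
    return (declaration + BODY).encode("latin-1")


def read_descrip(raw: bytes):
    """Return the descrip_fw text, or None if the bytes do not parse."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        return None
    return root.find("doc/field").text


# Declaration matches the real bytes: the parser decodes Ñ correctly.
print(read_descrip(latin1_doc("ISO-8859-1")))

# Declaration says UTF-8 but the bytes are Latin1: the parser rejects the file.
print(read_descrip(latin1_doc("UTF-8")))
```

The other fix is to leave the declaration as UTF-8 and re-save the file as 
UTF-8 in the text editor, so that the bytes match what the declaration says.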

Thanks,
Shawn
