Indexing fails for docs with high Latin1 chars

John Randall Mon, 08 Jul 2013 15:43:51 -0700

I'm new to Solr, so I'm probably missing something. So far I've successfully 
indexed .xml docs that contain only low ASCII characters. However, when I try 
to add a doc that has Latin-1 characters with diacritics, indexing fails. I've 
tried post.jar from the Jetty exampledocs directory, curl, and a direct request 
from a browser. All three of the following methods work fine when the docs 
contain only ASCII 32-126:


From a browser:
http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml


Using cURL:
curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"
 
Using post.jar from the exampledocs directory:
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486

java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml
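
A further way to send the file, not one of the three tried above, would be to 
POST the raw bytes with an explicit charset in the Content-Type header. The 
sketch below is only an assumption-laden illustration: the helper name 
build_update_request is made up here, and the request is built but not sent, 
since sending needs a running Solr instance.

```python
import urllib.request


def build_update_request(update_url: str, xml_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the XML document as its body."""
    return urllib.request.Request(
        update_url,
        data=xml_bytes,
        # Declares the body encoding explicitly instead of leaving it implied.
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical stand-in for the bytes read from 57917486.xml with open(path, "rb").
body = '<add><doc><field name="id">57917486</field></doc></add>'.encode("utf-8")
request = build_update_request("http://localhost:8080/solr/update?commit=true", body)

print(request.get_method())
print(request.get_header("Content-type"))

# To actually send it (requires the server to be up):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Reading the file in binary mode matters here: the bytes go to the server 
exactly as they are on disk, so any encoding problem in the file itself would 
still show up.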


I've tried other things as well. For example, I added the following attribute 
to the <Connector .../> element in the Tomcat server.xml file:
URIEncoding="UTF-8"
 
I've also copied some characters out of the utf8-example.xml file that came 
with the Jetty app. It still fails. I also changed the offending characters to 
their Unicode equivalents (e.g., N with tilde to &#209; and &Ntilde;), without 
success. For N with tilde and e with acute I get the following message:

HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)

________________________________

type Status report
message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
description The request sent by the client was syntactically incorrect.

________________________________

Apache Tomcat/7.0.40

The file I am trying to add is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
   <field name="id">57917486</field>
   <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
  </doc>
</add> 
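
One way to check whether the file on disk is really UTF-8, as its XML 
declaration claims, is to look at the raw bytes. The sketch below is only an 
illustration under assumptions: the function name first_invalid_utf8 is made up 
here, and the sample line stands in for the descrip_fw line of the file above. 
If an editor saved the file as Latin-1, N with tilde is the single byte 0xD1, 
which a UTF-8 decoder treats as the start of a two-byte sequence. The next 
byte, 'O' (0x4F), is then not a valid continuation byte, which would be 
consistent with the "middle byte 0x4f" in the message.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) where UTF-8 decoding first fails, else None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Stand-in for the descrip_fw line; "\u00d1" is N with tilde, written as an
# escape so this script itself stays plain ASCII.
line = '<field name="descrip_fw">NI\u00d1O VOLANTE YOUNG FLYER</field>'

as_utf8 = line.encode("utf-8")      # N with tilde becomes bytes 0xC3 0x91
as_latin1 = line.encode("latin-1")  # N with tilde becomes the single byte 0xD1

print(first_invalid_utf8(as_utf8))    # None, the bytes are valid UTF-8
print(first_invalid_utf8(as_latin1))  # offset and value of the 0xD1 byte

# The byte that follows 0xD1 in the Latin-1 version is 'O'.
bad_offset = as_latin1.index(b"\xd1")
print(hex(as_latin1[bad_offset + 1]))

# Re-encoding: interpret the bytes as Latin-1, then write them out as UTF-8.
repaired = as_latin1.decode("latin-1").encode("utf-8")
print(first_invalid_utf8(repaired))
```

To run the same check on the real file, read it with open(path, "rb").read() 
and pass the result to the function. Whether the file here was in fact saved 
as Latin-1 is an assumption that this check would confirm or rule out.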


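For the character-reference attempt mentioned earlier, generating the 
references programmatically avoids typing mistakes. The sketch below is a 
hedged illustration: to_ascii_xml is a made-up helper name, and the sample text 
is taken from the descrip_fw value. It also shows that a numeric reference such 
as &#209; is resolved by an XML parser, while a named HTML entity such as 
&Ntilde; is not predefined in XML and is rejected unless a DTD declares it.

```python
import xml.etree.ElementTree as ET


def to_ascii_xml(text: str) -> str:
    """Replace each non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# "\u00d1" is N with tilde, written as an escape so this script is plain ASCII.
original = "NI\u00d1O VOLANTE YOUNG FLYER"
escaped = to_ascii_xml(original)
print(escaped)

# A numeric character reference is resolved by the XML parser.
field = ET.fromstring('<field name="descrip_fw">' + escaped + "</field>")
print(field.text == original)

# A named HTML entity is not predefined in XML, so parsing fails.
try:
    ET.fromstring("<field>NI&Ntilde;O</field>")
    named_entity_ok = True
except ET.ParseError:
    named_entity_ok = False
print(named_entity_ok)
```

A file rewritten this way contains only ASCII bytes, so it reads the same 
whether it is treated as Latin-1 or as UTF-8.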

My schema.xml file contains the following field types:
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

  <!-- For descrip_fw field (and trailing wildcard searches): -->
  <fieldType name="search_fw" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory"
                  mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
              maxGramSize="20" side="front"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- For leading wildcard searches, I've added the following field type,
       which is populated through a copy field: -->
  <fieldType name="search_rev" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory"
                  mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
              maxGramSize="20" side="back"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>



My schema.xml file contains the following pertinent fields:
  <field name="id" type="string" indexed="true" stored="true"
         required="true"/>
  <field name="descrip_fw" type="search_fw" indexed="true" stored="false"
         required="false"/>
  <copyField source="descrip_fw" dest="descrip_rev"/>


Also, I am using Tomcat as the container on a Windows XP SP3 machine.
As I said, this all works as long as the docs contain no high Latin-1 
characters.

I'd appreciate any ideas you may have.
