I'm new to Solr, so I'm probably missing something. So far I've successfully 
indexed .xml docs that contain only low ASCII characters. However, when I try 
to add a doc that has Latin-1 characters with diacritics, it fails. I've tried 
post.jar from the Jetty exampledocs directory, curl, and a direct request from 
a browser. All three of the following methods work fine when the docs contain 
only ASCII 32-126:

From a browser:
http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml


Using cURL:
curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"
 
Using post.jar from the exampledocs directory:
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486

java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml


I've also tried other things. For example, I added the following attribute to 
the <Connector .../> element in the Tomcat server.xml file:
URIEncoding="UTF-8"
 
I've also copied some characters out of the utf8-example.xml file that came 
with the Jetty app, and it still fails. I also replaced the offending 
characters with character references (e.g., N with tilde as &#209; and as 
&Ntilde;), without success. For N with tilde and e with acute I get the 
following message:

HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)

type: Status report
message: Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
description: The request sent by the client was syntactically incorrect.

Apache Tomcat/7.0.40

The file I am trying to add is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
   <field name="id">57917486</field>
   <field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
  </doc>
</add> 
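
One way to check how a file is actually encoded on disk is to look at its raw 
bytes. The following is a minimal Python sketch, not part of Solr or Tomcat; 
the helper name find_invalid_utf8 is hypothetical. It illustrates that N with 
tilde saved as Latin-1 is the single byte 0xD1, which a UTF-8 parser reads as 
the start of a two-byte sequence, so the O that follows (0x4F) is rejected as 
an invalid middle byte, as in the message above. Whether this is what happened 
to this particular file is an assumption.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) where UTF-8 decoding first fails, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# "\u00d1" is N with tilde, written as an escape so the script itself is ASCII.
text = "NI\u00d1O VOLANTE YOUNG FLYER"

latin1_bytes = text.encode("latin-1")  # N with tilde becomes one byte: 0xD1
utf8_bytes = text.encode("utf-8")      # N with tilde becomes two bytes: 0xC3 0x91

print(find_invalid_utf8(latin1_bytes))  # where decoding as UTF-8 fails
print(find_invalid_utf8(utf8_bytes))    # None, because the bytes are valid UTF-8
print(hex(latin1_bytes[3]))             # the byte right after N with tilde
```

To test a real file, read it in binary mode (for example 
open("57917486.xml", "rb").read()) and pass the bytes to the helper.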



My schema.xml file contains the following field types:
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

  <!-- For descrip_fw field (and trailing wildcard searches): -->
  <fieldType name="search_fw" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- For leading wildcard searches, I've added the following field type,
       which is populated using a copy field: -->
  <fieldType name="search_rev" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="back"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>



My schema.xml file contains the following pertinent fields:
   <field name="id" type="string" indexed="true" stored="true" required="true"/>
   <field name="descrip_fw" type="search_fw" indexed="true" stored="false" required="false"/>
   <copyField source="descrip_fw" dest="descrip_rev"/>


Also, I am using Tomcat as the container on a Windows XP SP3 machine.
As I said, this all works as long as the docs contain no high Latin-1 
characters.

I'd appreciate any ideas you may have.
