RE: UTF-8 support during indexing content

Van Tassell, Kristian Wed, 01 Feb 2012 08:38:37 -0800

Travis and all,

This is solved and was not directly a Solr issue. I'll note the solution here 
in case anyone makes the same mistake. The documents are UTF-8 and the source 
documents are converted via XSLT. They look good up to that point.

First off, based on some other recommendations I found, I changed the 
Tomcat <Connector> element to include the URIEncoding="UTF-8" setting.

The primary problem, however, was that the data (mydata below) was read in 
without an encoding designation.

DirectXmlRequest up = new DirectXmlRequest( "/update", mydata );

The stream was previously read incorrectly, since FileReader uses the platform 
default encoding rather than UTF-8:

BufferedReader reader = new BufferedReader(new FileReader(filePath));

I've since changed this and am now getting the intended result.

InputStreamReader reader = new InputStreamReader(new FileInputStream(filePath), 
"UTF-8");

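For anyone who wants the whole read in one place, here is a minimal sketch of loading the file into a String with an explicit UTF-8 decoder. The class and method names (Utf8FileReader, readUtf8) are made up for illustration and are not part of SolrJ, and StandardCharsets and try-with-resources assume Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of SolrJ: reads a whole file as UTF-8 text.
class Utf8FileReader {

    static String readUtf8(String filePath) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Name the charset explicitly; FileReader would fall back to the
        // platform default encoding instead.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath),
                                      StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            // Copy decoded characters until end of stream.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

The resulting String can then be handed to the request as before, e.g. new DirectXmlRequest( "/update", Utf8FileReader.readUtf8(filePath) ).
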
Thanks,
Kristian

-----Original Message-----
From: Travis Low [mailto:t...@4centurion.com] 
Sent: Wednesday, February 01, 2012 8:27 AM
To: solr-user@lucene.apache.org
Subject: Re: UTF-8 support during indexing content

Are you sure the input document is in UTF-8?  That looks like classic
ISO-8859-1-treated-as-UTF-8.

How did you confirm the document contains the right quote marks immediately
prior to uploading?  If you just visually inspected it, then use whatever
tool you viewed it in to see what the character set is.
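
One rough way to check for this kind of mix-up is to reproduce it on purpose: encode the text as UTF-8 and decode the bytes with a single-byte charset. The sketch below assumes windows-1252 was the wrong charset involved (a guess based on the garbage shown in the original message), and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reproduces mojibake from a charset mismatch.
class MojibakeDemo {

    // Encode as UTF-8, then decode the same bytes as windows-1252
    // (assumed to be the charset that was wrongly applied).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Encode and decode with the same charset; the text survives intact.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

A right double quotation mark (U+201D) is three bytes in UTF-8, so misdecode turns it into three characters, the first two being "â€". If the indexed text shows that pattern, the bytes were most likely valid UTF-8 that some step decoded with the wrong charset.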

cheers,
Travis

On Wed, Feb 1, 2012 at 9:17 AM, Van Tassell, Kristian <
kristian.vantass...@siemens.com> wrote:

> Hello everyone,
>
> I have a question that I imagine has been asked many times before, so I
> apologize for the repeat.
>
> I have a basic text field with the following text:
>        the word ”stemming” in quotes
>
> Uploading the data yields no errors; however, when it is indexed, the text
> looks like this:
>
> the word â€�stemmingâ€� in quotes
>
>
> Searching for the word stemming, without quotes or otherwise, does not
> return any hits.
>
> Just some basic facts:
>
> - I included the solr.CollationKeyFilterFactory filter on the fieldType.
> - Updating the index is done via a "solr xml" document. I've confirmed
> that the document contains the right quote marks immediately prior to
> uploading.
> - Updating the index is done via solrj, essentially:
>        DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
>        solrServer.request( up );
>        solrServer.commit();
> - In solr admin, the characters look like garbage, surrounding the word
> stemming (as shown above)
>
>
> Thanks in advance for any details you can provide!
> -Kristian
>