I had a similar problem a few days ago and found that the documents were not 
being loaded correctly as UTF-8 into Solr.  In my case, the loader program was 
a Java .jar I was executing from a cron job.  I added the file.encoding flag 
to that command:

java -Dfile.encoding=UTF-8 -jar /home/tim/solr/bin/loadSiteSearch.jar

Then, within that program, I wrote a function to take the strings I was 
loading and expressly declare them as UTF-8 like this:

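// Encodes the string with the JVM default charset, then decodes those bytes
// as UTF-8. The String constructor taking a charset name throws a checked
// exception, so it has to be declared (or caught).
private String toUTF8(String value) throws java.io.UnsupportedEncodingException
{
        return new String(value.getBytes(), "UTF-8");
}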

and that solved the problem for me.
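
For what it's worth, here is a minimal sketch of why the charset matters. It is not taken from my loader: the class name CharsetDemo and the helper roundTrip are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It encodes a short Greek string to bytes and decodes it again, once with UTF-8 and once with ISO-8859-1, which has no Greek letters. I can't say this is what is happening inside SolrJ in your case, only that it produces the same kind of question marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Encodes the text to bytes with the given charset, then decodes
    // those bytes back into a String with the same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Three Greek letters, written as escapes so the encoding of
        // this source file does not affect the demo.
        String greek = "\u03A4\u03B7\u03C2";

        // Shows which charset the no-argument getBytes() would use.
        System.out.println("default charset: " + Charset.defaultCharset());

        // UTF-8 can represent Greek, so the round trip is lossless.
        System.out.println(roundTrip(greek, StandardCharsets.UTF_8).equals(greek));

        // ISO-8859-1 cannot, so each letter is replaced during encoding.
        System.out.println(roundTrip(greek, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset explicitly, as roundTrip does, avoids depending on the JVM default, which is the thing the -Dfile.encoding flag changes.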

Tim

-----Original Message-----
From: Hugh Cayless [mailto:philomou...@gmail.com] 
Sent: Friday, May 28, 2010 12:51 PM
To: solr-user@lucene.apache.org
Subject: SolrJ Unicode problem

Hi, I'm a solr newbie, and I'm hoping someone can point me in the right 
direction.

I'm trying to index a bunch of documents with Greek text in them.  I can 
successfully index documents by generating add xml and using curl to send them 
to my server, but when I use solrj to create and send documents, the encoding 
gets thoroughly messed up.


Instead of the result (from an add doc posted with curl):

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">c.etiq.mom;;2077</str>
    <str name="transcription">Της Βησο ς Χρη εις Πανοπολίτης</str>
  </doc>
</result>

I get (from a SolrInputDocument loaded with solrj):

<result name="response" numFound="1" start="0"> 
 <doc> 
  <str name="id">c.etiq.mom;;2077</str> 
  <str name="transcription">??? ???? ? ??? ??? ????�??????</str> 
 </doc> 
</result>

I can confirm that the SolrInputDocument's transcription field contains Greek 
text before I call .add(documents) on the StreamingUpdateSolrServer (i.e., I 
can get Greek back out of it).  So I don't know what to do next.  Any ideas?

Thanks,
Hugh
