I had a similar problem a few days ago and found that the documents were not being loaded into Solr correctly as UTF-8. In my case, the loader program was a Java .jar that I was executing from a cron job. I added the -Dfile.encoding flag to the command there:

    java -Dfile.encoding=UTF-8 -jar /home/tim/solr/bin/loadSiteSearch.jar

Then, within that program, I wrote a function to take the strings I was loading and expressly declare them as UTF-8, like this:

    private String toUTF8(String value) throws java.io.UnsupportedEncodingException {
        // getBytes() encodes with the JVM default charset, which the
        // -Dfile.encoding flag above sets to UTF-8
        return new String(value.getBytes(), "UTF-8");
    }

and that solved the problem for me.

Tim

-----Original Message-----
From: Hugh Cayless [mailto:philomou...@gmail.com]
Sent: Friday, May 28, 2010 12:51 PM
To: solr-user@lucene.apache.org
Subject: SolrJ Unicode problem

Hi,

I'm a solr newbie, and I'm hoping someone can point me in the right direction. I'm trying to index a bunch of documents with Greek text in them. I can successfully index documents by generating add xml and using curl to send them to my server, but when I use solrj to create and send documents, the encoding gets thoroughly messed up.

Instead of the result (from an add doc posted with curl):

    <result name="response" numFound="1" start="0">
      <doc>
        <str name="id">c.etiq.mom;;2077</str>
        <str name="transcription">Της Βησο ς Χρη εις Πανοπολίτης</str>
      </doc>
    </result>

I get (from a SolrInputDocument loaded with solrj):

    <result name="response" numFound="1" start="0">
      <doc>
        <str name="id">c.etiq.mom;;2077</str>
        <str name="transcription">??? ???? ? ??? ??? ????�??????</str>
      </doc>
    </result>

I can confirm that the SolrInputDocument's transcription field contains Greek text before I call .add(documents) on the StreamingUpdateSolrServer (i.e., I can get Greek back out of it). So I don't know what to do next. Any ideas?

Thanks,
Hugh
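
For anyone reading this thread later: the question marks in the solrj result above are consistent with the text having been encoded, at some point, with a charset that has no Greek letters. Below is a minimal sketch of that effect. It is not the actual loader or SolrJ code. The class and method names (Utf8RoundTrip, roundTrip, viaUtf8, viaAscii) are made up for illustration, and US-ASCII merely stands in for whatever non-UTF-8 default charset a JVM might be running with.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encode the string to bytes with the given charset, then decode
    // those bytes with the same charset.
    static String roundTrip(String value, Charset charset) {
        return new String(value.getBytes(charset), charset);
    }

    // Hypothetical helpers: one round trip per charset.
    static String viaUtf8(String value) {
        return roundTrip(value, StandardCharsets.UTF_8);
    }

    static String viaAscii(String value) {
        return roundTrip(value, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The first word of the transcription, written with escapes so the
        // example does not depend on the source file's own encoding.
        String greek = "\u03A4\u03B7\u03C2";

        // UTF-8 can represent Greek, so the round trip is lossless.
        System.out.println(viaUtf8(greek).equals(greek)); // true

        // US-ASCII cannot; each unmappable character is replaced by '?'.
        System.out.println(viaAscii(greek)); // ???
    }
}
```

Naming the charset explicitly on both the encode and decode side, as roundTrip does, keeps the result independent of the JVM default, whereas the no-argument getBytes() depends on it.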