On 9/27/2012 2:55 PM, vybe3142 wrote:
> Our SOLR setup (4.0.BETA on Tomcat 6) works as expected when indexing UTF-8
> files. Recently, however, we noticed that it has issues with indexing
> certain text files, e.g. UTF-16 files.

I'd wait for a yes/no vote on this from one of the actual experts on this mailing list rather than just taking my word for it. Here is my guess based on what I know:

Solr uses and expects UTF-8. If the program you are using to index the files (which you didn't specify) is capable of working in more than one character set, you should be able to make it work. To do so, it must be aware that it is reading UTF-16 on the input and translate it (either implicitly or explicitly) into UTF-8 when it sends the data to Solr. Your results suggest that the program is assuming UTF-8 on the input, perhaps because the encoding of a plain text file can't be reliably detected on its own, so if it does support multiple character sets, you may have to tell it explicitly what it's reading.
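For example (and this is only a rough sketch, not something I've run against your setup, since you didn't say what your indexing program is), if the indexer were a SolrJ client, the key step would look something like the following. The Solr URL, field names, and file path are all made up:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf16IndexExample {
    public static void main(String[] args) throws Exception {
        // Read the file with the charset it was actually written in (UTF-16 here),
        // not the platform default. Once the text is in a Java String, the encoding
        // problem is gone -- SolrJ serializes the request to Solr as UTF-8.
        byte[] raw = Files.readAllBytes(Paths.get("/path/to/file.txt"));
        String text = new String(raw, StandardCharsets.UTF_16);

        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "file-1");
        doc.addField("text", text);
        server.add(doc);
        server.commit();
    }
}

The important part is decoding the bytes with the charset the file was actually written in; whatever tool you're using needs an equivalent option if it can't detect UTF-16 on its own.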

I have no idea if the typical way of reading text/Word/PDF/other documents (which I think is SolrCell / Tika) can do this, as I have never used it. The data for my Solr index comes from MySQL, which is working entirely in UTF-8.

Thanks,
Shawn
