On 9/27/2012 2:55 PM, vybe3142 wrote:
> Our SOLR setup (4.0.BETA on Tomcat 6) works as expected when indexing UTF-8
> files. Recently, however, we noticed that it has issues with indexing
> certain text files, e.g. UTF-16 files.

I'd wait for a yes/no vote on this from one of the actual experts on this mailing list rather than just taking my word for it. Here is my guess based on what I know:

Solr uses and expects UTF-8. If the program you are using to index the files (which you didn't specify) is capable of working in more than one character set, you should be able to make it work. To do so, it must be aware that it is reading UTF-16 on the input and translate it (either implicitly or explicitly) into UTF-8 when it sends the data to Solr. Your results suggest that the program is assuming UTF-8 on the input, perhaps because the encoding of a plain text file can't be reliably detected on its own, so if it does support multiple character sets, you may have to tell it explicitly what it's reading.
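For example (and this is only a rough sketch, not something I've run against your setup, since you didn't say what your indexing program is), if the indexer were a SolrJ client, the key step would look something like the following. The Solr URL, field names, and file path are all made up:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf16IndexExample {
    public static void main(String[] args) throws Exception {
        // Read the file with the charset it was actually written in (UTF-16 here),
        // not the platform default. Once the text is in a Java String, the encoding
        // problem is gone -- SolrJ serializes the request to Solr as UTF-8.
        byte[] raw = Files.readAllBytes(Paths.get("/path/to/file.txt"));
        String text = new String(raw, StandardCharsets.UTF_16);

        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "file-1");
        doc.addField("text", text);
        server.add(doc);
        server.commit();
    }
}

The important part is decoding the bytes with the charset the file was actually written in; whatever tool you're using needs an equivalent option if it can't detect UTF-16 on its own.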

I have no idea if the typical way of reading text/Word/PDF/other documents (which I think is SolrCell / Tika) can do this, as I have never used it. The data for my Solr index comes from MySQL, which is working entirely in UTF-8.

Thanks,
Shawn
