This is supposed to be dealt with outside the index, before the documents are sent for indexing. All input must be UTF-8 encoded. Sending input in any other encoding will give unexpected results.
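As a rough illustration of handling this outside the index, here is a minimal Python sketch that checks each document before it is sent for indexing. The function names and the fallback encoding are my own assumptions, not anything Solr provides. You would need to know (or guess) what encoding your third parties actually use.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(raw: bytes, fallback_encoding: str = "latin-1") -> bytes:
    """Return UTF-8 bytes for a document.

    Valid UTF-8 is passed through unchanged. Anything else is assumed
    (hypothetically) to be in fallback_encoding and is transcoded. If the
    fallback decode fails too, UnicodeDecodeError propagates so the caller
    can reject the document.
    """
    if is_valid_utf8(raw):
        return raw
    return raw.decode(fallback_encoding).encode("utf-8")
```

A document that fails both decodes raises an exception, which gives you the "reject" behaviour on the client side. Note that a byte sequence can be valid UTF-8 and still be the wrong text if it was mislabelled upstream, so this catches illegal sequences only.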
> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
>
> 1) Is there any way to be sure during indexing (by setting something
> in the solr configuration?) that the documents that we index will
> always be stored in utf-8? Can solr convert documents that need
> converting on the fly, or can solr reject documents containing illegal
> characters?
>
> 2) Is there a way to scan the existing index to find any string
> containing non-utf8 characters? Or is there another way that I can
> discover if any crept into my index?