Scanning for only 'valid' UTF-8 is definitely not simple. You can
rule out some obviously invalid input by byte range alone, but byte
ranges alone can't confirm that something is valid UTF-8: certain bytes
are only valid when they come directly before or after certain other bytes.
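
To make that concrete, here is a minimal sketch, assuming you still have
the raw bytes somewhere (for example before a document is sent to Solr).
The class and method names are made up for illustration and it uses only
the JDK. The first method is a pure byte-range test; the second runs a
strict decoder, which also checks the order of the bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Solr or icu4j.
final class Utf8Check {

    // Byte-range test only: rejects bytes that can never occur in UTF-8
    // (0xC0, 0xC1, 0xF5..0xFF). Passing this does NOT prove validity.
    static boolean passesByteRangeCheck(byte[] bytes) {
        for (byte b : bytes) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == 0xC0 || v == 0xC1 || v >= 0xF5) {
                return false;
            }
        }
        return true;
    }

    // Full test: a strict decoder also checks that lead bytes and
    // continuation bytes appear in a legal order.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC3, (byte) 0xA9}; // "é" encoded as UTF-8
        byte[] bad = {(byte) 0xC3, (byte) 0x28};  // lead byte followed by '('
        System.out.println(passesByteRangeCheck(good) + " " + isValidUtf8(good));
        System.out.println(passesByteRangeCheck(bad) + " " + isValidUtf8(bad));
    }
}
```

The second sequence passes the range test because each byte is legal on
its own, but the strict decoder rejects it because 0xC3 must be followed
by a continuation byte.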

There is no good way to do what you're trying to do. Once you've lost
track of what encoding something is in, you are reduced to applying
heuristics to the text to guess what encoding it was meant to be in.
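
A small illustration of why this ends up as guessing, again a sketch
using only the JDK with a made-up class name: the same bytes can decode
cleanly under more than one charset, so a successful decode tells you
nothing about which encoding was intended.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Solr or icu4j.
final class EncodingAmbiguity {

    // Returns charset name -> decoded text for every candidate charset
    // that can decode the bytes without reporting an error.
    static Map<String, String> strictDecodings(byte[] bytes, Charset... candidates) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            try {
                String text = cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes))
                        .toString();
                result.put(cs.name(), text);
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset; skip it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The bytes of "café" when encoded as UTF-8.
        byte[] bytes = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};
        System.out.println(strictDecodings(bytes,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Both candidates decode without error here: UTF-8 gives "café" and
ISO-8859-1 gives "cafÃ©". ISO-8859-1 in fact accepts any byte sequence at
all, so only a heuristic (or a human) can say which reading was meant.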

There is no cheap way to do this to an entire Solr index either. You're
just going to have to fetch every single stored field (indexed fields are
pretty much lost to you) and apply heuristic algorithms to each one. Keep
in mind that Solr really shouldn't be used as your canonical _store_ of
data; Solr isn't a 'store', it's an index. So you ought to have this
stuff stored somewhere else if you want to be able to examine or modify
it like this, and deal with it there. This isn't really a Solr question
at all, even if you are querying Solr on stored fields to try to guess
their char encodings.

There are various packages of such heuristic algorithms for guessing
char encoding; I wouldn't try to write my own. icu4j might include such
an algorithm, not sure.

On 1/13/2011 1:12 PM, Peter Karich wrote:
> Also take a look at icu4j, which is one of the contrib projects ...
>
> Converting on the fly is not supported by Solr, but should be
> relatively easy in Java.
> Scanning is also relatively simple (accept only a range). So is detection:
> http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. We want the index to only contain UTF-8
>> encoded characters. I have a couple questions about this:
>>
>> 1) Is there any way to be sure during indexing (by setting something
>> in the solr configuration?) that the documents that we index will
>> always be stored in utf-8? Can solr convert documents that need
>> converting on the fly, or can solr reject documents containing illegal
>> characters?
>>
>> 2) Is there a way to scan the existing index to find any string
>> containing non-utf8 characters? Or is there another way that I can
>> discover if any crept into my index?