Re: verifying that an index contains ONLY utf-8

Jonathan Rochkind Thu, 13 Jan 2011 11:05:40 -0800

Scanning for only 'valid' utf-8 is definitely not simple. You can eliminate some obviously not valid utf-8 things by byte ranges, but you can't confirm valid utf-8 by byte ranges alone. There are some bytes that can only come after or before certain other bytes to be valid utf-8.
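To illustrate why a range check is not enough, here is a minimal sketch using the JDK's strict charset decoder, assuming you still have the raw bytes in hand. The class and method names (Utf8Check, isValidUtf8) are made up for this example. Note that passing this check only shows the bytes are well-formed utf-8, not that they were meant to be utf-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or any library.
public class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input instead of
        // silently substituting U+FFFD, which is the default for
        // new String(bytes, charset).
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the two bytes 0xC3 0x28 are each perfectly legal somewhere in a utf-8 stream, but 0xC3 must be followed by a continuation byte (0x80 to 0xBF), so the pair is rejected. A per-byte range filter would let it through.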

There is no good way to do what you're doing. Once you've lost track of what encoding something is in, you are reduced to applying heuristics to text strings to guess what encoding it is meant to be.

There is no cheap way to do this to an entire Solr index; you're just going to have to fetch every single stored field (indexed fields are pretty much lost to you) and apply heuristic algorithms to it. Keep in mind that Solr probably shouldn't ever be used as your canonical _store_ of data; Solr isn't a 'store', it's an index. So you really ought to have this stuff stored somewhere else if you want to be able to examine it or modify it like this, and just deal with it there. This isn't really a Solr question at all, even if you are querying Solr on stored fields to try and guess their char encodings.
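As a rough idea of what such a heuristic looks like once a stored value has already been decoded into a Java String, here is a sketch of two cheap checks: looking for U+FFFD (the replacement character a lenient decoder leaves behind), and testing whether the string looks like utf-8 bytes that were misread as ISO-8859-1. The class and method names are invented for this example, and it covers only that one failure pattern (it will miss Windows-1252 mix-ups, for instance), so treat it as an illustration rather than a substitute for an established detection library.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or any library.
public class StoredValueCheck {

    /** True if a lenient decoder already replaced bad bytes with U+FFFD. */
    public static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    /**
     * Heuristic: true if the string contains non-ASCII characters and,
     * when turned back into ISO-8859-1 bytes, those bytes are well-formed
     * UTF-8. That usually means UTF-8 text was decoded as ISO-8859-1.
     */
    public static boolean looksLikeUtf8ReadAsLatin1(String s) {
        boolean nonAscii = false;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                nonAscii = true;
                break;
            }
        }
        if (!nonAscii) {
            return false; // pure ASCII is unaffected by this mix-up
        }
        try {
            // Fails if any character is outside ISO-8859-1.
            ByteBuffer latin1 = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            // Fails if the recovered bytes are not well-formed UTF-8.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(latin1);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

You would run checks like these over each stored field value you fetch and flag hits for a human to review; a false positive is possible whenever legitimate ISO-8859-1 text happens to form a valid utf-8 byte sequence.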

There are various packages of such heuristic algorithms to guess char encoding; I wouldn't try to write my own. icu4j might include such an algorithm, not sure.


On 1/13/2011 1:12 PM, Peter Karich wrote:

  take a look also into icu4j which is one of the contrib projects ...

converting on the fly is not supported by Solr but should be relatively
easy in Java.
Also scanning is relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html

We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the documents that we index will
always be stored in utf-8? Can solr convert documents that need
converting on the fly, or can solr reject documents containing illegal
characters?

2) Is there a way to scan the existing index to find any string
containing non-utf8 characters? Or is there another way that I can
discover if any crept into my index?
