RE: verifying that an index contains ONLY utf-8
So you're allowed to put the entire original document in a stored field in Solr, but you aren't allowed to stick it in, say, a redis or couchdb too? Ah, bureaucracy. But there's no reason what you are doing won't work, as you of course already know from doing it.

If you actually know the charset of a document when indexing it, you might want to consider putting THAT in a stored field; it's easier to keep track of the encoding you know than to try to guess it again later.

From: Paul [p...@nines.org]
Sent: Thursday, January 13, 2011 6:21 PM
To: solr-user@lucene.apache.org
Subject: Re: verifying that an index contains ONLY utf-8

Thanks for all the responses. CharsetDetector does look promising.

Unfortunately, we aren't allowed to keep the original of much of our data, so the Solr index is the only place it exists (to us). I do have a Java app that "reindexes", i.e., reads all documents out of one index, does some transform on them, then writes them to a second index. So I already have a place where I see all the data in the index stream by. I wanted to make sure there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check whether the string is even a possible UTF-8 string and leave it alone if it is. That way I won't be introducing more errors, and maybe I can detect a large percentage of the non-UTF-8 strings.

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir wrote:
> it does:
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> this takes a sample of the file and makes a guess.
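For illustration, a minimal SolrJ sketch of that suggestion, assuming a hypothetical stored field named original_charset in the schema and a locally running Solr; the URL, id, and field names are placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWithCharset {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("text", "the full original text, already decoded to a Java String");
            // Record the charset the bytes actually arrived in, so it never has to be guessed later.
            doc.addField("original_charset", "ISO-8859-1");

            server.add(doc);
            server.commit();
        }
    }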
Re: verifying that an index contains ONLY utf-8
Thanks for all the responses. CharsetDetector does look promising.

Unfortunately, we aren't allowed to keep the original of much of our data, so the Solr index is the only place it exists (to us). I do have a Java app that "reindexes", i.e., reads all documents out of one index, does some transform on them, then writes them to a second index. So I already have a place where I see all the data in the index stream by. I wanted to make sure there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check whether the string is even a possible UTF-8 string and leave it alone if it is. That way I won't be introducing more errors, and maybe I can detect a large percentage of the non-UTF-8 strings.

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir wrote:
> it does:
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> this takes a sample of the file and makes a guess.
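For reference, a small sketch of how the ICU4J CharsetDetector mentioned above might be run over raw bytes during such a reindexing pass; it returns a guess with a confidence score, not a proof:

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    public class GuessCharset {
        // Returns the detector's best guess for the encoding of some raw bytes,
        // or null if nothing plausible is found. Short or ambiguous input can fool it.
        public static String guess(byte[] raw) {
            CharsetDetector detector = new CharsetDetector();
            detector.setText(raw);
            CharsetMatch match = detector.detect();
            if (match == null) {
                return null;
            }
            // getConfidence() is 0-100; treat low scores with suspicion.
            System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
            return match.getName();
        }
    }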
Re: verifying that an index contains ONLY utf-8
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind wrote:
> There are various packages of such heuristic algorithms to guess char
> encoding, I wouldn't try to write my own. icu4j might include such an
> algorithm, not sure.

it does:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html

this takes a sample of the file and makes a guess.

Also, in general keep in mind that Java CharsetDecoders tend to silently replace or skip illegal chars rather than throw exceptions. If you want to be "paranoid" about these things, then instead of opening an InputStreamReader with a Charset, open it with something like:

charset.newDecoder().onMalformedInput(CodingErrorAction.REPORT).onUnmappableCharacter(CodingErrorAction.REPORT)

Then, if the decoder ends up in some illegal state/byte sequence, it will throw an exception instead of silently replacing with U+FFFD.

Of course, as Jonathan says, you cannot "confirm" that something is UTF-8. But many times you can "confirm" it's definitely not: see https://issues.apache.org/jira/browse/SOLR-2003 for a practical use of this, where we throw an exception if we can detect that your stopwords or synonyms file is definitely wrongly encoded.
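A sketch of the "paranoid" reader described above; the file name and charset are placeholders:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class StrictUtf8Read {
        public static void main(String[] args) throws Exception {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("stopwords.txt"), decoder));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process the line; if the bytes are not valid UTF-8 we never get this far
                }
            } catch (CharacterCodingException e) {
                System.err.println("definitely not valid UTF-8: " + e);
            } finally {
                reader.close();
            }
        }
    }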
Re: verifying that an index contains ONLY utf-8
The tokens that Lucene sees (pre-4.0) are char[] based (i.e., UTF-16), so the first place where invalid UTF-8 is detected/corrected/etc. is during your analysis process, which takes your raw content and produces char[] based tokens.

Second, during indexing, Lucene ensures that the incoming char[] tokens are valid UTF-16. If an invalid char sequence is hit, e.g. a naked (unpaired) surrogate or an invalid surrogate pair, the behavior is undefined, but today Lucene will replace such invalid chars with the Unicode replacement character U+FFFD, so you could iterate all terms looking for that replacement char.

Mike

On Wed, Jan 12, 2011 at 5:16 PM, Paul wrote:
> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
>
> 1) Is there any way to be sure during indexing (by setting something
> in the solr configuration?) that the documents that we index will
> always be stored in utf-8? Can solr convert documents that need
> converting on the fly, or can solr reject documents containing illegal
> characters?
>
> 2) Is there a way to scan the existing index to find any string
> containing non-utf8 characters? Or is there another way that I can
> discover if any crept into my index?
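A sketch of that term scan against a 3.x-era index (the thread's vintage; the API changed in Lucene 4.0), with the index path as a placeholder:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class FindReplacementChars {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
            try {
                TermEnum terms = reader.terms();
                while (terms.next()) {
                    Term t = terms.term();
                    // U+FFFD is what Lucene substitutes for invalid char sequences.
                    if (t.text().indexOf('\uFFFD') >= 0) {
                        System.out.println("suspect term in field '" + t.field() + "': " + t.text());
                    }
                }
                terms.close();
            } finally {
                reader.close();
            }
        }
    }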
Re: verifying that an index contains ONLY utf-8
Scanning for only 'valid' UTF-8 is definitely not simple. You can eliminate some obviously-not-valid UTF-8 things by byte ranges, but you can't confirm valid UTF-8 by byte ranges alone; there are some bytes that can only come after or before certain other bytes to be valid UTF-8.

There is no good way to do what you're doing. Once you've lost track of what encoding something is in, you are reduced to applying heuristics to text strings to guess what encoding they were meant to be. There is no cheap way to do this to an entire Solr index; you're just going to have to fetch every single stored field (indexed fields are pretty much lost to you) and apply heuristic algorithms to it.

Keep in mind that Solr really shouldn't ever be used as your canonical _store_ of data; Solr isn't a 'store', it's an index. So you really ought to have this stuff stored somewhere else if you want to be able to examine or modify it like this, and deal with it there. This isn't really a Solr question at all, even if you are querying Solr on stored fields to try and guess their char encodings.

There are various packages of such heuristic algorithms to guess char encoding; I wouldn't try to write my own. icu4j might include such an algorithm, not sure.

On 1/13/2011 1:12 PM, Peter Karich wrote:
> take a look also into icu4j which is one of the contrib projects ...
> converting on the fly is not supported by Solr but should be relatively
> easy in Java. Also scanning is relatively simple (accept only a range).
> Detection too: http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. We want the index to only contain UTF-8
>> encoded characters. I have a couple questions about this:
>>
>> 1) Is there any way to be sure during indexing (by setting something
>> in the solr configuration?) that the documents that we index will
>> always be stored in utf-8? Can solr convert documents that need
>> converting on the fly, or can solr reject documents containing illegal
>> characters?
>>
>> 2) Is there a way to scan the existing index to find any string
>> containing non-utf8 characters? Or is there another way that I can
>> discover if any crept into my index?
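One way to do that sweep over stored fields, sketched with SolrJ; the URL, field names, and page size are assumptions, and the U+FFFD check only catches values that already failed to decode somewhere upstream (anything subtler needs a heuristic detector such as ICU's CharsetDetector):

    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class SweepStoredFields {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            int rows = 500;
            for (int start = 0; ; start += rows) {
                SolrQuery q = new SolrQuery("*:*").setStart(start).setRows(rows);
                List<SolrDocument> page = server.query(q).getResults();
                if (page.isEmpty()) {
                    break; // walked the whole index
                }
                for (SolrDocument doc : page) {
                    String text = (String) doc.getFieldValue("text"); // hypothetical stored field
                    if (text != null && text.indexOf('\uFFFD') >= 0) {
                        System.out.println("suspect document: " + doc.getFieldValue("id"));
                    }
                }
            }
        }
    }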
Re: verifying that an index contains ONLY utf-8
take a look also into icu4j which is one of the contrib projects ...

> converting on the fly is not supported by Solr but should be relatively
> easy in Java. Also scanning is relatively simple (accept only a range).
> Detection too: http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. We want the index to only contain UTF-8
>> encoded characters. I have a couple questions about this:
>>
>> 1) Is there any way to be sure during indexing (by setting something
>> in the solr configuration?) that the documents that we index will
>> always be stored in utf-8? Can solr convert documents that need
>> converting on the fly, or can solr reject documents containing illegal
>> characters?
>>
>> 2) Is there a way to scan the existing index to find any string
>> containing non-utf8 characters? Or is there another way that I can
>> discover if any crept into my index?

--
http://jetwick.com open twitter search
Re: verifying that an index contains ONLY utf-8
converting on the fly is not supported by Solr but should be relatively easy in Java. Also scanning is relatively simple (accept only a range). Detection too: http://www.mozilla.org/projects/intl/chardet.html

> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
>
> 1) Is there any way to be sure during indexing (by setting something
> in the solr configuration?) that the documents that we index will
> always be stored in utf-8? Can solr convert documents that need
> converting on the fly, or can solr reject documents containing illegal
> characters?
>
> 2) Is there a way to scan the existing index to find any string
> containing non-utf8 characters? Or is there another way that I can
> discover if any crept into my index?

--
http://jetwick.com open twitter search
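A sketch of the kind of conversion meant here, done before the document ever reaches Solr; the source charset is assumed to be known (or guessed) rather than hard-coded as in this example:

    public class ToUtf8 {
        // Decode raw bytes with the charset they actually arrived in,
        // then re-encode as UTF-8 before handing the text to Solr.
        public static byte[] toUtf8(byte[] raw, String sourceCharset) throws Exception {
            String decoded = new String(raw, sourceCharset);
            return decoded.getBytes("UTF-8");
        }

        public static void main(String[] args) throws Exception {
            byte[] latin1 = "café".getBytes("ISO-8859-1"); // pretend input from a third party
            byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
            System.out.println(new String(utf8, "UTF-8"));
        }
    }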
Re: verifying that an index contains ONLY utf-8
This is supposed to be dealt with outside the index. All input must be UTF-8 encoded; failing to do so will give unexpected results.

> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
>
> 1) Is there any way to be sure during indexing (by setting something
> in the solr configuration?) that the documents that we index will
> always be stored in utf-8? Can solr convert documents that need
> converting on the fly, or can solr reject documents containing illegal
> characters?
>
> 2) Is there a way to scan the existing index to find any string
> containing non-utf8 characters? Or is there another way that I can
> discover if any crept into my index?
verifying that an index contains ONLY utf-8
We've created an index from a number of different documents that are supplied by third parties. We want the index to only contain UTF-8 encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something in the solr configuration?) that the documents that we index will always be stored in utf-8? Can solr convert documents that need converting on the fly, or can solr reject documents containing illegal characters?

2) Is there a way to scan the existing index to find any string containing non-utf8 characters? Or is there another way that I can discover if any crept into my index?