RE: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
So you're allowed to put the entire original document in a stored field in 
Solr, but you aren't allowed to stick it in, say, a redis or couchdb too? Ah, 
bureaucracy. But there's no reason what you're doing won't work, as you of 
course already know from doing it.  

If you actually know the charset of a document when indexing it, you might want 
to consider putting THAT in a stored field; it's easier to keep track of the 
encoding you know than to try and guess it again later. 
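
Something like this SolrJ sketch could do that (the URL and field names are
made up, and it assumes your schema has a stored string field called "charset"):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithCharset {
    public static void main(String[] args) throws Exception {
        // hypothetical core URL
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", "the document body, already decoded");
        // remember what we decoded it FROM, so we never have to guess later
        doc.addField("charset", "ISO-8859-1");
        server.add(doc);
        server.commit();
    }
}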


From: Paul [p...@nines.org]
Sent: Thursday, January 13, 2011 6:21 PM
To: solr-user@lucene.apache.org
Subject: Re: verifying that an index contains ONLY utf-8

Thanks for all the responses.

CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the solr index is the
only place it exists (to us). I do have a java app that "reindexes",
i.e., reads all documents out of one index, does some transform on
them, then writes them to a second index. So I already have a place
where all the data in the index streams by. I wanted to make sure
there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check
whether a string is possibly valid UTF-8 and leave it alone if so. That
way I won't be introducing more errors, and maybe I can detect a large
percentage of the non-UTF-8 strings.

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir  wrote:
> it does: 
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> this takes a sample of the file and makes a guess.


Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Paul
Thanks for all the responses.

CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the solr index is the
only place it exists (to us). I do have a java app that "reindexes",
i.e., reads all documents out of one index, does some transform on
them, then writes them to a second index. So I already have a place
where all the data in the index streams by. I wanted to make sure
there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll first check
whether a string is possibly valid UTF-8 and leave it alone if so. That
way I won't be introducing more errors, and maybe I can detect a large
percentage of the non-UTF-8 strings.
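
A strict decoder can make that "possible UTF-8" test concrete; this is
only a sketch, assuming you can get at the raw bytes of each stored value:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Utf8Check {
    // true only if the bytes are well-formed UTF-8; the REPORT actions
    // make the decoder throw instead of quietly substituting U+FFFD
    public static boolean isPossiblyUtf8(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}

Only values that fail this test would be handed to the transform step.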

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir  wrote:
> it does: 
> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
> this takes a sample of the file and makes a guess.


Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Robert Muir
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind  wrote:
>
> There are various packages of such heuristic algorithms to guess char
> encoding, I wouldn't try to write my own. icu4j might include such an
> algorithm, not sure.
>

it does: 
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
this takes a sample of the file and makes a guess.
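
A minimal sketch of its use (the CharsetDetector calls are the real ICU4J
API; the surrounding class is mine):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class GuessEncoding {
    // feed the detector a sample of raw bytes and take its best guess;
    // confidence runs 0-100, and it is a guess, not a proof
    public static String guess(byte[] sample) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(sample);
        CharsetMatch match = detector.detect();
        if (match == null) {
            return "unknown";
        }
        return match.getName() + " (confidence " + match.getConfidence() + ")";
    }
}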

also, in general keep in mind that java CharsetDecoders tend to
silently replace or skip illegal chars, rather than throw exceptions.

If you want to be "paranoid" about these things, instead of opening an
InputStreamReader with a Charset, open it with something like

charset.newDecoder()
       .onMalformedInput(CodingErrorAction.REPORT)
       .onUnmappableCharacter(CodingErrorAction.REPORT)

Then if the decoder ends up in some illegal state/byte sequence,
instead of silently replacing with U+FFFD, it will throw an exception.
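
A runnable sketch of that (my example, not SOLR-2003's code; the file to
check comes from the command line):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictUtf8Read {
    public static void main(String[] args) throws Exception {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(args[0]), decoder));
        try {
            while (reader.readLine() != null) {
                // just reading is enough; a bad byte sequence throws
            }
            System.out.println("no malformed UTF-8 found");
        } catch (CharacterCodingException e) {
            System.out.println("definitely not valid UTF-8: " + e);
        } finally {
            reader.close();
        }
    }
}
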
Of course as Jonathan says, you cannot "confirm" that something is UTF-8.

But many times, you can "confirm" it's definitely not: see
https://issues.apache.org/jira/browse/SOLR-2003 for a practical use of
this, where we throw an exception if we can detect that your stopwords
or synonyms file is definitely wrongly encoded.


Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Michael McCandless
The tokens that Lucene sees (pre-4.0) are char[] based (ie, UTF16), so
the first place where invalid UTF8 is detected/corrected/etc. is
during your analysis process, which takes your raw content and
produces char[] based tokens.

Second, during indexing, Lucene ensures that the incoming char[]
tokens are valid UTF16.

If an invalid char sequence is hit, e.g. a naked (unpaired) surrogate or
an invalid surrogate pair, the behavior is undefined, but, today, Lucene
will replace such invalid chars with the Unicode replacement character
U+FFFD, so you could iterate all terms looking for that replacement char.
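
That scan could look something like this sketch against the pre-4.0 API
(note a hit could also be a legitimate U+FFFD from the source text):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class FindReplacementChars {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            // walk every indexed term and report any containing U+FFFD
            TermEnum terms = reader.terms();
            while (terms.next()) {
                Term t = terms.term();
                if (t.text().indexOf('\uFFFD') >= 0) {
                    System.out.println(t.field() + ": " + t.text());
                }
            }
            terms.close();
        } finally {
            reader.close();
        }
    }
}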

Mike

On Wed, Jan 12, 2011 at 5:16 PM, Paul  wrote:
> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
>
> 1) Is there any way to be sure during indexing (by setting something
> in the solr configuration?) that the documents that we index will
> always be stored in utf-8? Can solr convert documents that need
> converting on the fly, or can solr reject documents containing illegal
> characters?
>
> 2) Is there a way to scan the existing index to find any string
> containing non-utf8 characters? Or is there another way that I can
> discover if any crept into my index?
>


Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Jonathan Rochkind
Scanning for only 'valid' UTF-8 is definitely not simple.  You can 
eliminate some obviously invalid UTF-8 by byte ranges alone, but you 
can't confirm valid UTF-8 that way: some bytes are only valid UTF-8 
when they come directly before or after certain other bytes.


There is no good way to do what you're doing; once you've lost track of 
what encoding something is in, you are reduced to applying heuristics to 
text strings to guess what encoding they are meant to be.


There is no cheap way to do this to an entire Solr index; you're just 
going to have to fetch every single stored field (indexed fields are 
pretty much lost to you) and apply heuristic algorithms to it.  Keep in 
mind that Solr really probably shouldn't ever be used as your canonical 
_store_ of data; Solr isn't a 'store', it's an index.  So you really 
ought to have this stuff stored somewhere else if you want to be able to 
examine it or modify it like this, and just deal with that somewhere 
else.  This isn't really a Solr question at all, even if you are 
querying Solr on stored fields to try and guess their char encodings.
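
If you do go that route, the fetch side is mechanical; a rough SolrJ
sketch (the URL is hypothetical, and paging a big index by start/rows
gets slow toward the end):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ScanStoredFields {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int rows = 100;
        for (int start = 0; ; start += rows) {
            SolrQuery query = new SolrQuery("*:*");
            query.setStart(start);
            query.setRows(rows);
            QueryResponse rsp = server.query(query);
            if (rsp.getResults().isEmpty()) {
                break;
            }
            for (SolrDocument doc : rsp.getResults()) {
                for (String field : doc.getFieldNames()) {
                    Object value = doc.getFieldValue(field);
                    // hand `value` to the heuristic of your choice
                    // here (e.g. ICU's CharsetDetector)
                }
            }
        }
    }
}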


There are various packages of such heuristic algorithms to guess char 
encoding, I wouldn't try to write my own. icu4j might include such an 
algorithm, not sure.


On 1/13/2011 1:12 PM, Peter Karich wrote:

  also take a look at icu4j, which is one of the contrib projects ...


converting on the fly is not supported by Solr but should be relatively
easy in Java.
Also scanning is relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html


We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the documents that we index will
always be stored in utf-8? Can solr convert documents that need
converting on the fly, or can solr reject documents containing illegal
characters?

2) Is there a way to scan the existing index to find any string
containing non-utf8 characters? Or is there another way that I can
discover if any crept into my index?





Re: verifying that an index contains ONLY utf-8

2011-01-13 Thread Peter Karich
 also take a look at icu4j, which is one of the contrib projects ...

> converting on the fly is not supported by Solr but should be relatively
> easy in Java.
> Also scanning is relatively simple (accept only a range). Detection too:
> http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. We want the index to only contain UTF-8
>> encoded characters. I have a couple questions about this:
>>
>> 1) Is there any way to be sure during indexing (by setting something
>> in the solr configuration?) that the documents that we index will
>> always be stored in utf-8? Can solr convert documents that need
>> converting on the fly, or can solr reject documents containing illegal
>> characters?
>>
>> 2) Is there a way to scan the existing index to find any string
>> containing non-utf8 characters? Or is there another way that I can
>> discover if any crept into my index?
>>
>


-- 
http://jetwick.com open twitter search



Re: verifying that an index contains ONLY utf-8

2011-01-12 Thread Peter Karich

converting on the fly is not supported by Solr but should be relatively
easy in Java.
Also scanning is relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html
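
With the juniversalchardet Java port of that Mozilla library (an
assumption; the project linked above is the original C++), a detection
sketch looks like:

import java.io.FileInputStream;
import org.mozilla.universalchardet.UniversalDetector;

public class MozillaDetect {
    public static void main(String[] args) throws Exception {
        UniversalDetector detector = new UniversalDetector(null);
        FileInputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[4096];
        int read;
        // feed bytes until the detector is confident or input ends
        while ((read = in.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, read);
        }
        detector.dataEnd();
        in.close();
        // may be null if the detector could not decide
        System.out.println("guessed encoding: " + detector.getDetectedCharset());
    }
}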

> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
>
> 1) Is there any way to be sure during indexing (by setting something
> in the solr configuration?) that the documents that we index will
> always be stored in utf-8? Can solr convert documents that need
> converting on the fly, or can solr reject documents containing illegal
> characters?
>
> 2) Is there a way to scan the existing index to find any string
> containing non-utf8 characters? Or is there another way that I can
> discover if any crept into my index?
>


-- 
http://jetwick.com open twitter search



Re: verifying that an index contains ONLY utf-8

2011-01-12 Thread Markus Jelsma
This is supposed to be dealt with outside the index. All input must be UTF-8 
encoded. Failing to do so will give unexpected results.

> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple questions about this:
> 
> 1) Is there any way to be sure during indexing (by setting something
> in the solr configuration?) that the documents that we index will
> always be stored in utf-8? Can solr convert documents that need
> converting on the fly, or can solr reject documents containing illegal
> characters?
> 
> 2) Is there a way to scan the existing index to find any string
> containing non-utf8 characters? Or is there another way that I can
> discover if any crept into my index?


verifying that an index contains ONLY utf-8

2011-01-12 Thread Paul
We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the documents that we index will
always be stored in utf-8? Can solr convert documents that need
converting on the fly, or can solr reject documents containing illegal
characters?

2) Is there a way to scan the existing index to find any string
containing non-utf8 characters? Or is there another way that I can
discover if any crept into my index?