Hi all,

I'd like to know the best way to look for strange encodings in a Lucene
index. I have several inputs that may have been encoded in different
character sets, and I don't always know whether my guess about the
encoding was correct. Hence, I thought of querying the index for typical
strings that would reveal bad encodings.

The whole index has already been built with the StandardAnalyzer. I have
read that using a different analyzer at query time can yield unexpected
results... but I suppose that's acceptable for my purpose of testing the
quality of the index.

What do you think is the best way to tackle this? I've been taking a
look at the analyzers, the StandardAnalyzer in particular. I thought
about writing a custom tokenizer that splits on letters, numbers, and
spaces, so that only "weird" strings are left as tokens; those would
reveal bad encodings (see the sketch below). Still, possibly due to my
limited knowledge of Lucene :), I have the feeling this can be done
better somehow.
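
Roughly what I had in mind, as an untested sketch on top of Lucene's
CharTokenizer (the class name is mine; note that I only treat ASCII
letters and digits as separators, since Character.isLetter would also
swallow the accented characters I'm trying to surface):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

/**
 * Rough sketch: emits as tokens only the runs of characters that are
 * neither ASCII letters/digits nor whitespace, i.e. exactly the
 * "weird" strings that often betray a wrong encoding guess.
 */
public class WeirdCharTokenizer extends CharTokenizer {

    public WeirdCharTokenizer(Reader input) {
        super(input);
    }

    protected boolean isTokenChar(char c) {
        // Plain ASCII letters/digits and whitespace act as separators;
        // everything else (including accented letters, where mojibake
        // like "Ã©" tends to live) becomes part of a token.
        boolean asciiAlnum = (c >= 'a' && c <= 'z')
                          || (c >= 'A' && c <= 'Z')
                          || (c >= '0' && c <= '9');
        return !asciiAlnum && !Character.isWhitespace(c);
    }
}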

Thanks a lot in advance!