Hi there,

just wanted to share some experiences from a recent project. Maybe
someone will find this when googling for a similar problem...

We noticed that tokenization in Zend Lucene gives different results
on our dev system (Linux) and our production system (FreeBSD). Concretely,
"SomeName™" was split into "somename" and "tm" on the Linux system, while
it was tokenized as "somenametm" on the FreeBSD system.

After some research, we found that Zend Lucene uses ASCII encoding
internally by default. To convert the input (UTF-8 in our case) to
ASCII, iconv() is called with the target encoding "ASCII//TRANSLIT", which
maps characters that are not available in the target encoding to
similar-looking characters.
This is where the problems begin: FreeBSD's iconv maps ™ to "TM", while
glibc on Linux maps it to "(TM)". Hence the different token splits.
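You can reproduce the difference directly on the command line; the exact output is platform-dependent (that is the whole point), so treat the results below as typical rather than guaranteed:

```shell
# Transliterate the trademark sign to ASCII with iconv.
# glibc-based Linux typically prints "SomeName(TM)",
# while FreeBSD's iconv typically prints "SomeNameTM".
printf 'SomeName™\n' | iconv -f UTF-8 -t 'ASCII//TRANSLIT'
```

Since the parentheses act as token separators for the analyzer, the Linux result is split into two tokens and the FreeBSD result is not.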

However, once we had found the cause of the problem, it was easy to fix:
just use a UTF-8 analyzer, e.g.
Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive (set it
via Zend_Search_Lucene_Analysis_Analyzer::setDefault()). Then no internal
mapping to ASCII is done and the problem disappears.
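For reference, a minimal sketch of that fix — the analyzer must be registered before indexing (and before searching, so queries are tokenized the same way); this assumes the Zend Framework 1 class files are on your include path:

```php
<?php
// Register a UTF-8-aware analyzer as the application-wide default,
// so Zend_Search_Lucene skips the lossy iconv() conversion to ASCII.
// Utf8Num keeps letters and digits; CaseInsensitive lowercases tokens.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);
```

With this in place, "SomeName™" is tokenized identically on both systems, regardless of what the platform's iconv would have done.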

Maybe this should be pointed out in the documentation somehow.

Hope it helps.
 Anselm
-- 
View this message in context: 
http://www.nabble.com/Zend-Lucene-Tokenization-Pitfalls-tp25136364p25136364.html
Sent from the Zend Framework mailing list archive at Nabble.com.
