Hi there, I just wanted to share some experience from a recent project. Maybe someone will find this when googling for a similar problem...
We noticed that tokenization in Zend Lucene gives different results on our dev system (Linux) and on our production system (FreeBSD). Concretely, "SomeName™" was split into "somename" and "tm" on the Linux system, while it was split into "somenametm" on the FreeBSD system.

After some research, we found that Zend Lucene uses ASCII encoding internally by default. To convert the input (UTF-8 in our case) to ASCII, iconv() is called with the target encoding "ASCII//TRANSLIT", which maps characters that are unavailable in the target encoding to similar-looking replacements. This is where the problems begin: FreeBSD's iconv maps ™ to "TM", while Linux's (glibc) maps it to "(TM)". The different replacements then lead to different token splits.

Once we had found the cause, the fix was easy: use a UTF-8 analyzer, e.g. Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive (register it via Zend_Search_Lucene_Analysis_Analyzer::setDefault()). Then no internal mapping to ASCII is done and the problem disappears.

Maybe this should be pointed out in the documentation somehow. Hope it helps.

Anselm

--
View this message in context: http://www.nabble.com/Zend-Lucene-Tokenization-Pitfalls-tp25136364p25136364.html
Sent from the Zend Framework mailing list archive at Nabble.com.
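P.S.: here is a minimal sketch of what happens under the hood. The first part reproduces the transliteration step that Zend Lucene's default ASCII analyzer performs; the exact output depends on the platform's iconv implementation (glibc on Linux typically gives "(TM)", FreeBSD's libiconv gives "TM"). The second (commented-out) part shows the fix, which requires Zend Framework to be on the include path:

```php
<?php
// Reproduce the UTF-8 -> ASCII transliteration step that Zend Lucene
// performs internally by default. The replacement chosen for ™ is
// platform-dependent, which is exactly what caused the differing splits.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', "SomeName™");
var_dump($ascii); // e.g. "SomeName(TM)" on glibc, "SomeNameTM" on FreeBSD

// The fix: register a UTF-8 analyzer once, before any indexing or
// searching, so no ASCII transliteration takes place at all
// (uncomment with Zend Framework available):
//
// Zend_Search_Lucene_Analysis_Analyzer::setDefault(
//     new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
// );
```

With the UTF-8 analyzer set as default, both systems tokenize the input identically, since the transliteration step is skipped entirely.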