Sebi wrote:
OK Alexander. I understand this. How can I manage this situation?
Because I will index all words from text fields (this is the default
behavior of the tokenizer, isn't it?). So, there will be words like
'and', 'a', 'an', 'than' and many others which will appear in many
documents. I know that the MySQL fulltext index has a full list of these
common words and excludes them from the index.
How can I select common terms in an efficient way? Where should
I add this? Is there a class which I can extend?
I look forward to your answer.
There are now two additional analyzer filters (thanks to Lukas!):
the StopWords filter and the ShortWords filter.
Usage examples:
---------------------------
// Stop words passed directly as an array.
$stopWords = array('a', 'an', 'at', 'the', 'and', 'or', 'is', 'am');
$stopWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_StopWords($stopWords);
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($stopWordsFilter);
// Make this analyzer the default for indexing and searching.
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
---------------------------
// Stop words loaded from a file instead of an array.
$stopWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_StopWords();
$stopWordsFilter->loadFromFile($my_stopwords_file);
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($stopWordsFilter);
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
---------------------------
// Short words are dropped by the ShortWords filter.
$shortWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_ShortWords();
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($shortWordsFilter);
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
---------------------------
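To illustrate what the two filters do to the token stream, here is a minimal plain-PHP sketch. It does not use Zend_Search_Lucene at all and is not how the library implements the filters; the function names sketchTokenize and sketchFilterTokens are hypothetical, and the minimum length of 2 is an assumption chosen for the example.

```php
<?php
// Plain-PHP illustration only. These helper names are hypothetical and are
// not part of Zend_Search_Lucene.

/**
 * Split text into lower-cased alphanumeric tokens, roughly what a
 * case-insensitive text+number analyzer would produce.
 */
function sketchTokenize($text)
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
    return $matches[0];
}

/**
 * Drop tokens that are stop words or shorter than $minLength characters.
 * The default of 2 is an assumption made for this sketch.
 */
function sketchFilterTokens(array $tokens, array $stopWords, $minLength = 2)
{
    // Flip the list so each lookup is a hash-key check, not a linear scan.
    $stopSet = array_flip($stopWords);

    $kept = array();
    foreach ($tokens as $token) {
        if (isset($stopSet[$token])) {
            continue; // stop word: never reaches the index
        }
        if (strlen($token) < $minLength) {
            continue; // too short: never reaches the index
        }
        $kept[] = $token;
    }
    return $kept;
}

$stopWords = array('a', 'an', 'at', 'the', 'and', 'or', 'is', 'am');
$tokens    = sketchTokenize('The cat and a dog at home');

// Keeps only 'cat', 'dog' and 'home'.
print_r(sketchFilterTokens($tokens, $stopWords));
```

Because the filtered words never become index terms, the same analyzer has to be used at search time too, so that query terms are filtered the same way. That is why the examples above install it with setDefault().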
I've just updated the documentation (the "Extensibility" section of
Zend_Search) and made some small fixes.
Please use the SVN version to work with these filters.
With best regards,
Alexander Veremyev.