Sebi wrote:
OK Alexander, I understand this. How can I manage this situation? I will index all words from the text fields (this is the default behavior of the tokenizer, isn't it?), so there will be words like 'and', 'a', 'an', 'than' and many others which will appear in many documents. I know that the MySQL fulltext index has a full list of these common words and excludes them from the index.

Tell me how I can select common terms in an efficient way. Where should I add this? Is there a class which I can extend?
I await your answer.

There are two additional analyzer filters (thanks to Lukas!).

StopWords filter and ShortWords filter.

Usage example:
---------------------------
// Stop words passed directly as an array.
$stopWords = array('a', 'an', 'at', 'the', 'and', 'or', 'is', 'am');
$stopWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_StopWords($stopWords);

// Attach the filter to an analyzer.
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($stopWordsFilter);

// Use this analyzer for all subsequent indexing and searching.
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
---------------------------
// Stop words loaded from a file; $my_stopwords_file holds the path to your stop-word list.
$stopWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_StopWords();
$stopWordsFilter->loadFromFile($my_stopwords_file);

$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($stopWordsFilter);

Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
---------------------------
// Short-words filter; with no argument it uses its default minimum length.
$shortWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_ShortWords();

$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($shortWordsFilter);

Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
---------------------------
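
To give a rough idea of what the two filters do to a token stream, here is a plain-PHP sketch that does not use Zend_Search_Lucene at all. The function name filterTokens and the minimum length of 2 are assumptions made for this illustration; it is not the library's code, only the same idea applied to an array of words.

```php
<?php
// Hypothetical helper, NOT part of Zend_Search_Lucene: drops stop words and
// tokens shorter than $minLength, mimicking the effect of the two filters.
function filterTokens(array $tokens, array $stopWords, $minLength = 2)
{
    // Flip the list so each lookup is a cheap isset() check.
    $stopSet = array_flip(array_map('strtolower', $stopWords));

    $kept = array();
    foreach ($tokens as $token) {
        $term = strtolower($token);
        if (strlen($term) < $minLength) {
            continue; // too short: what a short-words filter would drop
        }
        if (isset($stopSet[$term])) {
            continue; // listed stop word: what a stop-words filter would drop
        }
        $kept[] = $term;
    }
    return $kept;
}

// Split a sample sentence into alphanumeric tokens and filter them.
$tokens = preg_split('/[^a-z0-9]+/i', 'The cat is at a door', -1, PREG_SPLIT_NO_EMPTY);
$stopWords = array('a', 'an', 'at', 'the', 'and', 'or', 'is', 'am');
print_r(filterTokens($tokens, $stopWords));
```

For the sample sentence only 'cat' and 'door' survive. In the real analyzer the same decision is made per token inside the filter, so the common words never reach the index.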

I've just updated the documentation (the "Extensibility" section of the Zend_Search chapter) and made some small fixes.
Please use the SVN version to work with these filters.


With best regards,
   Alexander Veremyev.

