idea to speed up indexing defaults

Robert Muir Fri, 08 Jan 2010 08:18:46 -0800

Hello,

I have been running some tests with English text, and I noticed that Solr
uses the very slow Porter2 Snowball stemmer by default.
In LUCENE-2194 I have proposed a patch to speed this up, but of course it
will never be picked up by Solr because of the way Snowball is
reimplemented here.
The patch would speed up the default for type text, etc. by about 10%,
which is not much.


But what I would actually like to propose is that the PorterStemFilter
(Porter 1) from Lucene core be defined as the default instead.
It is significantly faster than the Porter2 Snowball stemmer (my
indexing speed was about 2x as fast!).
I also ran some relevance tests on a test collection, and it actually
came out on top for relevance, too.

I suppose the thing blocking the use of PorterStemFilter is the protWords
functionality, but in LUCENE-1515 I proposed adding this to all Lucene
stemmers, so maybe we could remove the Snowball duplication and
possibly change the default stemmer to the faster PorterStemFilter in
Lucene core.
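
For illustration only, the change would look roughly like this in
schema.xml. This is an abbreviated sketch, not the full example schema:
the tokenizer and the lowercase filter are placeholders for whatever the
real analyzer chain contains, and only the stemmer line is the point.
Note that the second variant drops the protected attribute, which is
exactly the protWords gap described above.

```xml
<!-- Current default (abbreviated): Snowball English (Porter2) stemmer -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

<!-- Proposed default (abbreviated): Porter 1 stemmer from Lucene core.
     No protected-words support here yet; see LUCENE-1515. -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```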

So basically, I am asking: is there a specific reason this slower
Snowball("English") Porter2 filter is defined as the default?

If there isn't, I'd like to suggest we move in these directions,
although it will take some time and will not really work until Solr and
Lucene are synced up again.

Thanks in advance for any ideas.

-- 
Robert Muir
rcm...@gmail.com
