On Jan 8, 2010, at 11:17 AM, Robert Muir wrote:

> Hello,
>
> I have been running some tests with english and I noticed that Solr
> uses the very slow Porter2 snowball stemmer by default.
> In LUCENE-2194 i have proposed a patch to speed this up, of course it
> will never be picked up by solr due to the way snowball is
> reimplemented here.
> This would increased the default for type text, etc by about 10%, not much.
>
> But actually i would like to propose instead that the PorterStemFilter
> (Porter 1) from lucene core be defined as the default instead.
> This is significantly faster (my indexing speed was like 2x as fast!)
> as this Porter2 snowball stemmer.
> I did some relevance tests on a test collection and it actually came
> out on top as far as relevance, too.
>
> I suppose the thing blocking the use of PorterStemFilter is protWords
> functionality, but in LUCENE-1515 i proposed adding this to all lucene
> stemmers, so maybe we could remove the snowball duplication and
> possibly change the default stemmer to the faster PorterStemFilter in
> lucene core.
>
> so basically, i am asking: is there a specific reason this slower
> Snowball("English") Porter2 filter is defined as a default?

It's a bit odd, but Solr doesn't really have a "default". What it has is an example schema. Unfortunately, everyone treats the example as the default, so...

Yes, it would make sense to speed up the "default" schema as much as possible. There are probably other token filters in there that could be removed, too. It's very good that you are doing this, as I've been wondering lately if it doesn't make sense to seriously evaluate speeding up all the snowball stuff.

> If there isn't, i'd like to suggest we move in these directions,
> although it will take some time and not really work until solr and
> lucene are synced up again.

It shouldn't be that far off, right? I think there is movement underway to put Solr on 3.x.
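
For anyone following along, the stemmer swap under discussion would look roughly like the fragment below in the example schema.xml. This is only a sketch: the field type is abbreviated, the tokenizer and lowercase filter are assumed stand-ins for whatever the real example schema's analyzer chain contains, and the exact attributes on the existing Snowball line are an assumption rather than a copy of the shipped file.

```xml
<!-- Sketch only: an abbreviated "text" field type, not the full example schema. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumed stand-ins for the rest of the analyzer chain. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>

    <!-- Stemmer the example schema is described as using (Snowball "English", i.e. Porter2): -->
    <!-- <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> -->

    <!-- Proposed replacement: the Porter 1 stemmer from Lucene core.
         No protected words attribute is shown here, since that support
         is what LUCENE-1515 proposes to add. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Since stemming changes the indexed terms, switching filters like this would require reindexing, and the 2x indexing figure quoted above is Robert's measurement, not something this fragment demonstrates.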