I ended up with this:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"
side="front"/>
and it works great! It's important to specify side, or the N-gram
buildout is really huge. My users generally type their wildcard
searches left-anchored, so generating N-grams from every position
in the word was not only overkill, it was also causing way too
many false positives to hit.
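To give a feel for the size difference, here is a rough Python sketch
(not Solr's code; the function names all_ngrams and front_ngrams are
made up for illustration) that compares grams taken from every
position in a word against front-anchored grams only:

```python
def all_ngrams(word, min_n=3, max_n=6):
    """Every substring of length min_n..max_n, from every start position."""
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]


def front_ngrams(word, min_n=3, max_n=6):
    """Only the prefixes of length min_n..max_n (front-anchored)."""
    return [word[:n] for n in range(min_n, min(max_n, len(word)) + 1)]


word = "carbon"
print(len(all_ngrams(word)), all_ngrams(word))      # 10 grams
print(len(front_ngrams(word)), front_ngrams(word))  # 4 grams
```

With grams from every position, "carbon" also indexes pieces such as
"arb" and "bon", so a query term that merely contains one of those
pieces can match it. With front-anchored grams only the prefixes match.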
To provide some on-the-fly documentation of the filter above, if
you have:
sm333k carbon shoes
the tokens generated, given my specs above, are:
sm3 sm33 sm333 sm333k car carb carbo carbon sho shoe shoes
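For a word with 6 or more characters, it makes the 4 N-grams
of length 3 to 6, each starting at the 1st char (shorter words
just stop at their own length). It's like:
for (i = 3; i <= min(6, length(x)); i++) {
    token = substr(x, 0, i);  // first i characters of x
}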
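The same idea as a small runnable Python sketch. This is only an
illustration, not Solr's implementation: the function name edge_ngrams
and the plain whitespace split are my own assumptions, and in this
sketch a word shorter than min_gram produces no tokens at all.

```python
def edge_ngrams(text, min_gram=3, max_gram=6):
    """Front-anchored n-grams for each whitespace-separated word."""
    tokens = []
    for word in text.split():  # assumption: simple whitespace tokenizing
        # Prefix lengths run from min_gram up to max_gram,
        # capped at the length of the word itself.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


print(" ".join(edge_ngrams("sm333k carbon shoes")))
# sm3 sm33 sm333 sm333k car carb carbo carbon sho shoe shoes
```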
Thanks for pointing me in this direction!
On Thu, Nov 15, 2012 at 4:59 PM, Upayavira <[email protected]> wrote:
> Remember to distinguish between recall and precision - you're likely to
> get too many results, but what matters is whether the first ones are
> useful.
>
> You could have two versions of your field, one with normal stemming,
> another with n-grams, and boost the normal field above the n-gram one,
> give exact matches a boost above inexact matches.
>
> Upayavira
>
> On Thu, Nov 15, 2012, at 09:48 PM, David Alyea wrote:
> > OK, I tried that. Had just Snowball and EdgeNGram
> > in both index and query. When I ran the "sm3 carbon"
> > select, it went from 3,500 matches to 89,000! So yes,
> > that edge building works! But too much. And... the
> > top score matches didn't look at all like "sm3 carbon"
> > products, and the shoes were nowhere in sight. So,
> > I'll toy with it on a dev instance and see what I see.
> > I definitely like the idea and I can see that N-gram
> > tokens are going to behave like wildcarding.
> >
> > On Thu, Nov 15, 2012 at 4:13 PM, Robert Muir <[email protected]> wrote:
> >
> > > On Thu, Nov 15, 2012 at 9:44 AM, David Alyea <[email protected]> wrote:
> > > >
> > > > to index:
> > > > <filter class="solr.PorterStemFilterFactory"/>
> > > > <filter class="solr.KStemFilterFactory"/>
> > > > <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > > >
> > > > to query:
> > > > <filter class="solr.SnowballPorterFilterFactory" language="English"/>
> > > >
> > >
> > > I don't think it's a good idea to use 4 different stemming algorithms
> > > (porter1, kstem, plural at index-time) and porter2 at query-time.
> > > This means you are analyzing terms in a totally different way at index
> > > time than you are at query-time.
> > >
> > > Just pick one of them: make your index-time and query-time analysis
> > > the same as a start and I think you will see fewer surprises.
> > >
>