Re: How to handle words that stem to stop words

2014-07-07 Thread David Murgatroyd
Arjen,

An approach requiring less list maintenance would be to use more advanced
linguistic processing, such as lemmatization rather than stemming, to
distinguish the stop word from the content word.

A commercial offering, Rosette Search Essentials from Basis
<http://www.basistech.com/search-essentials/> (full disclosure: my
employer), which is free for development use and can be downloaded via that
link, uses textual context to disambiguate lemmas as in the screenshot
below -- compare the lemma for token #13 (van) v. token #25 (vans). (I
don't read/write Dutch; I took these snippets from the web.) The work
integrating OpenNLP <https://issues.apache.org/jira/browse/LUCENE-2899>
might also prove helpful.

Best,
David Murgatroyd
www.linkedin.com/in/dmurga/ <http://www.linkedin.com/in/dmurga/>

[image: Inline image 1]

On Mon, Jul 7, 2014 at 5:53 PM, Sujit Pal wrote:

> Hi Arjen,
>
> You could also mark a token as "keyword" so the stemmer passes it through
> unchanged. For example, per the Javadocs for PorterStemFilter:
>
> http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html
>
> Note: This filter is aware of the KeywordAttribute
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true>.
> To prevent certain terms from being passed to the stemmer,
> KeywordAttribute.isKeyword()
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true#isKeyword()>
> should be set to true in a previous TokenStream
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/TokenStream.html?is-external=true>.
> Note: For including the original term as well as the stemmed version, see
> KeywordRepeatFilterFactory
> <http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html>.
>
> Assuming your stemmer is also KeywordAttribute-aware, you could build a
> filter that reads a list of words (such as "vans") that should be protected
> from stemming and marks them with the KeywordAttribute, then place that
> filter before the stemmer in your analysis chain.
>
> -sujit
>
>
> On Mon, Jul 7, 2014 at 2:06 PM, Tri Cao wrote:
>
> > I think emitting two tokens for "vans" is the right (potentially only)
> > way to do it. You could also control the dictionary of terms that
> > require this special treatment.
> >
> > Is there any reason you are not happy with this approach?
> >
> > On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden <
> > acmmail...@tweakers.net> wrote:
> >
> > Hello list,
> >
> > We have a fairly large Lucene database for a 30+ million post forum.
> > Users post and search for all kinds of things. To make sure users don't
> > have to type exact matches, we combine a WordDelimiterFilter with a
> > (Dutch) SnowballFilter.
> >
> > Unfortunately, users sometimes find words that get stemmed to
> > a word that's basically a stop word. Or, conversely, a very common
> > word is stemmed so that it becomes the same as a rare word.
> >
> > We do index stop words, so theoretically users could still find their
> > result. But when a rare word is stemmed in such a way that it yields a
> > million hits, the results become practically unusable...
> >
> > One example is the Dutch word 'van' which is the equivalent of 'of' in
> > English. A user tried to search for the shoe brand 'vans', which gets
> > stemmed to 'van' and obviously gives useless results.
> >
> > I have already noticed the KeywordRepeatFilter, which indexes/searches
> > both 'vans' and 'van', and the StemmerOverrideFilter, which could be
> > used to try to prevent these cases.
> > Are there any other solutions for these kinds of problems?
> >
> > Best regards,
> >
> > Arjen van der Meijden
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
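
The keyword-protection idea Sujit describes above can be sketched in plain
Java without any Lucene dependency. This is only a toy model under stated
assumptions: the PROTECTED set, the toyStem method and the analyze method are
hypothetical names, and toyStem is a stand-in that strips a single trailing
"s" rather than the Dutch Snowball stemmer. In a real analysis chain this
appears to be the role of a keyword marker filter such as
SetKeywordMarkerFilter placed before the stemmer; that is worth verifying
against the Javadocs for the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of keyword protection before stemming; not Lucene code. */
public class ProtectedStemSketch {

    /** Hypothetical list of words that must never be stemmed. */
    static final Set<String> PROTECTED = new HashSet<>(Arrays.asList("vans"));

    /** Stand-in stemmer: strips one trailing "s" from words longer than 3 chars. */
    static String toyStem(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    /** Lowercases, splits on whitespace, and stems every token that is not protected. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields one empty fragment; skip it
            }
            String token = raw.toLowerCase(Locale.ROOT);
            out.add(PROTECTED.contains(token) ? token : toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // "vans" survives because it is protected; "films" is stemmed to "film".
        System.out.println(analyze("Vans en films van Nike"));
        // prints [vans, en, film, van, nike]
    }
}
```

The same list has to be applied at index time and at query time, otherwise a
query for 'vans' would still be looking for a term that was never indexed.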


Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread David Murgatroyd
[apologies for the earlier errant send]

I think
  BooleanQuery bq = new BooleanQuery(false);
doesn't quite accomplish the desired "name IN (dick, rich)" scoring
behavior. This is because (name:dick | name:rich) with coord=false would
score the 'document' "Dick Rich" higher than "Rich" because the former has
two term matches and the latter only one. In contrast, I think the desire
is that one and only one of the terms in the document match those in the
BooleanQuery so that "Rich" would score higher than "Dick Rich", given
document length normalization. It's almost like a desire for
  BooleanQuery bq = new BooleanQuery(false);
  bq.set*Maximum*NumberShouldMatch(1);

Is there a good way to accomplish this?
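
To make the comparison concrete, here is a toy sketch of the two scoring
behaviors for the query (name:dick | name:rich). It is not Lucene code and it
rests on simplifying assumptions: every matching term contributes the same
score, idf, query norm and boosts are ignored, and length normalization is
taken as 1 / sqrt(number of terms). The class and method names are
hypothetical. The "max" variant counts only the best matching clause, which
is how DisjunctionMaxQuery behaves with a tie-breaker of 0, so that may be one
option worth testing.

```java
import java.util.Locale;

/** Toy scoring model comparing sum-of-clauses with max-of-clauses; not Lucene code. */
public class SumVsMaxSketch {

    /** Assumed length normalization: 1 / sqrt(number of terms in the field). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Sum style (BooleanQuery without coord): every matching clause adds its score. */
    static double sumScore(int matchingTerms, int docLength) {
        return matchingTerms * lengthNorm(docLength);
    }

    /** Max style: only the best matching clause counts, extra matches add nothing. */
    static double maxScore(int matchingTerms, int docLength) {
        return matchingTerms > 0 ? lengthNorm(docLength) : 0.0;
    }

    public static void main(String[] args) {
        // Document "Rich": 1 matching term, length 1.
        // Document "Dick Rich": 2 matching terms, length 2.
        System.out.println(String.format(Locale.ROOT,
                "sum: Rich=%.3f, Dick Rich=%.3f", sumScore(1, 1), sumScore(2, 2)));
        // prints sum: Rich=1.000, Dick Rich=1.414
        System.out.println(String.format(Locale.ROOT,
                "max: Rich=%.3f, Dick Rich=%.3f", maxScore(1, 1), maxScore(2, 2)));
        // prints max: Rich=1.000, Dick Rich=0.707
    }
}
```

Under these assumptions the sum ranks "Dick Rich" first, while the max ranks
"Rich" first, which is the ordering asked for above.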

On Thu, Apr 19, 2012 at 7:37 PM, Robert Muir wrote:

> On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies wrote:
> > I see why I'm so confused, but I think I need to construct a simpler
> > test case.
> >
> > My top-level BooleanQuery, which has disableCoord=false, has 22
> > clauses. All but three are ordinary SHOULD TermQueries. The remaining
> > three are a spanNear, a nested BooleanQuery, and an empty PhraseQuery
> > (that's a bug).
> >
> > However, at the end of the explain trace, I see:
> >
> > 0.45 = coord(9/20)
> >
> > I think that my nested Boolean, for which I've been flipping coord on
> > and off to see what happens, is somehow not participating at all. So
> > switching its coord on and off has no effect.
> >
> > Why 20? Why not 22? Is this just an explain quirk?
>
> I am not sure (I'm also not sure I totally understand your example), but
> it could be as simple as the fact that you have 2 prohibited
> (MUST_NOT) clauses. These don't count towards coord().
>
> I think it's hard to tell from your description (just because it doesn't
> have all the details). An explain output or a test case or something like
> that might be more efficient if it's still not making sense...
>
> --
> lucidimagination.com
>
>
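
Robert's hypothesis about the 20 in coord(9/20) can be checked with a little
arithmetic. The sketch below is not Lucene code: the Occur enum mirrors the
names in BooleanClause.Occur, the coord method follows the overlap /
maxOverlap formula of the default similarity, and the clause mix (20 SHOULD
plus 2 MUST_NOT) is an assumption made up to match the numbers in the thread,
not something confirmed by Benson's query.

```java
import java.util.Arrays;

/** Toy illustration of how prohibited clauses shrink the coord denominator. */
public class CoordSketch {

    /** Clause types, named after BooleanClause.Occur. */
    enum Occur { MUST, SHOULD, MUST_NOT }

    /** Counts clauses that can contribute to coord: everything except MUST_NOT. */
    static int maxOverlap(Occur[] clauses) {
        int n = 0;
        for (Occur o : clauses) {
            if (o != Occur.MUST_NOT) {
                n++;
            }
        }
        return n;
    }

    /** Coord factor: matched clauses divided by clauses that could have matched. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // Assumed clause mix: 22 clauses in total, 2 of them prohibited.
        Occur[] clauses = new Occur[22];
        Arrays.fill(clauses, Occur.SHOULD);
        clauses[20] = Occur.MUST_NOT;
        clauses[21] = Occur.MUST_NOT;

        int max = maxOverlap(clauses);
        System.out.println("coord(9/" + max + ") = " + coord(9, max));
        // prints coord(9/20) = 0.45
    }
}
```

If the real query has no MUST_NOT clauses, this explanation does not apply
and the explain output would need a closer look.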


Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread David Murgatroyd




On Apr 19, 2012, at 6:36 PM, Benson Margulies wrote:

> I see why I'm so confused, but I think I need to construct a simpler test
> case.
>
> My top-level BooleanQuery, which has disableCoord=false, has 22
> clauses. All but three are ordinary SHOULD TermQueries. The remaining
> three are a spanNear, a nested BooleanQuery, and an empty PhraseQuery
> (that's a bug).
>
> However, at the end of the explain trace, I see:
>
> 0.45 = coord(9/20)
>
> I think that my nested Boolean, for which I've been flipping coord on
> and off to see what happens, is somehow not participating at all. So
> switching its coord on and off has no effect.
>
> Why 20? Why not 22? Is this just an explain quirk? Should I shove all
> this code up to 3.6 from 2.9.3 before bugging you further?