RE: Relevancy : Keyword stuffing

Markus Jelsma Mon, 16 Mar 2015 15:07:41 -0700

Hello - Chris' suggestion is indeed a good one but it can be tricky to properly 
configure the parameters. Regarding position information, you can override 
dismax to have it use SpanFirstQuery. It allows for setting strict boundaries 
from the front of the document to a given position. You can also override 
SpanFirstQuery to incorporate a gradient, to decrease boosting as distance from 
the front increases.
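
As a rough illustration of the gradient idea, here is a small sketch in plain arithmetic. It does not use the Lucene SpanFirstQuery API; the function name, the linear decay, the max_boost value and the end cutoff are all assumptions made for this example.

```python
def position_boost(first_pos, end, max_boost=2.0):
    """Boost for a term whose first occurrence is at position `first_pos`.

    Hypothetical helper: a hard boundary at `end` (occurrences at or beyond
    it get no extra boost) plus a linear gradient inside the boundary.
    """
    if first_pos < 0:
        raise ValueError("position must be non-negative")
    if first_pos >= end:
        return 1.0  # outside the window: neutral, no boost
    # Linear decay from max_boost at position 0 towards 1.0 at `end`.
    return max_boost - (max_boost - 1.0) * (first_pos / end)


for pos in (0, 25, 50, 99, 100, 500):
    print(pos, round(position_boost(pos, end=100), 3))
```

With this shape, keywords stuffed at the end of a long document fall outside the window and gain nothing, while early occurrences are rewarded most.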


I don't know how you ingest document bodies, but if they are unstructured HTML, 
you may want to install proper main content extraction if you haven't already. 
Having decent control over HTML is a powerful tool.

You may also want to look at Lucene's BM25 implementation. It is simple to set 
up and easier to control. It isn't as rough a tool as TFIDF is with regard to 
length normalization. Plus it allows you to smooth TF, which in your case 
should also help.
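
To make the TF smoothing point concrete, the sketch below compares the square-root tf factor of the classic TF-IDF similarity with the usual BM25 tf formula. It is plain arithmetic rather than Lucene code, it ignores Lucene's lossy norm encoding, and the function names and the example document lengths are assumptions; k1=1.2 and b=0.75 are the commonly used defaults.

```python
import math


def classic_tf(freq):
    """Term-frequency factor of the classic TF-IDF similarity: sqrt(freq)."""
    return math.sqrt(freq)


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency factor with length normalization.

    k1 controls how quickly repeated occurrences stop adding score;
    b controls how strongly document length is taken into account.
    """
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)


# Compare how both factors grow as a term is repeated more often
# in a document of average length (300 terms, hypothetical value).
for freq in (1, 5, 20, 100):
    print(freq, round(classic_tf(freq), 2), round(bm25_tf(freq, 300, 300), 2))
```

The classic factor keeps growing with every repetition, while the BM25 factor can never exceed k1 + 1, so a stuffed keyword quickly stops paying off.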

If you would like to scrutinize SweetSpotSimilarity (SSS) and get some proper 
results, you are more than welcome to share them here :)

Markus
 
-----Original message-----
> From:Mihran Shahinian <slowmih...@gmail.com>
> Sent: Monday 16th March 2015 22:41
> To: solr-user@lucene.apache.org
> Subject: Re: Relevancy : Keyword stuffing
> 
> Thank you Markus and Chris, for pointers.
> For SweetSpotSimilarity I am thinking perhaps a set of closed ranges
> exposed via similarity config is easier to maintain as data changes than
> making adjustments to fit a function. Another piece of info that would've
> been handy is the average position info + position info for the first few
> occurrences of each term. This would perhaps allow higher boosting for term
> occurrences earlier in the doc. In my case extra keywords are towards the
> end of the doc, but that info does not seem to be propagated into the scorer.
> Thanks again,
> Mihran
> 
> 
> 
> On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter <hossman_luc...@fucit.org>
> wrote:
> 
> >
> > You should start by checking out the "SweetSpotSimilarity" .. it was
> > heavily designed around the idea of dealing with things like excessively
> > verbose titles, and keyword stuffing in summary text ... so you can
> > configure your expectation for what a "normal" length doc is, and they
> > will be penalized for being longer than that.  similarly you can say what
> > a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
> > (which in conjunction with the lengthNorm penalty penalizes docs that
> > stuff keywords)
> >
> >
> > https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
> >
> >
> > https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
> >
> > https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
>
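
The length plateau and tf ceiling described in the quoted SweetSpotSimilarity message can be sketched roughly as follows. The formulas are modeled on the ones documented for Lucene's SweetSpotSimilarity and should be checked against the linked Javadoc; the function and parameter names, the default values and the example numbers are assumptions for illustration only.

```python
import math


def sweet_spot_length_norm(num_terms, min_len, max_len, steepness=0.5):
    """Length norm with a plateau: 1.0 inside [min_len, max_len], lower outside."""
    distance = abs(num_terms - min_len) + abs(num_terms - max_len) - (max_len - min_len)
    return 1.0 / math.sqrt(steepness * distance + 1.0)


def hyperbolic_tf(freq, tf_min=0.0, tf_max=2.0, base=1.3, x_offset=10.0):
    """S-shaped tf factor bounded by tf_max, so extra repetitions add no boost.

    Uses tanh, which equals (base**x - base**-x) / (base**x + base**-x)
    evaluated at x * ln(base), without overflowing for large freq.
    """
    x = freq - x_offset
    return tf_min + (tf_max - tf_min) / 2.0 * (math.tanh(x * math.log(base)) + 1.0)


# A document inside the 100..200 term sweet spot versus a much longer one.
print(sweet_spot_length_norm(150, 100, 200), sweet_spot_length_norm(300, 100, 200))
# A normal term frequency versus a stuffed one.
print(hyperbolic_tf(10), hyperbolic_tf(1000))
```

Combining the two factors, a long document that repeats a keyword hundreds of times is penalized by the length norm and gains nothing further from the capped tf.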
