On Wed, Jul 27, 2011 at 7:15 PM, Pranav Prakash <pra...@gmail.com> wrote:
> I guess most of you have already handled, and many of you might still be
> handling, keyword stuffing. Here is my scenario. We have a huge index
> containing about 6m docs. (Not sure if that is huge :-) Every document
> contains title, description, tags, and content (textual data). People have
> been keyword stuffing the documents, so when someone searches for a "query
> term", the first results are always the ones that have been optimized.
>
> So, instead of people getting relevant results, they get spam content
> (highly optimized, keyword-stuffed content) as the first few results. I
> have tried a couple of things, like providing different boosts to different
> fields, but almost everything seems to fail.
[...]

Presumably, they are doing this by inflating tf (term frequency),
i.e., by repeating keywords many times. If so, you can use a custom
similarity class that caps term frequency, and/or ensures that the score
increases less than linearly with tf. Please see
http://wiki.apache.org/solr/SchemaXml#Similarity for how to configure a
custom similarity, and/or do a web search for more details.
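
To make the idea concrete, here is a rough sketch of such a tf function in
plain Java, with no Lucene dependency so that it runs on its own. The class
name CappedTf and the cap value MAX_FREQ are made up for illustration; they
are not from Solr or Lucene, and the cap would need tuning against your own
data. In a real setup the same logic would presumably go into an override of
tf(float freq) in a subclass of Lucene's DefaultSimilarity, registered in
schema.xml as described on the wiki page above.

```java
// Illustrative only: CappedTf and MAX_FREQ are hypothetical names,
// not part of Solr or Lucene.
class CappedTf {
    // Assumed cap: occurrences beyond this add nothing to the score.
    static final float MAX_FREQ = 5.0f;

    // Sub-linear, capped term-frequency factor.
    // The square root makes each extra occurrence count for less,
    // and the cap makes repetition beyond MAX_FREQ worthless.
    static float tf(float freq) {
        if (freq <= 0.0f) {
            return 0.0f;
        }
        float capped = Math.min(freq, MAX_FREQ);
        return (float) Math.sqrt(capped);
    }

    public static void main(String[] args) {
        // Print the factor for a normal document and for stuffed ones.
        for (float f : new float[] {1f, 4f, 5f, 50f, 500f}) {
            System.out.println("freq=" + f + " tf=" + tf(f));
        }
    }
}
```

With this, a document repeating a keyword 500 times gets the same tf factor
as one repeating it 5 times, so stuffing beyond the cap buys nothing.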

Regards,
Gora
