On Wed, Jul 27, 2011 at 7:15 PM, Pranav Prakash <pra...@gmail.com> wrote:
> I guess most of you have already handled and many of you might still be
> handling keyword stuffing. Here is my scenario. We have a huge index
> containing about 6m docs. (Not sure if that is huge :-) And every document
> contains title, description, tags, content (textual data). People have been
> doing keyword stuffing on the documents, so when searched for a "query
> term", the first results are always the ones who are optimized.
>
> So, instead of people getting relevant results, they get spam content
> (highly optimized, keyword stuffed content) as first few results. I have
> tried a couple of things like providing different boosts to different
> fields, but almost everything seems to fail.
[...]

Presumably, they are doing this by increasing tf (term frequency),
i.e., by repeating keywords multiple times. If so, you can use a custom
similarity class that caps term frequency, and/or ensures that the
scoring increases less than linearly with tf. Please see
http://wiki.apache.org/solr/SchemaXml#Similarity , and/or do a web search
for more details.

Regards,
Gora
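
P.S. To make the two options concrete, here is a rough sketch of what
"capping tf" and "less than linear in tf" could look like. It is plain
Python meant only to show the shape of the curves, not Solr code: the
function names and the cap value of 3 are made up for illustration. In
Lucene the corresponding hook is the tf() method of the Similarity class,
and the default implementation already uses the square root of the
frequency, so a custom class would mainly be adding the cap or a flatter
curve.

```python
import math


def tf_default(freq):
    """Square-root tf, the shape Lucene's default similarity uses."""
    return math.sqrt(freq)


def tf_capped(freq, cap=3.0):
    """Square-root tf, but occurrences beyond `cap` add nothing.

    `cap` is a hypothetical tuning knob; 3 is an arbitrary starting point.
    """
    return math.sqrt(min(freq, cap))


def tf_log(freq):
    """Logarithmic tf: flatter than sqrt for heavily repeated terms."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


if __name__ == "__main__":
    # Compare how each curve rewards a term repeated 1, 3, 10, 100 times.
    print("freq  sqrt  capped  log")
    for freq in (1, 3, 10, 100):
        print(freq,
              round(tf_default(freq), 2),
              round(tf_capped(freq), 2),
              round(tf_log(freq), 2))
```

With the capped variant, a document repeating a keyword 100 times gets
the same tf contribution as one that mentions it 3 times, so stuffing
stops paying off beyond the cap. The right cap value would have to be
found by testing against your own data.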