Re: Dealing with keyword stuffing
Cool. So I used SweetSpotSimilarity with default params and I see some improvements. However, I can still see some of the 'stuffed' documents coming up in the results. I feel that SweetSpotSimilarity alone is not enough. Going through http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I figured out that there are other things, Pivoted Length Normalization and term frequency normalization, that need fine-tuning too.

Should I create a custom Similarity class that overrides all the default behavior? I guess that should help me get more relevant results. Where should I start with it? Please do not assume less obvious things, I am still learning !! :-)

*Pranav Prakash*
temet nosce
Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

On Thu, Jul 28, 2011 at 17:03, Gora Mohanty g...@mimirtech.com wrote:
> On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote:
> [...]
> > I am not sure how to use SweetSpotSimilarity. I am googling on this, but
> > any useful insights are so much appreciated.
>
> Replace the existing DefaultSimilarity class in schema.xml (look towards
> the bottom of the file) with the SweetSpotSimilarity class, e.g., have a
> line like:
>   <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>
>
> Regards,
> Gora
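For readers following the thread, pivoted length normalization can be illustrated with a small standalone sketch. This is the generic textbook form, normalized so that a document of average length gets a norm of 1.0; it is not necessarily the exact variant used in the linked paper, and it is not Lucene code. The function name and the `slope` values are assumptions chosen for illustration.

```python
def pivoted_length_norm(doc_len, avg_len, slope):
    """Generic pivoted length norm (illustrative, not Lucene's API).

    slope = 0 ignores document length entirely; slope = 1 penalizes
    length in direct proportion to doc_len / avg_len.
    """
    return 1.0 / ((1.0 - slope) + slope * (doc_len / avg_len))


# Hypothetical collection with an average length of 500 terms.
for doc_len in (100, 500, 2000):
    print(doc_len, round(pivoted_length_norm(doc_len, 500, 0.25), 3))
```

The slope is the knob to tune: a smaller slope is more forgiving towards long documents, a larger one punishes them harder.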
Re: Dealing with keyword stuffing
On Thu, Jul 28, 2011 at 08:31, Chris Hostetter hossman_luc...@fucit.org wrote:
> : Presumably, they are doing this by increasing tf (term frequency),
> : i.e., by repeating keywords multiple times. If so, you can use a custom
> : similarity class that caps term frequency, and/or ensures that the scoring
> : increases less than linearly with tf. Please see

In some cases, yes, they are repeating keywords multiple times. They are also stuffing different combinations: Solr, Solr Lucene, Solr Search, Solr Apache, Solr Guide.

> in particular, using something like SweetSpotSimilarity tuned to know what
> values make sense for good content in your domain can be useful because
> it can actually penalize documents that are too short/long or have term
> freqs that are outside of a reasonable expected range.

I am not a Solr expert, but I was thinking in this direction. The ratio of tokens/total_length would be nearer to 1 for a stuffed document, while it would be nearer to 0 for a bogus document. Somewhere between the two lie the documents that are more likely to be meaningful.

I am not sure how to use SweetSpotSimilarity. I am googling on this, but any useful insights are so much appreciated.
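One possible reading of the ratio described above is a keyword density: the fraction of a document's tokens that are the keyword in question. The sketch below assumes that interpretation; the tokenizer, the function name, and the sample strings are all hypothetical and not part of Solr.

```python
import re


def keyword_density(text, keywords):
    """Fraction of tokens in `text` that match one of `keywords`.

    Uses a naive lowercase word tokenizer, which is an assumption;
    a real setup would reuse the analyzer configured for the field.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for token in tokens if token in wanted)
    return hits / len(tokens)


stuffed = "Solr Solr Lucene Solr Search Solr Apache Solr Guide"
normal = "Solr is an open source search platform built on Lucene"
print(round(keyword_density(stuffed, ["solr"]), 2))
print(round(keyword_density(normal, ["solr"]), 2))
```

A value like this could be computed at index time and stored in a field, but where to draw the line between "stuffed" and "meaningful" would have to come from looking at real documents.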
Re: Dealing with keyword stuffing
On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote:
[...]
> I am not sure how to use SweetSpotSimilarity. I am googling on this, but
> any useful insights are so much appreciated.

Replace the existing DefaultSimilarity class in schema.xml (look towards the bottom of the file) with the SweetSpotSimilarity class, e.g., have a line like:
  <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>

Regards,
Gora
Re: Dealing with keyword stuffing
On Wed, Jul 27, 2011 at 7:15 PM, Pranav Prakash pra...@gmail.com wrote:
> I guess most of you have already handled, and many of you might still be
> handling, keyword stuffing. Here is my scenario.
>
> We have a huge index containing about 6m docs. (Not sure if that is huge :-)
> And every document contains title, description, tags, content (textual data).
> People have been doing keyword stuffing on the documents, so when searching
> for a query term, the first results are always the ones that are optimized.
> So, instead of people getting relevant results, they get spam content
> (highly optimized, keyword stuffed content) as the first few results. I have
> tried a couple of things like providing different boosts to different fields,
> but almost everything seems to fail.
[...]

Presumably, they are doing this by increasing tf (term frequency), i.e., by repeating keywords multiple times. If so, you can use a custom similarity class that caps term frequency, and/or ensures that the scoring increases less than linearly with tf. Please see http://wiki.apache.org/solr/SchemaXml#Similarity , and/or do a web search for more details.

Regards,
Gora
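To make the tf suggestion concrete, here is a small standalone sketch of how different tf curves respond to repetition. The square-root curve is what Lucene's DefaultSimilarity uses; the capped and logarithmic variants, along with the cap value of 5, are hypothetical alternatives picked for illustration. A real implementation would put the chosen formula in the tf method of a custom Similarity subclass.

```python
import math


def default_tf(freq):
    """Square-root tf, the curve used by Lucene's DefaultSimilarity."""
    return math.sqrt(freq)


def capped_tf(freq, cap=5):
    """Hypothetical variant: repetitions beyond `cap` add nothing."""
    return math.sqrt(min(freq, cap))


def log_tf(freq):
    """Hypothetical variant: logarithmic growth, flatter than sqrt."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


# Compare how each curve rewards repeating a term 1 to 100 times.
for freq in (1, 5, 25, 100):
    print(freq, round(default_tf(freq), 2), round(capped_tf(freq), 2), round(log_tf(freq), 2))
```

The default curve is already sublinear, yet 100 repetitions still score ten times higher than a single mention, which is why a cap or a flatter curve may be worth trying against stuffed documents.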
Re: Dealing with keyword stuffing
: Presumably, they are doing this by increasing tf (term frequency),
: i.e., by repeating keywords multiple times. If so, you can use a custom
: similarity class that caps term frequency, and/or ensures that the scoring
: increases less than linearly with tf. Please see

in particular, using something like SweetSpotSimilarity tuned to know what values make sense for good content in your domain can be useful because it can actually penalize documents that are too short/long or have term freqs that are outside of a reasonable expected range.

FWIW though: that's really just a generic answer to a generic question. the better you understand your data, the better you can configure solr for it -- and that goes equally for the advice people can give you about how to configure solr.

you haven't given any information about the nature of your data: the types of documents, the authoritative source, the fields involved, where/how/when people edit this data, who is keyword spamming, etc.; or how you want to use it: what types of queries you need to support, what your users' objectives are, etc. That makes it impossible for anyone to suggest anything but the most general answer: "customize your Similarity".

-Hoss
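The "sweet spot" idea can be sketched numerically. The two functions below are modeled on the plateau-shaped length norm and the baseline tf function described in the SweetSpotSimilarity javadoc, rewritten as plain standalone Python rather than as Lucene code. The parameter names and every value used (100, 1000, 0.5, and so on) are assumptions for illustration; sensible values depend on what good documents look like in a given collection.

```python
import math


def sweet_spot_length_norm(num_terms, ln_min, ln_max, steepness):
    """Norm of 1.0 for lengths in [ln_min, ln_max], decaying outside it."""
    distance = abs(num_terms - ln_min) + abs(num_terms - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * distance + 1.0)


def baseline_tf(freq, base, tf_min):
    """Flat at `base` up to tf_min occurrences, square-root growth after."""
    if freq <= tf_min:
        return base
    return math.sqrt(freq + base * base - tf_min)


# Hypothetical sweet spot: documents of 100 to 1000 terms are "normal".
for num_terms in (20, 100, 500, 1000, 5000):
    print(num_terms, round(sweet_spot_length_norm(num_terms, 100, 1000, 0.5), 4))
```

In this sketch, setting both ends of the plateau to 1 with a steepness of 0.5 collapses the length norm to the plain 1/sqrt(length) curve, so the plateau only does something useful once its bounds are tuned to the data.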