Hey-
 
I have an indexer at my company that I wrote a while back that indexes database 
content (users and their profiles). One of the next requirements of the project 
is to avoid 'spam' in hits. For example, if I search for oracle, and oracle 
appears in 25 places in someone's bio field while another person has it in one 
place in his company field, the profile with 25 occurrences will of course score 
higher. Unfortunately, people who know the system know that the more often 
certain keywords appear in your user profile, the higher you will be on the 
list. I was thinking I can do one of two things:
 
1. Work with the Lucene scoring algorithm to lower scores in certain fields (and 
boost others). This would work, but the boost seems to be such a small factor in 
scoring that in some cases it won't matter. (If I boost company to 2.0 and bio 
to 1.0, in some cases a profile with xxx hits in bio still comes first in score.)
 
2. Use Classifier4J (http://classifier4j.sourceforge.net/). I could use the same 
idea as a mail filter and train the Bayesian classifier to treat certain words 
as spam, then index just the summary. I'm just throwing this out there; I'm not 
even sure that it will work.
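
To make the concern in option 1 concrete, here is a rough sketch in plain Java (not Lucene API; the class and method names are made up for illustration). It assumes the classic Lucene-style term-frequency factor of sqrt(freq) and reduces the per-field score to boost * tf, ignoring idf and length normalization, so the numbers are only indicative. It compares that with a capped term frequency, where a field counts once no matter how often the term repeats.

```java
// Sketch only: mimics the boost * tf part of a classic Lucene-style score.
// Hypothetical names; idf and length normalization are deliberately left out.
class FieldScoreSketch {

    // Classic Lucene-style term-frequency factor: grows with sqrt(freq).
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    // Capped alternative: a field that contains the term counts once,
    // however many times the term is repeated.
    static double cappedTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    // Simplified per-field score: field boost times term-frequency factor.
    static double score(double boost, int freq, boolean capTf) {
        double tf = capTf ? cappedTf(freq) : sqrtTf(freq);
        return boost * tf;
    }

    public static void main(String[] args) {
        // "oracle" 25 times in bio (boost 1.0) vs. once in company (boost 2.0)
        System.out.println("sqrt tf:   bio=" + score(1.0, 25, false)
                + " company=" + score(2.0, 1, false));
        System.out.println("capped tf: bio=" + score(1.0, 25, true)
                + " company=" + score(2.0, 1, true));
    }
}
```

With sqrt(freq), 25 repeats in bio give 5.0 against 2.0 for the boosted company field, so the stuffed profile still wins. With the cap, bio drops to 1.0 and the boost decides. In Lucene itself, a cap like this would presumably mean overriding tf() in a custom Similarity, which is an assumption about how it would be wired in rather than something tested here.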
 
Not sure if this makes sense, but I'm curious whether anyone has ideas or has 
done something like this.
 
Regards!
-Joe
