Hey-
I have an indexer at my company that I wrote a while back that indexes database
content (users and their profiles)...one of the next requirements of the project
is to avoid 'spam' in hits. For example, if I search for "oracle" and it appears
in 25 places in someone's bio field...and another person has it in one place in
his company field, the profile with 25 occurrences will of course score higher.
Unfortunately, people who know the system know that the more often certain
keywords appear in your user profile, the higher you will be on the list. I was
thinking I can do one of two things:
1. Work with the Lucene scoring algorithm to lower scores in certain fields (and
boost others)...this would work, but the boost seems to be such a small factor
in scoring that in some cases it won't matter. (If I boost company to 2.0 and
bio to 1.0, a profile with xxx hits in bio can still come out first in score.)
2. Use Classifier4J (http://classifier4j.sourceforge.net/)...I could use the same
idea as a mail filter and train the Bayesian classifier to treat certain
words as spam...then just index the summary. Just throwing this out
there...not even sure that it will work...
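
To make the problem with option 1 concrete, here is a toy sketch in plain Java
(no Lucene classes). It assumes a simplified score of fieldBoost * tf(freq) and
ignores idf and length normalization, so it is not Lucene's real formula. As far
as I understand, the default similarity dampens term frequency with something
like a square root, which is what sqrtTf stands in for. The class and method
names (TfCapSketch, sqrtTf, cappedTf) are made up for illustration.

```java
// Toy model only: score = fieldBoost * tf(freq).
// Real Lucene scoring has more factors (idf, length norm) that are ignored here.
class TfCapSketch {

    // Dampened term frequency: square root of the raw occurrence count.
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    // Capped term frequency: occurrences beyond maxCount add nothing.
    static double cappedTf(int freq, int maxCount) {
        return Math.min(freq, maxCount);
    }

    public static void main(String[] args) {
        double bioBoost = 1.0;
        double companyBoost = 2.0;
        int bioHits = 25;     // "oracle" stuffed into the bio field
        int companyHits = 1;  // "oracle" once in the company field

        // With sqrt dampening, the stuffed bio still beats the boosted company field.
        double bioSqrt = bioBoost * sqrtTf(bioHits);
        double companySqrt = companyBoost * sqrtTf(companyHits);
        System.out.println("sqrt tf:   bio=" + bioSqrt + " company=" + companySqrt);

        // With a cap of 1, repeats stop paying off and the boost decides the order.
        double bioCapped = bioBoost * cappedTf(bioHits, 1);
        double companyCapped = companyBoost * cappedTf(companyHits, 1);
        System.out.println("capped tf: bio=" + bioCapped + " company=" + companyCapped);
    }
}
```

In this toy model the bio profile scores 5.0 against 2.0 with the square-root
tf, but 1.0 against 2.0 once the count is capped. In Lucene the equivalent
change would presumably go into a custom Similarity rather than the boost; I
have not checked the exact API for that.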
Not sure if this makes sense...but I'm curious whether anyone has ideas, or has
done something like this.
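
For option 2, this is roughly the Bayesian idea I have in mind, as a hand-rolled
naive Bayes sketch. It is NOT Classifier4J's actual API; the class
ProfileSpamSketch, its methods, and the training strings are all made up for
illustration. A positive result from spamLogOdds means the text looks more like
the "stuffed" training examples than the normal ones.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hand-rolled multinomial naive Bayes sketch; not Classifier4J's API.
class ProfileSpamSketch {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamTokens = 0, hamTokens = 0; // total tokens seen per class
    private int spamDocs = 0, hamDocs = 0;     // training texts seen per class

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    // Record one labelled example (spam = keyword-stuffed profile text).
    void train(String text, boolean spam) {
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            if (spam) { spamCounts.merge(w, 1, Integer::sum); spamTokens++; }
            else      { hamCounts.merge(w, 1, Integer::sum);  hamTokens++; }
        }
        if (spam) spamDocs++; else hamDocs++;
    }

    // Log-odds of spam vs. not spam, with add-one (Laplace) smoothing.
    double spamLogOdds(String text) {
        Set<String> vocabulary = new HashSet<>(spamCounts.keySet());
        vocabulary.addAll(hamCounts.keySet());
        int v = Math.max(1, vocabulary.size());

        double logOdds = Math.log((spamDocs + 1.0) / (hamDocs + 1.0));
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            double pSpam = (spamCounts.getOrDefault(w, 0) + 1.0) / (spamTokens + v);
            double pHam = (hamCounts.getOrDefault(w, 0) + 1.0) / (hamTokens + v);
            logOdds += Math.log(pSpam / pHam);
        }
        return logOdds;
    }

    public static void main(String[] args) {
        ProfileSpamSketch c = new ProfileSpamSketch();
        c.train("oracle oracle oracle oracle java oracle dba oracle oracle", true);
        c.train("I worked as a database administrator at a bank for five years using oracle", false);

        System.out.println(c.spamLogOdds("oracle oracle oracle oracle oracle") > 0);
        System.out.println(c.spamLogOdds("database administrator at a bank") > 0);
    }
}
```

With only two training texts this proves nothing, of course; the open question
is whether a classifier like this would separate stuffed bios from honest ones
on real profile data.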
Regards!
-Joe