Thanks for your response, Doug. 

> seriously.  We're definitely in need of contributions in this area.
> Some simple page analysis heuristics would be a good start.  After
> that, some link-graph heuristics would probably be useful.

In 1999-2000 I spent time working on heuristics for repeated
keywords, invisible or very small text, malicious link farms, and the
like. It was very frustrating, because it created a Darwinian
environment for spam: every round of heuristics taught spammers what
worked, since they could look at the results, reverse-engineer the
rules, and then copy and propagate the surviving techniques.
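
For illustration, the simplest of those page heuristics, a
keyword-stuffing check, might look something like the sketch below.
The class name and the 20% cutoff are arbitrary placeholders of mine,
not anything from a real system:

import java.util.HashMap;
import java.util.Map;

/** Toy page-analysis heuristic: flag pages where a single term
 *  accounts for an implausibly large share of all tokens. */
public class KeywordStuffingHeuristic {

    // Hypothetical cutoff: one term over 20% of tokens is suspicious.
    private static final double MAX_TERM_RATIO = 0.20;

    public static boolean looksStuffed(String pageText) {
        String[] tokens = pageText.toLowerCase().split("\\W+");
        if (tokens.length < 50) return false; // too short to judge
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            int c = counts.merge(t, 1, Integer::sum);
            if (c > max) max = c;
        }
        return (double) max / tokens.length > MAX_TERM_RATIO;
    }
}

Of course, as soon as a check like this ships, spammers simply keep
their term ratios just under the cutoff, which is exactly the arms
race I mean.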

Instead of using heuristics for text and formatting (both on the page
and in links), I believe a good approach could be to implement a
feature-based Bayesian filter, using spam found by the search engine
as feedback for tuning the probability tables (see
crm114.sourceforge.net, for example). The same approach could be used
to identify languages for language-restricted queries; textual spam
might even be treated as one more language. It could be worth
experimenting with CRM114 directly, since the source code is already
there; if it works, a subset of it could be implemented in Java as a
plugin for Nutch.
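
To sketch the idea, here is a toy naive-Bayes scorer of my own; it is
not CRM114's actual matching algorithm, just the simplest possible
illustration of feedback-tuned probability tables:

import java.util.HashMap;
import java.util.Map;

/** Toy naive-Bayes spam scorer. In practice the probability tables
 *  would be tuned continuously from spam/ham pages reported back by
 *  the search engine. */
public class BayesianSpamFilter {

    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamDocs = 0, hamDocs = 0;

    /** Record the features of a page known to be spam or not. */
    public void train(Iterable<String> features, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spamCounts : hamCounts;
        for (String f : features) counts.merge(f, 1, Integer::sum);
        if (isSpam) spamDocs++; else hamDocs++;
    }

    /** Log-odds that a page is spam; > 0 means "more likely spam".
     *  Laplace smoothing keeps unseen features from zeroing it out. */
    public double spamLogOdds(Iterable<String> features) {
        double logOdds = Math.log((spamDocs + 1.0) / (hamDocs + 1.0));
        for (String f : features) {
            double pSpam =
                (spamCounts.getOrDefault(f, 0) + 1.0) / (spamDocs + 2.0);
            double pHam =
                (hamCounts.getOrDefault(f, 0) + 1.0) / (hamDocs + 2.0);
            logOdds += Math.log(pSpam / pHam);
        }
        return logOdds;
    }
}

The features need not be words: font-size anomalies, link density,
hidden-text markers and so on could all go into the same tables,
which is also what would make the language-identification variant
possible.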

Perhaps the most insidious technique used by spammers is cloaking:
showing a good page to the crawler and a different, spammy one to the
user. For this, the only solution I can see is distributed crawling
with "inconspicuous" user-agent strings, so that the web server
cannot identify the crawler by its IP address range or by its UA
string. Grub (www.grub.org, acquired last year by LookSmart but also
open source) might be something to look into.
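
Once such fetches exist, detecting cloaking could be as simple as
comparing what the server returns under two identities. A rough
sketch of my own follows; the user-agent strings are just examples,
both requests here still come from the same IP address (which is why
the distributed part matters), and dynamic pages will also trigger
mismatches, so this would only be a first-pass signal:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.security.MessageDigest;

/** Fetch the same URL under two User-Agent strings and compare
 *  content digests; a mismatch hints at cloaking (or merely at a
 *  dynamic page). */
public class CloakingProbe {

    static byte[] fetchDigest(String url, String userAgent)
            throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) md5.update(buf, 0, n);
        }
        return md5.digest();
    }

    public static boolean possiblyCloaked(String url) throws Exception {
        byte[] asCrawler = fetchDigest(url, "ExampleCrawler/0.1");
        byte[] asBrowser = fetchDigest(url, "Mozilla/5.0 (Windows; U)");
        return !MessageDigest.isEqual(asCrawler, asBrowser);
    }
}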

> You propose an interesting approach which might work well.  Are you
> interested in developing an implementation of it?

The approach I proposed is one idea that might become interesting in
the future, once a Nutch installation handles millions of daily
queries (a good problem to have, really). I know ideas are a dime a
dozen. I am interested in implementing something, but I don't have
much time and I'm still trying to figure out the different components
of Nutch. I'll try to write down what I learn about Nutch and make it
available; I believe good design documents would lower the barrier to
entry for new developers interested in contributing.

Diego.

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> Diego,
> 
> Indeed, fighting spammers is something that Nutch has yet to address
> seriously.
> 
> Doug
> 
> Diego Basch wrote:
> > I just ran the Nutch tutorial and browsed the code. It looks quite
> > impressive, congratulations.
> > 
> > I scanned the mailing list archive and I did not find any
> > discussions concerning what I believe is one of the most important
> > problems for an open source search effort: adversarial information
> > retrieval (i.e. fighting search engine spammers).
> > 
> > I thought about the problem a little. It seems to me that making
> > the relevance code available for public scrutiny creates an
> > unprecedented opportunity for spammers to have unwanted pages
> > climb to the top of the results.
> > 
> > I think one possible solution to this problem could be to
> > implement a moderation system for urls inspired by the one used by
> > Slashdot. A possible implementation could work as follows:
> > 
> > - N initial meta-moderators give moderation access to active
> > members of the community. The job of these moderators is to demote
> > results clearly aimed to trick the search engine. I don't see a
> > point in promoting good urls because a result that is good for a
> > certain query could be mediocre or bad for another one. This is
> > best left to the relevance algorithms.
> > 
> > - The cumulative effect of negative points hides a url from view
> > unless it is one of very few matches for a narrowly-specified
> > query. Beyond a certain threshold, a url is mostly (if not
> > completely) invisible and can be safely blacklisted.
> > 
> > - Meta-moderators rate the moderations as fair or unfair, which in
> > the long run promotes or demotes moderators.
> > 
> > Some issues:
> > 
> > In order for a scheme like this to work, the number of moderators
> > must be large, perhaps on the order of one percent of the user
> > base (just a guess). Moderators need to be active users of the
> > search engine with an incentive to demote bad results whenever
> > they see them, perhaps several times a day.
> > 
> > The number of moderators may pose an organizational problem. There
> > must be methods in place to quickly identify and isolate
> > "traitors" and to prevent traitor-bots from running unlimited
> > query-moderation operations.
> > 
> > The moderation network must grow with the user base. There needs
> > to be a way to add new moderators to the system as needed. Perhaps
> > the first [threshold + random number] moderations could be used as
> > a character test (it should not be obvious to the moderator when
> > his/her moderations start to count). There could be a ranking of
> > moderation levels (from novice to ultra-trusted) that increase the
> > weight of a moderator's demotions.
> > 
> > It might be interesting to quantify some hypotheses and run a
> > simulation of this scheme to see if it could be viable (it might
> > be easier to just implement it). At the very least, this scheme
> > would wipe out the most visible spam, since the opportunities for
> > moderation are proportional to the number of impressions of a
> > result.
> > 
> > Slashdot and Wikipedia are two successful examples of
> > self-cleaning communities. Maybe Nutch can be another one. At
> > least, I hope this idea can generate a discussion on how to deal
> > with AIR.
> > 
> > Diego Basch
> > (search developer, four years at Inktomi, one at
> > Wisenut/LookSmart)
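
P.S. To make the moderation scheme quoted above concrete, here is a
rough sketch of the bookkeeping it implies. Every name, threshold,
and weight below is a hypothetical placeholder of mine, and the "one
of very few matches" exception would have to live in the query path
rather than in this class:

import java.util.HashMap;
import java.util.Map;

/** Toy bookkeeping for the proposed url-moderation scheme:
 *  moderators demote urls; enough weighted demotions hide a url,
 *  and past a higher threshold it becomes a blacklist candidate. */
public class ModerationLedger {

    private static final int HIDE_THRESHOLD = 10;      // hide from results
    private static final int BLACKLIST_THRESHOLD = 50; // safe to blacklist

    private final Map<String, Integer> demotionPoints = new HashMap<>();
    // Moderator weight: novice = 1 up to ultra-trusted = higher.
    private final Map<String, Integer> moderatorWeight = new HashMap<>();

    public void demote(String url, String moderatorId) {
        int weight = moderatorWeight.getOrDefault(moderatorId, 1);
        demotionPoints.merge(url, weight, Integer::sum);
    }

    /** Meta-moderation: fair moderations raise a moderator's weight,
     *  unfair ones lower it; a floor of 0 isolates "traitors". */
    public void rateModerator(String moderatorId, boolean fair) {
        moderatorWeight.merge(moderatorId, fair ? 1 : -1, Integer::sum);
        if (moderatorWeight.get(moderatorId) < 0)
            moderatorWeight.put(moderatorId, 0);
    }

    public boolean isHidden(String url) {
        return demotionPoints.getOrDefault(url, 0) >= HIDE_THRESHOLD;
    }

    public boolean isBlacklistCandidate(String url) {
        return demotionPoints.getOrDefault(url, 0) >= BLACKLIST_THRESHOLD;
    }
}

A simulation could then just drive demote() and rateModerator() with
synthetic spammer and traitor populations and watch how quickly bad
urls cross the thresholds.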