Stephen Sutherland wrote:

> How do you guys create your web crawler in such a way
> that it would step over bot-bait pages like Wpoison?
> 
> Do you simply include them in a list of URLs to avoid?


If it's a large crawl, then that sort of manual list-keeping is
untenable.  Mind you, there might be considerable mileage in not
crawling URLs matching "*wpoison*", or at least massively lowering
their priority.
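
For what it's worth, that de-prioritization can live right in the
frontier scoring.  A minimal sketch in Python, assuming a hypothetical
crawl_priority() hook (the pattern list and the scaling factor are my
own choices, nothing wpoison-specific):

        import fnmatch

        # Hypothetical: URL patterns we treat as probable spider traps.
        TRAP_PATTERNS = ["*wpoison*"]

        def crawl_priority(url, base_priority=1.0):
            # Near-zero priority for suspected trap URLs; everything
            # else keeps its normal score.
            if any(fnmatch.fnmatch(url.lower(), p) for p in TRAP_PATTERNS):
                return base_priority * 0.01
            return base_priority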

In this particular case I'd suggest fetching the page twice
in relatively short succession.  The two copies of a wpoison-generated
page will be massively different -- a sure sign that the web site is
fooling with you.  And you wouldn't have to fetch every page twice.
Since wpoison conveniently generates a practically infinite URL
space, just pick how often you want to check pages and you'll catch
the shifty pages within N attempts.  I've pondered this heuristic
in the general case as a simple method for spotting the variable
part of any web page.
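
Something like this, as a rough sketch of the double-fetch check (the
similarity threshold and the use of difflib are my choices, purely to
illustrate the idea):

        import difflib
        import urllib.request

        def looks_unstable(url, threshold=0.90):
            # Fetch the page twice in quick succession; if the two
            # copies are wildly different, call it a probable trap.
            def fetch():
                with urllib.request.urlopen(url, timeout=30) as resp:
                    return resp.read().decode("utf-8", errors="replace")

            first, second = fetch(), fetch()
            ratio = difflib.SequenceMatcher(None, first, second).ratio()
            return ratio < threshold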

Another idea is to flag "directories" that have too many entries.
The one wpoison script I looked at appeared to generate pages like this:

        .../wpoison/random-word-1
        .../wpoison/random-word-2
        .../wpoison/random-word-3

and so on.  A robot could notice that the ".../wpoison" directory is
very large and therefore should be dropped or at least lowered in
retrieval priority.  And, again, we might generally believe that
large directories are a bad sign (e.g., log directories).
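
Here's a sketch of that large-directory heuristic -- counting distinct
children per "directory" prefix as URLs are discovered (the threshold
of 1000 is arbitrary, and a real crawl would want this persisted
rather than held in memory):

        from collections import defaultdict
        from urllib.parse import urlsplit
        import posixpath

        children_seen = defaultdict(set)   # prefix -> child names

        def note_url(url, max_children=1000):
            # Record a discovered URL; return True if its parent
            # "directory" has grown suspiciously large.
            path = urlsplit(url).path or "/"
            parent, name = posixpath.split(path)
            if name:
                children_seen[parent].add(name)
            return len(children_seen[parent]) > max_children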


You may be able to detect traps based on content, but that seems a
little dodgy.

I'd hope there are more sophisticated spider traps out there.  wpoison
could stand many improvements.  Even just seeding the random number
generator based on a hash of $PATH_INFO would be a big help.  Not to
mention that the default installation is a huge tip-off.  I guess if
you're really serious you'd stick the bogus e-mail addresses into your
normal web pages as invisible text.
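
To be concrete about that seeding point: a trap script that derives
its seed from the request path will serve the same "random" page on
every fetch of a given URL, which neatly defeats the fetch-twice check
above.  A sketch in Python (wpoison itself is a CGI script, so this is
just the idea, not its code):

        import hashlib
        import os
        import random

        def rng_for_request():
            # Seed from PATH_INFO so repeated fetches of the same trap
            # URL produce an identical page.
            path_info = os.environ.get("PATH_INFO", "/")
            digest = hashlib.sha1(path_info.encode()).digest()
            return random.Random(int.from_bytes(digest[:8], "big"))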

                        -- George

