Stephen Sutherland wrote:
> How do you guys create your web crawler in such a way
> that it would step over bot bait pages like WSPosion?
>
> Do you simply include them in a list of urls to avoid?

If it's a large crawl then this sort of manual involvement is untenable. Mind you, there might be considerable mileage in not crawling, or at least massively lowering the priority of, URLs matching "*wpoison*".

In this particular case I'd suggest fetching the page twice in relatively short succession. wpoison-generated pages will be massively different -- a sure sign that the web site is fooling with you. And you wouldn't have to fetch every page twice. Since wpoison conveniently generates a practically infinite URL space, just pick how often you want to double-check pages (say, one page in N) and you'll catch the shifty pages within roughly N fetches. I've pondered this heuristic in the general case as a simple method for spotting the variable part of any web page.

Another idea is to flag "directories" that have too many entries. The one wpoison script I looked at appeared to generate pages like this:

    .../wpoison/random-word-1
    .../wpoison/random-word-2
    .../wpoison/random-word-3

and so on. A robot could notice that the ".../wpoison" directory is very large and therefore should be dropped or at least lowered in retrieval priority. And, again, we might generally believe that large directories are a bad sign (e.g., log directories).

You may be able to detect traps based on content, but that seems a little dodgy.

I'd hope there are more sophisticated spider traps out there. wpoison could stand many improvements. Even just seeding the random number generator based on a hash of $PATH_INFO would be a big help, since the same URL would then return the same page every time and the fetch-twice check would see nothing odd. Not to mention that the default installation is a huge tip-off. I guess if you're really serious you stick these bogus e-mail addresses into your normal web pages but as invisible text.

-- George

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]).
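
The two crawler-side heuristics from the message above (fetch a page twice and compare, and count entries per "directory") could be sketched roughly as follows. This is only an illustration under assumptions, not anyone's actual crawler: the function and class names, both threshold values, and the example.com URLs in the usage note are hypothetical, and fetching the pages is left to the caller.

```python
import difflib
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed thresholds (not from the original post); tune them for a real crawl.
SIMILARITY_FLOOR = 0.5   # below this, two fetches count as "massively different"
MAX_DIR_ENTRIES = 1000   # above this, a directory counts as suspiciously large


def similarity(body_a: str, body_b: str) -> float:
    """Word-level similarity of two fetches of the same URL, from 0.0 to 1.0."""
    matcher = difflib.SequenceMatcher(
        None, body_a.split(), body_b.split(), autojunk=False
    )
    return matcher.ratio()


def looks_regenerated(body_a: str, body_b: str,
                      floor: float = SIMILARITY_FLOOR) -> bool:
    """True if two fetches of one URL differ enough to suggest random content."""
    return similarity(body_a, body_b) < floor


class DirectoryFanout:
    """Counts distinct entries seen under each (host, directory) pair."""

    def __init__(self, limit: int = MAX_DIR_ENTRIES) -> None:
        self.limit = limit
        self._entries = defaultdict(set)

    def add(self, url: str) -> bool:
        """Record a URL; return True once its directory exceeds the limit."""
        parts = urlsplit(url)
        key = (parts.netloc, posixpath.dirname(parts.path))
        self._entries[key].add(posixpath.basename(parts.path))
        return len(self._entries[key]) > self.limit
```

A crawler might call looks_regenerated() on only one URL in N, fetching that URL twice, and call DirectoryFanout.add() on every discovered link (for instance http://example.com/wpoison/random-word-1), dropping or deprioritizing a directory once add() returns True. Note that the comparison will also flag legitimately dynamic pages, so lowering priority is probably safer than an outright ban.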