On Thursday 01 Nov 2007, Claude Schneegans wrote:
> Good, but apparently their site is closed for the time being.
Seems to be running here.
--
Tom Chiverton
Helping to advantageously bully ubiquitous e-business
on: http://thefalken.livejournal.com
On Wednesday 31 Oct 2007, [EMAIL PROTECTED] wrote:
> Apparently they publish this very simple-to-parse blacklist every day.
Project Honey Pot have a DNS-based blacklist system too.
You construct a hostname based on your API key, the IP to query, and a
standard TLD, and if it resolves you don't let the visitor in.
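Something like this minimal sketch is the idea, assuming the usual http:BL
naming of key.reversed-IP.dnsbl.httpbl.org (the API key below is a
placeholder):

<!--- Build the http:BL hostname for the current visitor and see if it resolves --->
<cfset octets = listToArray(cgi.remote_addr, ".")>
<cfset lookup = "yourapikeyhere." & octets[4] & "." & octets[3] & "."
                & octets[2] & "." & octets[1] & ".dnsbl.httpbl.org">
<cfset isListed = false>
<cftry>
    <!--- If the name resolves at all, the IP is on the list --->
    <cfset answer = createObject("java", "java.net.InetAddress").getByName(lookup)>
    <cfset isListed = true>
    <cfcatch type="any">
        <!--- An UnknownHostException here means: not listed --->
    </cfcatch>
</cftry>
<cfif isListed><cfabort showError="Access denied"></cfif>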
> Project Honey Pot have a DNS-based blacklist system too.
Good, but apparently their site is closed for the time being.
Is their blacklist system still working?
--
REUSE CODE! Use custom tags;
See http://www.contentbox.com/claude/customtags/tagstore.cfm
On Saturday 27 Oct 2007, [EMAIL PROTECTED] wrote:
> So is it really working?
I use Project Honey Pot's service (you hide some bot-trap links in your
pages, and they then look for spam coming from the addresses that visited
those links, subtracting known good IPs).
I have my own DNS, so I have donated an MX record as well.
> I use Project Honey Pot's service
Thanks, I'll be sure to have a look.
My own system is already fairly advanced, though:
- I automatically detect robots when they fall into the trap.
- Known robots only see text; no image is displayed (see the sketch below).
- I can verify the host, and if anything looks suspicious, flag it.
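A minimal sketch of that text-only behaviour, assuming a hypothetical
application.trappedIPs struct that the trap page fills with offending
addresses (articleText and imageUrl are made-up names):

<cfif structKeyExists(application.trappedIPs, cgi.remote_addr)>
    <!--- Known robot: plain text only, no image tag at all --->
    <cfoutput><p>#articleText#</p></cfoutput>
<cfelse>
    <!--- Normal visitor: image plus text --->
    <cfoutput><img src="#imageUrl#" alt="photo"><p>#articleText#</p></cfoutput>
</cfif>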
I don't see the problem: users with Web Accelerator get their stuff from
Google's cache, not from my server, so I don't even hear about them.
Anyway, our sites are dynamic, we publish news every day, so robots are
asked not to use the cache anyway.
No, Google Web Accelerator doesn't rely on Google's cache; it prefetches
links from your server.
Are you sure about that?
It uses client software installed on the user's computer, as well as
data caching on Google's servers...
( http://en.wikipedia.org/wiki/Google_Web_Accelerator )
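Either way, if prefetching is the concern, Google Web Accelerator was
widely reported to mark its prefetch requests with an "X-moz: prefetch"
header, so a trap page could simply ignore those hits. A sketch, assuming
that reported behaviour:

<!--- Assumption: prefetchers send "X-moz: prefetch"; don't count them as trap hits --->
<cfset headers = getHttpRequestData().headers>
<cfif structKeyExists(headers, "X-moz") AND headers["X-moz"] EQ "prefetch">
    <cfabort>
</cfif>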
> If it gets to a page linked from a 1-pixel blank image, it
> cannot be a human browser.
Sure it can. I can think of two examples off the top of my head - someone
with Google Web Accelerator installed, or someone using Lynx.
Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
> Who needs to be scanned by oddities like disco/Nutch-0.9
> (experimental crawler; [EMAIL PROTECTED])
Nutch is part of Lucene, I think. So, you may well need to be scanned by
that.
Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
> Nutch is part of Lucene, I think. So, you may well need to be scanned
> by that.
I'll make up my mind when they put a correct web address in their user
agent and I can see for myself why they are crawling my sites.
The word "experimental" and just an email address don't look too
reassuring.
> someone with Google Web Accelerator installed,
> I don't see the problem: users with Web Accelerator get their stuff from
> Google's cache, not from my server, so I don't even hear about them.
> Anyway, our sites are dynamic, we publish news every day, so robots are
> asked not to use the cache anyway.
Damn, that's a problem. Have you confirmed that the robots have
requested and successfully retrieved robots.txt (perhaps by searching
the webserver logs)?
On 10/27/07, Claude Schneegans [EMAIL PROTECTED] wrote:
> Hi,
> I tried to implement a trap for bad robots, I mean those that do not
> honor the robots.txt file.
> Have you confirmed that the robots have
> requested and successfully retrieved robots.txt (perhaps by searching
> the webserver logs)?
No, I do not trace the reading of robots.txt.
In principle, a good robot should read and honor it.
Obviously, there is no absolutely good robot.
I use Copernic for searching. But a search on the string "This page
was illegitimately indexed" reveals that most
legitimate robots have found it: Netscape,
Google, AOL, CompuServe... you name it.
So is it really working?
IMO it is not safe to ban any robot on that basis alone.
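That said, tracing it would not be hard if the webserver mapped
/robots.txt to a CF template; a hypothetical sketch (the mapping and the
log file name are made up):

<!--- robotsTxt.cfm: log every read, then emit the real file --->
<cflog file="robotsTxt" text="robots.txt read by #cgi.remote_addr# (#cgi.http_user_agent#)">
<cfcontent type="text/plain" reset="true"><cfoutput>User-agent: *
Disallow: /noBots.cfm
Disallow: /bulleltin
Disallow: /admin</cfoutput>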
> I can tell you with absolute certainty that Google obeys robots.txt.
I'm pretty sure they do. But we all know that sometimes an HTTP request
is lost somewhere in cyberspace.
If for any reason the robot does not receive the file, it will probably
act as if there is none.
Sent: Saturday, October 27, 2007 5:55 PM
To: CF-Talk
Subject: Re: SOT but... any one using a bot trap?
> Define 'bad'.
- bots that disobey robots.txt,
- bots that do not even offer any search service to the visitors searching
for you: useless bots,
- bots that just harvest images (just Google "PicScout AND GettyImages")
and steal a huge amount of your bandwidth.
Hi,
I tried to implement a trap for bad robots, I mean those that do not
honor the robots.txt file.
Here is the robots.txt file:
User-agent: *
Disallow: /noBots.cfm
Disallow: /bulleltin
Disallow: /admin
In /noBots.cfm, for the time being, I just display this:
This page was illegitimately indexed by
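Concretely, the trap is just a disallowed page reached through an
invisible link; a sketch with illustrative names (blank.gif and the log
file are made up):

<!--- Somewhere in a normal page: a link no human would follow,
      hidden behind a 1x1 blank image --->
<a href="/noBots.cfm"><img src="/images/blank.gif" width="1" height="1" border="0" alt=""></a>

<!--- noBots.cfm itself: show the marker string and record who fetched it --->
<cflog file="botTrap" text="trap hit by #cgi.remote_addr# (#cgi.http_user_agent#)">
<cfoutput>
<p>This page was illegitimately indexed by #HTMLEditFormat(cgi.http_user_agent)#</p>
</cfoutput>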