On Thursday 01 Nov 2007, Claude Schneegans wrote:
> Good, but apparently their site is closed for the time being.
Seems to be running here.
--
Tom Chiverton
Helping to advantageously bully ubiquitous e-business
on: http://thefalken.livejournal.com
***
>>Project Honey Pot have a DNS-based blacklist system too.
Good, but apparently their site is closed for the time being.
Is their blacklist system still working?
--
___
REUSE CODE! Use custom tags;
See http://www.contentbox.com/claude/customtags/tagstore.cfm
On Wednesday 31 Oct 2007, [EMAIL PROTECTED] wrote:
> Apparently they publish this very simple to parse black list every day.
Project Honey Pot have a DNS-based blacklist system too.
You construct a hostname based on your API key, the IP to query, and standard
TLD, and if it resolves you don't let
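The lookup described above can be sketched as follows (a hedged sketch in Python rather than CFML; the zone name dnsbl.httpbl.org and the key-then-reversed-octets ordering follow Project Honey Pot's published http:BL convention, and the API key here is a placeholder):

```python
import socket

HTTPBL_ZONE = "dnsbl.httpbl.org"  # Project Honey Pot's http:BL DNS zone

def build_httpbl_query(api_key, visitor_ip):
    """Build the hostname to resolve: API key, then the IP octets reversed."""
    reversed_ip = ".".join(reversed(visitor_ip.split(".")))
    return f"{api_key}.{reversed_ip}.{HTTPBL_ZONE}"

def is_listed(api_key, visitor_ip):
    """If the name resolves, the IP is listed; NXDOMAIN means it is not."""
    try:
        socket.gethostbyname(build_httpbl_query(api_key, visitor_ip))
        return True
    except socket.gaierror:
        return False
```

A listed IP resolves to an address whose octets encode threat details; a simple yes/no check like this only tests whether the name resolves at all.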
>>I use Project Honey Pot's service
Thanks, I'll be sure to have a look.
My own system is already fairly advanced, though:
- I automatically detect robots when they fall into the trap.
- Known robots see only text; no images are displayed.
- I can verify the host and, if anything looks suspicious, flag
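The trap-then-degrade behaviour listed above could be sketched like this (a minimal in-memory sketch; the path /trap.cfm and the response modes are hypothetical names, not the poster's actual implementation):

```python
# IPs that have followed the invisible trap link
trapped_ips = set()

def handle_request(path, ip):
    """Classify a request; flag any IP that fetches the hidden trap page."""
    if path == "/trap.cfm":
        trapped_ips.add(ip)      # only a robot follows a link hidden behind a 1px image
        return "trapped"
    if ip in trapped_ips:
        return "text-only"       # known robots see text only, no images
    return "full"
```

In practice the trapped set would be persisted (database or shared scope) rather than held in memory.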
On Saturday 27 Oct 2007, [EMAIL PROTECTED] wrote:
> So is it really working ?
I use Project Honey Pot's service (hide some bot trap links in pages, and then
they look for spam coming from people who visited those links, subtracting
known good IPs).
I have my own DNS so have donated an MX record
>>No, Google Web Accelerator doesn't rely on Google's cache, it prefetches
links from your server.
Are you sure about that?
"It uses client software installed on the user's computer, as well as
data caching on Google's servers..."
( http://en.wikipedia.org/wiki/Google_Web_Accelerator )
"Sending
> I don't see the problem, users with Web accelerator get their stuff from
> Google's cache, not from my server, so I don't even hear about them.
> Anyway, our sites are dynamic, we publish news every day, so robots are
> asked to not use cache anyway.
No, Google Web Accelerator doesn't rely on Google's cache, it prefetches
links from your server.
>>someone with Google Web Accelerator installed,
I don't see the problem, users with Web accelerator get their stuff from
Google's cache,
not from my server, so I don't even hear about them.
Anyway, our sites are dynamic, we publish news every day, so robots are
asked to not use cache anyway.
>>Nutch is part of Lucene, I think. So, you may well need to be scanned
by that.
I'll make up my mind when they put a correct web address in their user
agent, so I can see for myself why they are crawling my sites.
The word "experimental" and just an email address doesn't look too serious.
> Who needs to be scanned by oddities like "disco/Nutch-0.9
> (experimental crawler; [EMAIL PROTECTED])"
Nutch is part of Lucene, I think. So, you may well need to be scanned by
that.
Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
> If it gets to a page hrefed on a 1 pixel blank image, it
> cannot be a human browser.
Sure it can. I can think of two examples off the top of my head - someone
with Google Web Accelerator installed, or someone using Lynx.
Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
>>Define 'bad'.
- bots that disobey robots.txt,
- bots that do not even offer any search service for visitors searching
for you, useless bots,
- bots that just harvest images (just Google "PicScout" and "Getty Images")
and steal a huge amount of your bandwidth,
>>If I was a 'bad' bot and you block
From: Claude Schneegans [mailto:[EMAIL PROTECTED]]
Sent: Saturday, October 27, 2007 5:55 PM
To: CF-Talk
Subject: Re: SOT but... any one using a bot trap?
>>I can tell you with absolute certainty that Google obeys robots.txt.
I'm pretty sure they do.
But we all know that sometimes, an HTTP request is lost somewhere in
cyberspace.
If for any reason the robot does not receive the file, it will probably
act as if there is none.
Only once will s
> But making a search on the string "This page
> was illegitimately indexed" reveals that most
> legitimate robots have found it: Netscape,
> Google, AOL, Compuserve,... you name it.
>
> So is it really working ?
> IMO it is not safe to ban any robot on that only
> basis.
I can tell you with absolute certainty that Google obeys robots.txt.
>>Have you confirmed that the robots have
requested and successfully retrieved robots.txt (perhaps search the
logs for the webserver)?
No, I do not trace reads of robots.txt.
In principle, a good robot should read and honor it.
Obviously, there is no absolutely good robot.
I use Copernic for sear
Damn, that's a problem. Have you confirmed that the robots have
requested and successfully retrieved robots.txt (perhaps search the
logs for the webserver)?
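The log check suggested here could look like this (a sketch assuming a Common Log Format access log; the regex and helper name are illustrative, not an existing tool):

```python
import re

# Match "IP ... [date] "GET path ..." status" from a Common Log Format line
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) [^"]*" (\d{3})')

def fetched_robots_txt(log_lines, ip):
    """Return True if the given IP successfully fetched /robots.txt."""
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(1) == ip and m.group(2) == "/robots.txt" \
                and m.group(3) == "200":
            return True
    return False
```

Running this over the webserver logs for each trapped IP would show whether the robot ever retrieved robots.txt before hitting the trap.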
On 10/27/07, Claude Schneegans <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I tried to implement a bad robot trap, I mean those that do not honor
> t