At 11:31 AM 07/03/02 -0800, Nick Arnett wrote:
>> * Write it in Perl (or equivalent).
>
>I suppose it doesn't help with a book on Perl, but I'm re-writing my robots
>in Python and I'm very happy with the way it's going.
I consider Python to fall under "or equivalent" :)
If you tried to send mail to either the robots or km list here in the last
couple of hours, it may have been swallowed by the Black Hole of Ecartis.
I was switching from Listar to the latest build of Ecartis (its new name)
and got into a small nightmare of symbolic links and permissions.
People write spiders that potentially span all/any hosts to harvest email
addresses for the annoying spam, to see if trademarks are being used
illegally, to see if copyrights are being violated, and so on.
At 3:43 PM -0700 3/7/02, Sean M. Burke wrote:
>The usefulness of the single-host spiders is pretty obvious to me.
>But why do people want to write spiders that potentially span all/any hosts?
>(Aside from people who are working for Google or similar.)
People think a robot can be an intelligent agent.
The replies to my request for advice have been very helpful! I'll pick one
and reply to it:
At 10:01 2002-03-07 -0800, Otis Gospodnetic wrote:
>[about my forthcoming book]
>(i.e. I'm a potential customer :)) When will it be published?
It's probably going into tech edit later this month.
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
> Behalf Of Tim Bray
[snip]
> * Write it in Perl (or equivalent).
I suppose it doesn't help with a book on Perl, but I'm re-writing my robots
in Python and I'm very happy with the way it's going.
In <[EMAIL PROTECTED]>, "Sean M. Burke"
<[EMAIL PROTECTED]> writes:
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pass on to my readers?
I've found that image maps, framesets, redirects, funky relative links,
JavaScript links, and dynamic URLs generated from backend systems are the
main problems for robots. Bad HTML, such as unclosed tags, can also
confuse the robot's parsing.
I have written up a checklist for these problems.
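On the relative-link and bad-HTML points, here is a minimal sketch of
tolerant link extraction with LWP and HTML::LinkExtor (the HTML::Parser
underneath copes with unclosed tags); the URL and bot name below are just
placeholders, and JavaScript-generated links are out of reach for any
plain HTML parser:

    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my $ua = LWP::UserAgent->new(agent => 'example-bot/0.1');
    my $response = $ua->get('http://www.example.com/');  # placeholder URL
    die $response->status_line unless $response->is_success;

    # $response->base honors <base href> and any redirects LWP followed,
    # so relative links get resolved against the right URL.
    my @links;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # Catches <a href>, <area href> (image maps), <frame src>
            # (framesets), <img src>, and so on.
            push @links, values %attr;
        },
        $response->base,  # absolutize every link against this base
    );
    $parser->parse($response->decoded_content);
    $parser->eof;

    print "$_\n" for @links;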
That's a curious remark about readers and their misplaced desire for
recursive spiders. A recursive spider allows its user to drill down into a
particular information domain and ultimately exhaust it, if the spider is
capable enough. This is of enormous benefit to the information researcher.
Hi Sean,
You might want to consider exploring the "not yet approved" updated
robots.txt standard, which covers Allow rules and how to apply them in your
spider. This may help raise the level of awareness of the robots.txt
standard. You could also talk about how to use robots.txt with your
spider.
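For the mechanics, WWW::RobotRules (the parser LWP itself uses) is one way
to sketch this; note that older releases only understood Disallow lines, so
whether your installed version also honors the draft's Allow rules is worth
testing. The URLs below are placeholders:

    use strict;
    use WWW::RobotRules;
    use LWP::Simple qw(get);

    my $rules = WWW::RobotRules->new('example-bot/0.1');

    # Fetch and parse the site's robots.txt (placeholder host).
    my $robots_url = 'http://www.example.com/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Check each candidate URL before fetching it.
    my $url = 'http://www.example.com/private/page.html';
    print $rules->allowed($url)
        ? "OK to fetch $url\n"
        : "robots.txt forbids $url\n";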
Excellent. I have a copy of Wong's book at home and like that topic
(i.e. I'm a potential customer :)) When will it be published?
I think lots of people do want to know about recursive spiders, and I bet
some of the most frequent obstacles are issues like queueing and depth-
vs. breadth-first crawling.
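On the queueing point, the depth-versus-breadth choice in Perl comes down
to which end of the to-do array you pull from; a skeleton, with the
fetch-and-parse step stubbed out as a hypothetical extract_links:

    use strict;

    my @todo = ('http://www.example.com/');  # placeholder start URL
    my %seen;

    while (@todo) {
        my $url = shift @todo;  # shift = FIFO = breadth-first;
                                # use pop for LIFO = depth-first
        next if $seen{$url}++;  # never visit the same URL twice
        push @todo, extract_links($url);
    }

    sub extract_links {
        my ($url) = @_;
        # Stub: fetch $url and return the absolutized links found on it,
        # e.g. with LWP plus HTML::LinkExtor as sketched earlier.
        return ();
    }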
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pass on to my readers?
Look at http://www.robotstxt.org/wc/guidelines.html
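Those first two basics, per-host request delays and robots.txt obedience,
are exactly what LWP::RobotUA bundles together; a minimal sketch (bot name,
contact address, and URL are placeholders):

    use strict;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new('example-bot/0.1', 'webmaster@example.com');
    $ua->delay(1);  # wait at least one minute between requests to a host

    # robots.txt is fetched and obeyed automatically; a disallowed URL
    # comes back as a 403 response rather than being fetched.
    my $response = $ua->get('http://www.example.com/');
    print $response->is_success
        ? $response->decoded_content
        : "Couldn't fetch: " . $response->status_line . "\n";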
At 02:51 AM 07/03/02 -0700, Sean M. Burke wrote:
>Aside from basic concepts (don't hammer the server; always obey the
>robots.txt; don't span hosts unless you are really sure that you want to),
>are there any particular bits of wisdom that list members would want me to
>pass on to my readers?
Hi all!
My name is Sean Burke, and I'm writing a book for O'Reilly that is
basically to replace Clinton Wong's now out-of-print /Web Client
Programming with Perl/. In my book draft so far, I haven't discussed
actual recursive spiders (I've only discussed getting a given page).