Hello
>The usefulness of the single-host spiders is pretty obvious to me.
>But why do people want to write spiders that potentially span all/any hosts?
>(Aside from people who are working for Google or similar.)
>
Maybe the better question is, why do people want to write spiders that span many
--- Avi Rappoport <[EMAIL PROTECTED]> wrote:
> At 3:43 PM -0700 3/7/02, Sean M. Burke wrote:
> >The usefulness of the single-host spiders is pretty obvious to me.
> >But why do people want to write spiders that potentially span all/any hosts?
> >(Aside from people who are working for Google o
At 11:31 AM 07/03/02 -0800, Nick Arnett wrote:
>> * Write it in Perl (or equivalent).
>
>I suppose it doesn't help with a book on Perl, but I'm re-writing my robots
>in Python and I'm very happy with the way it's going.
I consider Python to fall under "or equivalent" :)
>> * Consider
People write spiders that potentially span all/any hosts to harvest email
addresses for spam, to check whether trademarks are being used illegally,
to check whether copyrights are being violated, etc.
> The replies to my request for advice have been very helpful! I'll pick one
> and reply
At 3:43 PM -0700 3/7/02, Sean M. Burke wrote:
>The usefulness of the single-host spiders is pretty obvious to me.
>But why do people want to write spiders that potentially span all/any hosts?
>(Aside from people who are working for Google or similar.)
People think a robot can be an intelligent a
The replies to my request for advice have been very helpful! I'll pick one
and reply to it:
At 10:01 2002-03-07 -0800, Otis Gospodnetic wrote:
>[about my forthcoming book]
>(i.e. I'm a potential customer :)) When will it be published?
It's probably going into tech edit later this month. So i
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Tim Bray
[snip]
> * Write it in Perl (or equivalent).
I suppose it doesn't help with a book on Perl, but I'm re-writing my robots
in Python and I'm very happy with the way it's going. Perform
In <[EMAIL PROTECTED]>, "Sean M. Burke"
<[EMAIL PROTECTED]> writes:
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pa
I've found that image maps, framesets, redirects, funky relative
links, JavaScript links, and dynamic URLs generated from backend
systems are the main problems for robots. Bad HTML, such as unclosed
tags, also confuses the robot when it parses a page.
I have written up a checklist for
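To make the point above concrete, here is a minimal sketch (in Python, not from the checklist mentioned) of a link extractor that copes with two of those problem cases, image-map <area> links and frameset <frame> links, and resolves relative URLs against the page's base. It uses the standard library's html.parser, which keeps going past unclosed tags rather than choking on them; the page snippet and URLs are made up for the example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect link targets; html.parser tolerates bad HTML like unclosed tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <a href> and <area href> (image maps)
        if tag in ("a", "area") and "href" in attrs:
            self.links.append(urljoin(self.base_url, attrs["href"]))
        # <frame src> and <iframe src> (framesets)
        elif tag in ("frame", "iframe") and "src" in attrs:
            self.links.append(urljoin(self.base_url, attrs["src"]))

# Deliberately sloppy HTML: nothing is closed, yet parsing still succeeds.
page = '<map><area href="../maps/1.html"><frameset><frame src="nav.html">'
p = LinkExtractor("http://example.com/dir/page.html")
p.feed(page)
print(p.links)
# ['http://example.com/maps/1.html', 'http://example.com/dir/nav.html']
```

The funky-relative-links problem is handled by urljoin, which applies the ".." and base-directory rules so the robot queues absolute URLs only.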
That's a curious remark about readers and their misplaced desire for
recursive spiders. A recursive spider allows its user to drill down into
a particular information domain and ultimately exhaust it if the spider
is capable enough. This is of enormous benefit to the information
researcher look
Hi Sean,
You might want to consider exploring the "not yet approved" updated
robots.txt standard, which covers Allow rules and how to apply them in your
spider. This may help raise awareness of the robots.txt standard. You could
also talk about how to use robots.txt with your
spid
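As a sketch of what honoring those Allow rules looks like in practice: Python's standard urllib.robotparser already understands Allow lines. One wrinkle of that implementation is that the first matching rule wins, so the narrower Allow must precede the broader Disallow. The file contents and URLs below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() takes an iterable of lines, so no network fetch is needed here.
# Allow comes before Disallow because this parser applies the first match.
rp.parse([
    "User-agent: *",
    "Allow: /private/public-report.html",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "http://example.com/private/secret.html"))
print(rp.can_fetch("MyBot", "http://example.com/private/public-report.html"))
```

A well-behaved spider would call can_fetch() on every URL before adding it to the queue, and skip the ones that come back False.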
Excellent. I have a copy of Wong's book at home and like that topic
(i.e. I'm a potential customer :)) When will it be published?
I think lots of people do want to know about recursive spiders, and I
bet one of the most frequent obstacles are issues like: queueing, depth
vs. breadth first crawl
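The queueing and depth-vs-breadth issue mentioned above comes down to one data-structure choice in the crawl loop. A minimal sketch (hypothetical names, and an in-memory link graph standing in for real fetches) where a single deque serves as either a FIFO queue or a LIFO stack:

```python
from collections import deque

def crawl_order(start, graph, breadth_first=True, max_pages=10):
    """Return the visit order over a link-graph dict.
    popleft() = FIFO queue = breadth-first; pop() = LIFO stack = depth-first."""
    frontier = deque([start])
    seen = {start}          # dedupe before enqueueing, not after fetching
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

graph = {"/": ["/a", "/b"], "/a": ["/a1"], "/b": ["/b1"]}
print(crawl_order("/", graph, breadth_first=True))   # ['/', '/a', '/b', '/a1', '/b1']
print(crawl_order("/", graph, breadth_first=False))  # ['/', '/b', '/b1', '/a', '/a1']
```

Breadth-first visits every page at one depth before going deeper, which spreads load and finds shallow pages fast; depth-first exhausts one branch at a time, which is what the "drill down into a domain" use case wants.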
> Aside from basic concepts (don't hammer the server; always obey the
> robots.txt; don't span hosts unless you are really sure that you want to),
> are there any particular bits of wisdom that list members would want me to
> pass on to my readers?
Look at http://www.robotstxt.org/wc/guidelin
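The "don't hammer the server" rule quoted above can be enforced with a small per-host delay gate. This is roughly the service that LWP::RobotUA's delay setting provides in Perl; the Python class below is a hypothetical illustration, not code from the thread:

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Sleep as needed so each host sees at most one request per `delay` seconds."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self.last_hit = {}  # host -> monotonic timestamp of the last request

    def wait(self, url):
        """Block until it is polite to fetch `url`, then record the hit."""
        host = urlparse(url).netloc
        if host in self.last_hit:
            elapsed = time.monotonic() - self.last_hit[host]
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self.last_hit[host] = time.monotonic()

gate = PolitenessGate(delay=2.0)
# gate.wait("http://example.com/a")   # first hit on a host: returns immediately
# gate.wait("http://example.com/b")   # same host again: sleeps ~2s first
```

Keying the delay on the hostname rather than globally matters once a spider spans hosts: it can stay busy fetching from many servers while still never hitting any one of them too often.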