Re: [Robots] Safe parameters for spidering Yahoo message header pages?

2002-08-02 Thread Jeremy C. Reed

On Fri, 2 Aug 2002, Nick Arnett wrote:

> Anyone here figured out what Yahoo will tolerate in terms of spidering its
> message header pages before it blocks the robot's IP address?  Before I
> start testing, I figured I'd see if anyone else here has already done so.
> The duration of the block seems to lengthen, so testing could take a while.
> 
> Sure would be nice if they'd just say what they consider acceptable...

This reminded me of the denial-of-service attacks that hit them (and
others) maybe a year and a half ago. If I recall correctly, I read that
their routers (or firewalls) were upgraded/configured to stop large
numbers of connections.

Maybe the spider can be slowed way down, so it behaves more like a normal
human browsing.
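
For what it's worth, here is a minimal sketch of that idea using the
LWP::UserAgent module from CPAN. The 30-second pause, the user-agent
string, and the way the URLs are supplied are all assumptions for
illustration, not anything Yahoo has said is acceptable.

  #!/usr/bin/perl
  # Throttled fetching: one request at a time with a long pause in
  # between, so the crawler looks more like a person reading pages.
  use strict;
  use warnings;
  use LWP::UserAgent;

  my $delay = 30;                        # seconds between requests (a guess)
  my @urls  = @ARGV;                     # header-page URLs given on the command line

  my $ua = LWP::UserAgent->new;
  $ua->agent('example-slow-spider/0.1'); # identify the robot honestly
  $ua->timeout(30);

  for my $url (@urls) {
      my $response = $ua->get($url);
      if ($response->is_success) {
          print "fetched $url (", length($response->content), " bytes)\n";
      } else {
          warn "failed $url: ", $response->status_line, "\n";
      }
      sleep $delay;                      # the actual politeness knob
  }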

  Jeremy C. Reed



[Robots] Who is "Rumours-Agent"?

2002-07-25 Thread Jeremy C. Reed


I've been crawled a lot by Rumours-Agent from 202.214.69.189 over the past
few days.

Does anyone know what they crawl for?

Thanks,

  Jeremy C. Reed
echo '9,J8HD,fDGG8B@?:536FC5=8@I;C5?@H5B0D@5GBIELD54DL>@8L?:5GDEJ8LDG1' |\
sed ss,s50EBsg | tr 0-M 'p.wBt SgiIlxmLhan:o,erDsduv/cyP'





Re: Programmatically controlling a web form

2001-03-08 Thread Jeremy C. Reed

On Thu, 8 Mar 2001, Warhurst, SI (Spencer) wrote:

> I want to write a program (a robot) to perform a series of actions on a web
> page form (i.e. fill in some fields), submit it, and then, when the resulting
> page is delivered, perform further actions. However, I do not know where to
> begin. For example, how do you read a web page displayed in a browser from a
> separate program, and then how do you interact with it from that program?

Actually, you don't interact with the page itself (unless it is still
dynamic after you have received it).

If the web page form is the same every time, then there is no need to
retrieve it.

But if you do need to fetch the web page, you can use telnet, wget, lynx,
Perl modules, etc. (You didn't tell us what operating system you are
using.)

Then you'll need to write a parser to find the form information.

Then you can use telnet or a Perl module (or another tool) to send the
filled-in form data back, using the GET or POST method, for example.
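
As a rough illustration of those steps in Perl (this assumes the LWP and
HTML::Form modules from CPAN; the URL and the field names are made up for
the example):

  #!/usr/bin/perl
  # Sketch: fetch a page, locate its form, fill in two fields, submit it,
  # and then work with the resulting page. The URL and the field names
  # "username" and "comment" are invented for this example.
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::Form;

  my $ua   = LWP::UserAgent->new;
  my $page = $ua->get('http://www.example.com/feedback.html');
  die "fetch failed: ", $page->status_line unless $page->is_success;

  # HTML::Form does the parsing step; it returns one object per <form>
  # found in the page.
  my @forms = HTML::Form->parse($page->content, $page->base);
  my $form  = $forms[0] or die "no form found";

  $form->value(username => 'jdoe');                 # fill in some fields
  $form->value(comment  => 'submitted by a robot');

  # click() builds the GET or POST request that the form itself specifies.
  my $result = $ua->request($form->click);
  print $result->is_success ? $result->content
                            : "submit failed: " . $result->status_line . "\n";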

> Can anyone point me to a tutorial on this or give me some advice?

I just read an interesting article in the April 2001 issue of Dr. Dobb's
Journal (titled "Web Site Searching & Indexing in Perl"). I don't know if
it is available online. If you use Perl (it is available for numerous
operating systems), you can use several Perl modules to help you. Have a
look at http://www.cpan.org/.

  Jeremy C. Reed
  http://www.reedmedia.net/
  http://bsd.reedmedia.net/  -- BSD news and resources
  http://www.isp-faq.com/    -- find answers to your questions




Re: URLs with "?"s in them

2000-03-11 Thread Jeremy C. Reed

On Fri, 10 Mar 2000, Marc Slemko wrote:

> Suppose I have a site.  A fairly large and popular site that is some sort
> of message board type site, with several million or so unique and
> legitimate messages.
>
> Suppose the URLs for the messages are all in the form
> http://site/foo/showme.foo?msgid=ID where ID identifies the message.
>
> Suppose I want common robots to index it, since the messages do
> contain useful content. It is to their advantage, because it gives
> them real and useful content and better results than other engines
> that don't index the messages, and it is to my advantage to have it
> indexed, since it brings people to the site.

I am guessing that these generated-on-the-fly web pages do not have
changing content. So why don't you have some programmer write a quick
script to locally fetch every one of these pages and resave them as plain
old static HTML pages? Your web server will have a lot less work to do,
and your pages can all be indexed by the major search engines.

If the "msgid" (in your example) has a standard format, this should be
quite trivial. Someone could write a routine to do this in just a few
minutes. (Of course, it may take hours to run and to verify the results.)
You could also automate the routine to run weekly, converting new dynamic
pages to static ones.
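
A rough sketch of such a routine in Perl, assuming the LWP module from
CPAN; the msgid range and the msg-<id>.html output naming are guesses for
illustration, since those details aren't given:

  #!/usr/bin/perl
  # Sketch of the dynamic-to-static conversion: fetch each message page
  # locally and write it out as a plain HTML file. The msgid range and
  # the msg-<id>.html naming scheme are assumptions for illustration.
  use strict;
  use warnings;
  use LWP::UserAgent;

  my $ua = LWP::UserAgent->new;

  for my $msgid (1 .. 1_000_000) {          # adjust to the real range of message ids
      my $response = $ua->get("http://site/foo/showme.foo?msgid=$msgid");
      next unless $response->is_success;    # skip ids that do not exist

      open my $out, '>', "msg-$msgid.html"
          or die "cannot write msg-$msgid.html: $!";
      print $out $response->content;
      close $out;
  }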

  Jeremy C. Reed
  http://www.reedmedia.net
  http://bsd.reedmedia.net