Hi again!


ricarDo> thanks in advance for any further help.  I have a web crawler
ricarDo> running in a solaris+apache+mod_perl web server, and for some
ricarDo> reason, when I try go get the contents of a certain page, it
ricarDo> hangs and gives no timeout whatsoever.

>First, are you *really* sure you want to write Yet Another WebCrawler?

- I'm not writing a web crawler from scratch. The initial source code was
based on Google's, adapted to serve the specific needs of this web site.



>Are you following the Robot Exclusion Protocol?

- Yes, unless the problem resides there.

...
# $rules is a WWW::RobotRules object; fetch and parse the site's robots.txt
$roboturl   = "http://$server_host/robots.txt";
$robots_txt = get($roboturl);          # plain get(), no explicit timeout
$rules->parse($roboturl, $robots_txt);
...
if ($rules->allowed($thisURL))
{
    # etcetcetc
...
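
For what it's worth, here's a rough sketch of what I'm thinking of changing
it to -- an LWP::UserAgent with an explicit timeout instead of the bare
get(), with WWW::RobotRules fed from that same agent. The agent name and
the 30-second value are just placeholders I made up:

use LWP::UserAgent;
use WWW::RobotRules;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent('ricardo-crawler/0.1');   # placeholder agent string
$ua->timeout(30);                    # give up on a silent server after 30s

my $roboturl = "http://$server_host/robots.txt";
my $response = $ua->request(HTTP::Request->new(GET => $roboturl));

my $rules = WWW::RobotRules->new('ricardo-crawler/0.1');
$rules->parse($roboturl, $response->content) if $response->is_success;

if ($rules->allowed($thisURL)) {
    # fetch $thisURL with the same $ua, so it gets the same timeout
    my $page = $ua->request(HTTP::Request->new(GET => $thisURL));
    # etcetcetc
}

LWP::RobotUA would presumably bundle the robots.txt handling and request
spacing into the agent itself, but I haven't tried it yet.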



>Did you read my five or six columns that talk about spiders and LWP in
><http://www.stonehenge.com/merlyn/WebTechniques/>?

- I'm browsing it right now ;)



>What happens when you visit that URL with a browser?  With "GET" from the
>command line?

- The site is online (I did a telnet to the server's port 80 and got a
response).
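
To figure out where it blocks, I'm also planning to wrap the fetch in an
alarm(), since as far as I can tell LWP's timeout only fires when the
connection goes completely silent, so a server that trickles bytes can keep
a request open indefinitely. A rough sketch (the 60-second ceiling is
arbitrary):

use LWP::Simple qw(get);

# Hard deadline around the fetch; LWP's own timeout is only an
# inactivity timeout between reads, not a total time limit.
my $content = eval {
    local $SIG{ALRM} = sub { die "crawl timeout\n" };
    alarm(60);                      # arbitrary 60s ceiling
    my $page = get($thisURL);
    alarm(0);
    $page;
};
if ($@) {
    warn "gave up on $thisURL: $@";
}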



ricarDo> besides this problem, the crawler is monolithic. can I do a
ricarDo> fork to speed things up? Any suggestions?

>If you're asking questions like this, you need to first study the existing
>art before innovating.

- It seemed like a good idea at first, but then I started to think about
network congestion... Any thoughts?
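
To make the fork idea more concrete, this is roughly what I had in mind --
something like Parallel::ForkManager to cap the number of simultaneous
fetches so the crawler can't flood our link or the remote hosts. The limit
of 5, the timeout and @urls are placeholders:

use Parallel::ForkManager;
use LWP::UserAgent;

my $pm = Parallel::ForkManager->new(5);   # at most 5 fetches at once
my $ua = LWP::UserAgent->new(timeout => 30);

foreach my $url (@urls) {                 # @urls: queue of pages to crawl
    $pm->start and next;                  # parent: move on to next URL
    my $resp = $ua->get($url);            # child: do the fetch
    warn "failed: $url\n" unless $resp->is_success;
    $pm->finish;                          # child exits
}
$pm->wait_all_children;

Capping concurrency like this (plus per-host delays, which LWP::RobotUA can
apparently enforce) seems to be the usual way around the congestion worry,
but I'd welcome corrections.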


./Ricardo Oliveira
