Just wondering how many of you have experienced the problem of index
displaying the error "can't connect to host" when in fact there is nothing
wrong with the host?
Before inserting new URLs into the database, I first check that the URL is
not already in aspseek's urlword table. If it is, I move on to the next URL.
If it isn't, I run it through my own crawler, written in Perl with LWP: I
take the list of URLs I want to add and do a HEAD request for each one. I
ignore any URL that does not return status 200 or whose content type is not
text/html or text/plain. If the URL comes back with a status of 200 in less
than 20 seconds, I save it to a text file for later insertion with
aspseek's index.
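For the curious, the filter itself boils down to something like this. My
actual script is Perl with LWP; the Python below is just an illustrative
re-statement, and the function name is made up for the sketch:

```python
# Sketch of the pre-insertion filter (my real script is Perl + LWP).
# Keep a URL only if the HEAD request answered 200 within 20 seconds
# and the reported content type is text/html or text/plain.
def keep_url(status, content_type, elapsed_seconds):
    """Return True if the URL should be saved for later insertion."""
    if status != 200:
        return False
    if elapsed_seconds >= 20:
        return False
    # Content-Type may carry parameters, e.g. "text/html; charset=utf-8"
    mime = content_type.split(";")[0].strip().lower()
    return mime in ("text/html", "text/plain")
```

In the real script the three arguments come straight out of the HEAD
response.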
The problem is that index often reports that it can't connect to the host,
yet I can connect with LWP and with a browser on the same machine without
any problem. The timeout setting in aspseek.conf is set much higher than
the one in my Perl script, so it's not a timeout problem.
I have now tallied this across more than 4 million URLs: aspseek reports
"can't connect to host" for roughly 5% of them (about 200,000 URLs), but
again, I have no problem reaching them with a browser, LWP, DIG or even GET.
So what's so different about index? Why can't it connect?
In addition to this problem, I also find that aspseek simply will not fetch
many URLs at all. They just sit there in urlword with a status of zero. I've
tried running:
./index -N 80 -R 64 -s 0
and if I'm lucky 10 or 20 get indexed, but that still leaves more than 6,000
URLs in "not yet indexed" with a status of zero. I can't think of any reason
index won't index them. I thought the whole purpose of "-s 0" was to force
index to fetch everything with a status of zero. Either fetch these things
or tell me why it won't; instead they just sit there, dead.
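If anyone wants to check their own database for the same thing, a count by
status from the mysql client shows how many URLs are stuck. This assumes
the status column I see in my urlword table; adjust names to your install:

```sql
-- How many URLs sit in each status (0 = not yet indexed)
SELECT status, COUNT(*) FROM urlword GROUP BY status;
```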
Any help with this would be appreciated.
A side note for those who care to know:
I have found several ways to corrupt aspseek's data even though that
shouldn't be possible. Not the MySQL tables, but the files under
/usr/local/aspseek/var/aspseek12/. According to the manual page for index:
index -C [-w] [-s status] [-t tag] [-u pattern] [configfile]
But I can corrupt aspseek every single time by running:
./index -C -s 0
There is no way to recover from it. I've run myisamchk on all the table
files and everything came back fine. I even tried:
./index -X1
./index -X2
./index -H
and even:
./index -D
which simply returns "Abort" after the files have been deleted via -s. No
matter what I do it stays corrupted! I've reproduced this 3 times with more
than 3 million URLs each time. The ONLY way to recover is to start all
over. So that's what I have done, and as long as I don't delete things
using:
./index -C -s (whatever status)
I can keep things working fine. I can still delete by URL pattern:
./index -C -u "http://www.somesite.com%"
but that's all I'll do, because recrawling several million URLs every
weekend was getting really old.
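Until the -C -s corruption is understood, the only protection I can suggest
is snapshotting the var tree before any risky delete, so recovery doesn't
mean a full recrawl. A minimal sketch (Python just for illustration; the
path is from my install and the function name is made up):

```python
# Snapshot /usr/local/aspseek/var/aspseek12 before running a risky
# "index -C -s ..." so the on-disk files can be restored if they get
# corrupted. Uses only the standard library.
import shutil
import time

def backup_var(var_dir="/usr/local/aspseek/var/aspseek12",
               dest_prefix="/tmp/aspseek-var"):
    """Create a gzipped tar of var_dir and return the archive path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return shutil.make_archive(f"{dest_prefix}-{stamp}", "gztar",
                               root_dir=var_dir)
```

Restoring is just unpacking the archive back over the var directory, with
nothing running against it at the time.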
Thanks again,
Karen
