Just wondering how many of you have experienced the problem of index
displaying the error "can't connect to host" when in fact there is nothing
wrong with the host?
Before inserting new URLs into the database, I first check that the URL is
not already in aspseek's urlword table. If it is, I move on to the next URL.
If it isn't, I run it through my own crawler, written in Perl with LWP: I
take the list of URLs I want to add and do a HEAD request for each one. I
ignore any URL that does not return status 200 or whose content type is not
text/html or text/plain. If the URL comes back with a status of 200 in less
than 20 seconds, I save it to a text file for later insertion with
aspseek's index.
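For the curious, the filter itself boils down to something like this. My
actual script is Perl with LWP; the Python below is just an illustrative
re-statement, and the function name is made up for the sketch:

```python
# Sketch of the pre-insertion filter (my real script is Perl + LWP).
# Keep a URL only if the HEAD request answered 200 within 20 seconds
# and the reported content type is text/html or text/plain.
def keep_url(status, content_type, elapsed_seconds):
    """Return True if the URL should be saved for later insertion."""
    if status != 200:
        return False
    if elapsed_seconds >= 20:
        return False
    # Content-Type may carry parameters, e.g. "text/html; charset=utf-8"
    mime = content_type.split(";")[0].strip().lower()
    return mime in ("text/html", "text/plain")
```

In the real script the three arguments come straight out of the HEAD
response.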
The problem is that index often reports that it can't connect to the host,
yet I can connect with LWP and with a browser on the same machine without
any problem. The timeout setting in aspseek.conf is set much higher than
the one in my Perl script, so it's not a timeout problem.
I have now tallied this across more than 4 million URLs: aspseek reports
"can't connect to host" for roughly 5% of them (about 200,000 URLs), but
again, I have no problem reaching them with a browser, LWP, DIG or even GET.
So what's so different about index? Why can't it connect?
In addition to this problem, I also find that aspseek simply will not fetch
many URLs at all. They just sit there in urlword with a status of zero. I've
tried running:
./index -N 80 -R 64 -s 0
and if I'm lucky 10 or 20 get indexed, but that still leaves more than 6,000
URLs in "not yet indexed" with a status of zero. I can't think of any reason
index won't index them. I thought the whole purpose of "-s 0" was to force
index to fetch everything with a status of zero. Either fetch these things
or tell me why it won't; instead they just sit there, dead.
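If anyone wants to check their own database for the same thing, a count by
status from the mysql client shows how many URLs are stuck. This assumes
the status column I see in my urlword table; adjust names to your install:

```sql
-- How many URLs sit in each status (0 = not yet indexed)
SELECT status, COUNT(*) FROM urlword GROUP BY status;
```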
Any help with this would be appreciated.
A side note for those who care to know:
I have found several ways to corrupt aspseek's data even though that
shouldn't be possible. Not the MySQL tables, but the files under
/usr/local/aspseek/var/aspseek12/. According to the manual page for index:
index -C [-w] [-s status] [-t tag] [-u pattern] [configfile]
But I can corrupt aspseek every single time by running:
./index -C -s 0
There is no way to recover from it. I've run myisamchk on all the table
files and everything came back fine. I even tried:
./index -X1
./index -X2
./index -H
and even:
./index -D
which simply returns "Abort" after the files have been deleted via -s. No
matter what I do it stays corrupted! I've reproduced this 3 times with more
than 3 million URLs each time. The ONLY way to recover is to start all
over. So that's what I have done, and as long as I don't delete things
using:
./index -C -s (whatever status)
I can keep things working fine. I can still delete by URL pattern:
./index -C -u "http://www.somesite.com%"
but that's all I'll do, because recrawling several million URLs every
weekend was getting really old.
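Until the -C -s corruption is understood, the only protection I can suggest
is snapshotting the var tree before any risky delete, so recovery doesn't
mean a full recrawl. A minimal sketch (Python just for illustration; the
path is from my install and the function name is made up):

```python
# Snapshot /usr/local/aspseek/var/aspseek12 before running a risky
# "index -C -s ..." so the on-disk files can be restored if they get
# corrupted. Uses only the standard library.
import shutil
import time

def backup_var(var_dir="/usr/local/aspseek/var/aspseek12",
               dest_prefix="/tmp/aspseek-var"):
    """Create a gzipped tar of var_dir and return the archive path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return shutil.make_archive(f"{dest_prefix}-{stamp}", "gztar",
                               root_dir=var_dir)
```

Restoring is just unpacking the archive back over the var directory, with
nothing running against it at the time.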
Thanks again,
Karen
