In a previous post, which I sent to the wrong list, I asked this question,
which Kir politely answered:
How does one clean things up? Here's my example of real data:
ASPseek database statistics
Status  Expired    Total
-----------------------------
     0      211      211  Not indexed yet
   200        0     4738  OK
   301        0      129  Moved Permanently
   302        0      311  Moved Temporarily
   403        0        5  Forbidden
   404        0     2902  Not found
-----------------------------
 Total      211     8296
Kir's answer:
If you want to index not-indexed-yet URLs (status 0), use index -s 0
OK, I understand this, and it does indeed work for the reindexing. But now
I have another question along the same lines. You'll notice that the URLs
with a non-200 status add up to roughly 43% of the total (3,558 of 8,296).
OK, so they don't take up much space, but...
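The share of non-200 URLs in the first table can be checked with a few lines
of Python. This is only a quick arithmetic sketch; the dictionary is typed in
by hand from the table, and the variable names are made up for illustration.

```python
# Totals copied by hand from the first statistics table (status -> Total).
totals = {0: 211, 200: 4738, 301: 129, 302: 311, 403: 5, 404: 2902}

all_urls = sum(totals.values())   # every URL in the database
non_200 = all_urls - totals[200]  # everything that is not "200 OK"
share = non_200 / all_urls

print(f"{non_200} of {all_urls} URLs are non-200 ({share:.1%})")
```

With the numbers above this prints "3558 of 8296 URLs are non-200 (42.9%)".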
Most likely all those 404 Not Found URLs (2,902 of them) will never be found,
because the pages have been removed from their servers. These are all dead
links. The way I see it, aspseek (index) will try to fetch them again when
their index time is due. Why go through all this if these pages don't exist
anyway? There's no sense in asking for something we know isn't there. That
MUST take unnecessary resources.
So my question is: can I do this without fear of breaking aspseek?
index -C -s 404
index -C -s 403
index -C -s 301
index -C -s 302
And if I don't want to keep retrying status 0 (probably DNS timeouts,
which I don't want to wait around for anyway):
index -C -s 0
which will now leave me with only status 200 URLs.
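Since the commands above all follow the same pattern, the list can be
generated rather than typed. The sketch below is a dry run: it only prints the
command strings exactly as written in this post and executes nothing. The
helper name clear_commands is made up for illustration, and whether
index -C -s is actually safe to run this way is still the open question here.

```python
# Statuses this post proposes to clear; 200 (OK) is deliberately left out.
statuses_to_clear = [404, 403, 301, 302, 0]


def clear_commands(statuses):
    """Build one 'index -C -s <status>' command string per status.

    Hypothetical helper: it formats strings only and never runs index.
    """
    return [f"index -C -s {status}" for status in statuses]


for command in clear_commands(statuses_to_clear):
    print(command)  # dry run: print only, nothing is executed
```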
If the above works, do I then need to run this:
index -X1
index -X2
index -H
then from a mysql prompt do:
OPTIMIZE TABLE urlword;
Will this effectively remove all of these and at the same time not break
aspseek? Is the order of operations above correct?
My total index will be about 4 million URLs when done. If roughly 50% of
them have a non-200 status, I can't see the point of trying to reindex
2 million URLs that will never be fetched anyway. I don't care if these
non-200 URLs ever make it into the database.
Thanks a million for your help!
Stats as of today look like this. I would really like to clean things up if
possible:
Status  Expired    Total
-----------------------------
     0    56254    56280  Not indexed yet
     1        0      125  Unknown status
   200        0  1737144  OK
   202        0       60  Unknown status
   204        0       41  No content
   205        0        1  Unknown status
   300        0       14  Multiple Choices
   301        0    54949  Moved Permanently
   302        0   104560  Moved Temporarily
   303        0        7  See Other
   307        0        8  Unknown status
   400        0       87  Bad Request
   401        0      155  Unauthorized
   402        0        3  Payment Required
   403        0     2760  Forbidden
   404        0   999505  Not found
   405        0        1  Method Not Allowed
   407        0        1  Proxy Authentication Required
   408        0        5  Request Timeout
   410        0        9  Gone
   415        0        1  Unsupported Media Type
   500        0      318  Internal Server Error
   501        0        8  Not Implemented
   502        0       11  Bad Gateway
   503        0      318  Service Unavailable
   504        0        7  Gateway Timeout
   508        0      262  Unknown status
-----------------------------
 Total    56254  2956640
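To see how much of today's index a cleanup would touch, the statistics text
can be parsed directly instead of being added up by hand. This is a rough
sketch that assumes every data row has the form "status expired total
description", as in the table above; the names STATS, ROW and parse_stats
are made up for illustration and are not part of aspseek.

```python
import re

# Today's statistics, pasted from the table above (header and total included).
STATS = """\
Status Expired Total
-----------------------------
0 56254 56280 Not indexed yet
1 0 125 Unknown status
200 0 1737144 OK
202 0 60 Unknown status
204 0 41 No content
205 0 1 Unknown status
300 0 14 Multiple Choices
301 0 54949 Moved Permanently
302 0 104560 Moved Temporarily
303 0 7 See Other
307 0 8 Unknown status
400 0 87 Bad Request
401 0 155 Unauthorized
402 0 3 Payment Required
403 0 2760 Forbidden
404 0 999505 Not found
405 0 1 Method Not Allowed
407 0 1 Proxy Authentication Required
408 0 5 Request Timeout
410 0 9 Gone
415 0 1 Unsupported Media Type
500 0 318 Internal Server Error
501 0 8 Not Implemented
502 0 11 Bad Gateway
503 0 318 Service Unavailable
504 0 7 Gateway Timeout
508 0 262 Unknown status
-----------------------------
Total 56254 2956640
"""

# A data row starts with three integers followed by a description.
# The header, the separator lines and the "Total" line do not match.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+\S")


def parse_stats(text):
    """Return {status: total} for every data row (hypothetical helper)."""
    totals = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            status, _expired, total = map(int, match.groups())
            totals[status] = total
    return totals


totals = parse_stats(STATS)
all_urls = sum(totals.values())
non_200 = all_urls - totals[200]
print(f"total:   {all_urls}")
print(f"non-200: {non_200} ({non_200 / all_urls:.1%})")
print(f"404:     {totals[404]} ({totals[404] / all_urls:.1%})")
```

On today's numbers the per-status totals add up to the reported 2956640, of
which 1219496 (about 41%) are non-200 and 999505 (about 34%) are 404s.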