In a previous post, which I sent to the wrong list, I asked the following
question, which Kir politely answered:
How does one clean things up? Here's my example of real data:
ASPseek database statistics

Status  Expired  Total
-----------------------------------------
     0      211    211  Not indexed yet
   200        0   4738  OK
   301        0    129  Moved Permanently
   302        0    311  Moved Temporarily
   403        0      5  Forbidden
   404        0   2902  Not found
-----------------------------------------
 Total      211   8296
Kir's answer:
If you want to index not-indexed-yet URLs (status 0), use
index -s 0
OK, I can understand this, and it does indeed work for the reindexing. But now
I have another question along the same lines. You'll notice that the URLs
with a non-200 status add up to roughly 43% of the total (3,558 of 8,296).
OK, so they don't take up much space, but...
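For anyone who wants to check my arithmetic, here is a small Python sketch that adds up the figures from the statistics above. The dictionary and variable names are my own; only the counts come from the index output. The share works out to a little under half.

```python
# Totals per HTTP status, copied from the ASPseek statistics above.
totals = {0: 211, 200: 4738, 301: 129, 302: 311, 403: 5, 404: 2902}

all_urls = sum(totals.values())
non_200 = all_urls - totals[200]  # everything that is not status 200
share = non_200 / all_urls

print(f"{non_200} of {all_urls} URLs are non-200 ({share:.1%})")
# prints: 3558 of 8296 URLs are non-200 (42.9%)
```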
Most likely all those 404 Not Found URLs (2,902 of them) will never be found,
because the pages have been removed from their servers. These are all dead
links. The way I see it, aspseek (index) will try to fetch them again when
their reindex time is due. Why go through all this if these pages don't exist
anyway? No sense in asking for something we know isn't there. That must
waste resources unnecessarily.
So my question is: can I do this without fear of breaking aspseek?
index -C -s 404
index -C -s 403
index -C -s 301
index -C -s 302
and, if I don't want to keep trying to fetch the status 0 URLs (probably DNS
timeouts, which I don't want to wait around for anyway):
index -C -s 0
which would leave me with only status 200 URLs.
If the above will work, do I then need to run this:
index -X1
index -X2
index -H
then from a mysql prompt do:
OPTIMIZE TABLE urlword;
Will this effectively remove all of these without breaking aspseek? Is the
order of operations above correct?
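To make the sequence I'm asking about concrete, here is a dry-run sketch in Python that only prints the commands in the order I listed them; it does not execute anything. The flags are exactly the ones above. The function name and the status list are just for illustration, and whether this order is actually right is the open question.

```python
# Dry run: build the proposed cleanup sequence without executing anything.
# Statuses to clear; include 0 only if not-yet-indexed URLs should go too.
STATUSES_TO_CLEAR = [404, 403, 301, 302, 0]

def cleanup_plan(statuses):
    """Return the proposed commands, in the order asked about above."""
    plan = [f"index -C -s {status}" for status in statuses]
    plan += ["index -X1", "index -X2", "index -H"]
    # The last step is meant for a mysql prompt, not the shell.
    plan.append("OPTIMIZE TABLE urlword;")
    return plan

for step in cleanup_plan(STATUSES_TO_CLEAR):
    print(step)
```

Printing the plan first makes it easy to review the order before running anything against a real database.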
My total index will be about 4 million URLs when done. If close to half of
them are non-200 status, I can't see trying to reindex nearly 2 million URLs
that will never be fetched anyway. I don't care if these non-200 URLs ever
make it into the database.
Thanks a million for your help!