In a previous post, which I sent to the wrong list, I asked the following 
question, which Kir politely answered:

How does one clean things up? Here's my example of real data:

ASPseek database statistics

  Status    Expired      Total
-----------------------------
       0        211        211 Not indexed yet
     200          0       4738 OK
     301          0        129 Moved Permanently
     302          0        311 Moved Temporarily
     403          0          5 Forbidden
     404          0       2902 Not found
-----------------------------
   Total        211       8296

Kir's answer:

If you want to index not-indexed-yet URLs (status 0), use index -s 0

OK, I can understand this, and it does indeed work for the reindexing. But now 
I have another question along the same lines. You'll notice that the URLs 
with a non-200 status add up to 3,558 of the 8,296 total, which is a bit under 
half. OK, so they don't take up much space, but...
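
As a quick check of that estimate, here is a small Python sketch that adds up 
the counts from the statistics table above. The variable names are only for 
illustration; the numbers are copied from the table.

```python
# Total URLs per status, copied from the statistics table above.
totals = {0: 211, 200: 4738, 301: 129, 302: 311, 403: 5, 404: 2902}

all_urls = sum(totals.values())
non_200 = all_urls - totals[200]
share = 100.0 * non_200 / all_urls

print(f"{non_200} of {all_urls} URLs are non-200 ({share:.1f}%)")
# prints "3558 of 8296 URLs are non-200 (42.9%)"
```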

Most likely all those 404 Not Found URLs (2,902 of them) will never be found 
because the pages have been removed from their servers. These are all dead 
links. The way I see it, aspseek (index) will try to fetch them again when 
their index time is due. Why go through all this if these pages don't exist 
anyway? There's no sense in asking for something we know isn't there. That 
must take unnecessary resources.

So my question is: can I do this without fear of breaking aspseek?

index -C -s 404
index -C -s 403
index -C -s 301
index -C -s 302

And if I don't want to keep trying to fetch the status 0 URLs (probably DNS 
timeouts, which I don't want to wait around for anyway):

index -C -s 0

That would leave me with only status 200 URLs.

If the above works, do I then need to run this:

index -X1
index -X2
index -H

then from a mysql prompt do:

OPTIMIZE TABLE urlword;

Will this effectively remove all these URLs without breaking aspseek? Is the 
order of operations above correct?

My total index will be about 4 million URLs when done. If roughly 50% of 
them have a non-200 status, I can't see the point of trying to reindex 
2 million URLs that will never be fetched anyway. I don't care whether these 
non-200 URLs ever make it into the database.

Thanks a million for your help!

Stats as of today look like this. I would really like to clean things up if 
possible:

    Status    Expired      Total
   -----------------------------
         0      56254      56280 Not indexed yet
         1          0        125 Unknown status
       200          0    1737144 OK
       202          0         60 Unknown status
       204          0         41 No content
       205          0          1 Unknown status
       300          0         14 Multiple Choices
       301          0      54949 Moved Permanently
       302          0     104560 Moved Temporarily
       303          0          7 See Other
       307          0          8 Unknown status
       400          0         87 Bad Request
       401          0        155 Unauthorized
       402          0          3 Payment Required
       403          0       2760 Forbidden
       404          0     999505 Not found
       405          0          1 Method Not Allowed
       407          0          1 Proxy Authentication Required
       408          0          5 Request Timeout
       410          0          9 Gone
       415          0          1 Unsupported Media Type
       500          0        318 Internal Server Error
       501          0          8 Not Implemented
       502          0         11 Bad Gateway
       503          0        318 Service Unavailable
       504          0          7 Gateway Timeout
       508          0        262 Unknown status
   -----------------------------
     Total      56254    2956640
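
To make the cleanup I have in mind concrete, here is a Python sketch that 
reads the rows above and prints one "index -C -s <status>" line per non-200 
status. It is a dry run only: it prints the commands and does not execute 
anything, since whether running them is safe is exactly what I'm asking. The 
names STATS and parse_stats are made up for this example.

```python
# Rows copied from the statistics above: status, expired, total, description.
STATS = """\
0 56254 56280 Not indexed yet
1 0 125 Unknown status
200 0 1737144 OK
202 0 60 Unknown status
204 0 41 No content
205 0 1 Unknown status
300 0 14 Multiple Choices
301 0 54949 Moved Permanently
302 0 104560 Moved Temporarily
303 0 7 See Other
307 0 8 Unknown status
400 0 87 Bad Request
401 0 155 Unauthorized
402 0 3 Payment Required
403 0 2760 Forbidden
404 0 999505 Not found
405 0 1 Method Not Allowed
407 0 1 Proxy Authentication Required
408 0 5 Request Timeout
410 0 9 Gone
415 0 1 Unsupported Media Type
500 0 318 Internal Server Error
501 0 8 Not Implemented
502 0 11 Bad Gateway
503 0 318 Service Unavailable
504 0 7 Gateway Timeout
508 0 262 Unknown status
"""


def parse_stats(text):
    """Return (status, expired, total) tuples, skipping non-data lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        rows.append((int(parts[0]), int(parts[1]), int(parts[2])))
    return rows


rows = parse_stats(STATS)
total = sum(t for _, _, t in rows)
non_200 = sum(t for status, _, t in rows if status != 200)
print(f"{non_200} of {total} URLs would be cleared")

# Dry run: print the candidate commands, one per non-200 status.
commands = [f"index -C -s {status}" for status, _, _ in rows if status != 200]
for cmd in commands:
    print(cmd)
```

With today's numbers that is 1,219,496 of 2,956,640 URLs, spread over 26 
different status codes.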

