On 6/11/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> I find in the search results that lots of HTTP 302 pages have been
> indexed. This is decreasing the quality of search results. Is there
> any way to disable indexing such pages?
>
> I want only HTTP 200 OK pages to be indexed.
>

If you run the fetcher and the parser separately, the parser has no way
of knowing what status code the page returned. Since most 302 responses
include some HTML body (usually something like "this page has moved
here"), the parser assumes it is meaningful HTML and parses it. The
fetcher doesn't have this problem: it only parses pages that return 200.

You can fix this by storing the status code in the Content's metadata
and then parsing only pages whose status code is 200. (Alternatively,
since Nutch stores the page's headers in the Content's metadata, you can
check whether the metadata contains a "Location" header, which redirects
carry.)
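To illustrate the idea, here is a minimal sketch of that check. This is not Nutch code: a plain Map stands in for the Content metadata, and the key names ("status", "Location") are assumptions; in a real parse filter you would read the equivalent keys from content.getMetadata().

```java
import java.util.HashMap;
import java.util.Map;

public class ParseFilter {

    // Decide whether a fetched page should be parsed, based on its metadata.
    // A plain Map stands in here for Nutch's Content metadata (assumption);
    // the "status" and "Location" keys are illustrative names.
    static boolean shouldParse(Map<String, String> metadata) {
        // Redirect responses carry a "Location" header; skip those outright.
        if (metadata.containsKey("Location")) {
            return false;
        }
        // If a status code was recorded at fetch time, require 200 OK.
        String status = metadata.get("status");
        return status == null || "200".equals(status);
    }

    public static void main(String[] args) {
        Map<String, String> ok = new HashMap<>();
        ok.put("status", "200");

        Map<String, String> redirect = new HashMap<>();
        redirect.put("status", "302");
        redirect.put("Location", "http://example.com/new");

        System.out.println(shouldParse(ok));       // true
        System.out.println(shouldParse(redirect)); // false
    }
}
```

Either test alone would do; checking both the status code and the Location header covers the case where only one of them made it into the metadata.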

-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general