On 6/11/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> I find in the search results that lots of HTTP 302 pages have been
> indexed. This is decreasing the quality of search results. Is there
> any way to disable indexing such pages?
>
> I want only HTTP 200 OK pages to be indexed.
If you run the fetcher and the parser separately, the parser has no way of knowing what status code the page returned. Since most 302 responses carry some form of HTML (usually something like "this page has moved here"), the parser assumes it is meaningful HTML and parses it. The fetcher doesn't have this problem: it only parses pages that return 200.

You can fix this by putting the status code in the Content's Metadata and then parsing only pages whose status code is 200. (Alternatively, since Nutch stores a page's headers in the content's metadata, you can check whether the content's metadata contains a "Location" header.)

--
Doğacan Güney

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
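As a rough sketch of the check described above — note this uses a plain `Map` to stand in for the content metadata, and the `shouldParse` helper and the `"status"` key are illustrative assumptions, not the exact Nutch `Content`/`Metadata` API:

```java
import java.util.HashMap;
import java.util.Map;

public class ParseFilter {
    // Hypothetical helper: decide whether a fetched page should be parsed,
    // based on headers the fetcher stored in the content's metadata.
    // (The real Nutch API differs; this only illustrates the logic.)
    static boolean shouldParse(Map<String, String> metadata) {
        // A "Location" header means the server issued a redirect (3xx).
        if (metadata.containsKey("Location")) {
            return false;
        }
        // If a status code was recorded, only parse 200 OK pages.
        String status = metadata.get("status");
        return status == null || status.equals("200");
    }

    public static void main(String[] args) {
        Map<String, String> ok = new HashMap<>();
        ok.put("status", "200");

        Map<String, String> redirect = new HashMap<>();
        redirect.put("status", "302");
        redirect.put("Location", "http://example.com/new");

        System.out.println(shouldParse(ok));       // true
        System.out.println(shouldParse(redirect)); // false
    }
}
```

In a real parser plugin you would read these values from the `Content` object handed to you by Nutch and skip parsing (or emit an empty parse) when the check fails.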
