Indexer should only index pages with fetch status SUCCESS
---------------------------------------------------------
Key: NUTCH-514
URL: https://issues.apache.org/jira/browse/NUTCH-514
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Doğacan Güney
Priority: Minor
Fix For: 1.0.0
Currently, if you parse during fetch, Nutch only parses pages that were fetched
successfully (i.e., have status STATUS_FETCH_SUCCESS). But if you run parse
as a separate job, Nutch also parses pages such as "404 not found" or "301 moved"
responses. Since most of these can be parsed successfully, they are indexed and
show up in search results.
IMO, we should either mark the content somehow so that a separate parse job
doesn't output pages whose status is not STATUS_FETCH_SUCCESS, or filter such
pages out in the Indexer.
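
To make the second option concrete, here is a rough sketch of what filtering
in the Indexer could look like. This is not Nutch code: FetchStatusFilter,
FetchStatus, Page and indexable() are hypothetical names made up for
illustration, and the enum only stands in for the fetch status
(STATUS_FETCH_SUCCESS and friends) that Nutch records for each page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Nutch API: all names below are invented.
final class FetchStatusFilter {

    // Stand-in for the per-page fetch status (assumed values).
    enum FetchStatus { SUCCESS, GONE, REDIR_PERM }

    // Minimal stand-in for a parsed page together with its fetch status.
    static final class Page {
        final String url;
        final FetchStatus status;

        Page(String url, FetchStatus status) {
            this.url = url;
            this.status = status;
        }
    }

    // Keep only pages whose fetch succeeded; parsed 404 or 301 responses
    // are dropped before they would reach the index.
    static List<Page> indexable(List<Page> parsed) {
        List<Page> out = new ArrayList<>();
        for (Page p : parsed) {
            if (p.status == FetchStatus.SUCCESS) {
                out.add(p);
            }
        }
        return out;
    }
}
```

The check is deliberately a whitelist on SUCCESS rather than a blacklist of
known failure statuses, so any status added later is excluded by default.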
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.