one "bad" link on a page kills parsing
--------------------------------------
Key: NUTCH-120
URL: http://issues.apache.org/jira/browse/NUTCH-120
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.7
Environment: ubuntu 5.10
Reporter: Earl Cahill

In src/java/org/apache/nutch/parse/OutlinkExtractor.java, the try block in the
getOutlinks method wraps the entire
while (matcher.contains(input, pattern)) {
...
}
loop. As a result, if a single URL causes an exception, control leaves the
loop and no further links are extracted from that page.
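
One possible fix is to move the try/catch inside the loop, so that a failing URL is skipped and extraction continues. The following is only a minimal sketch of that idea, not the actual Nutch code: it uses java.util.regex and java.net.URI instead of the matcher shown above, and the class name OutlinkSketch, the method extractOutlinks, and the simplified URL pattern are all hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for OutlinkExtractor; not the Nutch implementation.
class OutlinkSketch {

    // Simplified, assumed URL pattern (scheme://non-whitespace); not the one Nutch uses.
    private static final Pattern URL_PATTERN =
        Pattern.compile("[a-zA-Z][a-zA-Z0-9+.-]*://\\S+");

    static List<String> extractOutlinks(String text) {
        List<String> outlinks = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            String candidate = matcher.group();
            // The try block sits INSIDE the loop, so one bad link
            // only skips itself instead of ending extraction.
            try {
                new URI(candidate).toURL(); // throws for unknown schemes or bad syntax
                outlinks.add(candidate);
            } catch (URISyntaxException | MalformedURLException
                     | IllegalArgumentException e) {
                // Skip this candidate and continue with the next match.
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        String page = "see http://a.example/x and bogus://bad/link and http://b.example/y";
        // The link with the unknown "bogus" scheme is dropped; the other two are kept.
        System.out.println(extractOutlinks(page));
    }
}
```

In this sketch the link that follows the bad one is still returned, which is the behavior the report asks for. Whether the real fix should also log the skipped URL is a separate decision.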