[ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358466 ]
Earl Cahill commented on NUTCH-120:
-----------------------------------

I can't really explain what was happening, but for a time many valid links would throw an exception, and then it just stopped. I don't think we really know what is going on in this code. For instance, what actually causes an exception to be thrown? I don't see any possibility of an infinite loop. For one, I still don't trust that links which throw an exception are really problematic, and I think that one such link shouldn't stop parsing. I am guessing that failed links aren't recorded or generally reviewed, so I see this as a place where parsing and crawling could fail and be pretty hard to track down. It just seems a little too unforgiving.

> one "bad" link on a page kills parsing
> --------------------------------------
>
>          Key: NUTCH-120
>          URL: http://issues.apache.org/jira/browse/NUTCH-120
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>  Environment: ubuntu 5.10
>     Reporter: Earl Cahill
>
> Since the try in the getOutlinks method of
> src/java/org/apache/nutch/parse/OutlinkExtractor.java wraps the whole
>
>     while (matcher.contains(input, pattern)) {
>       ...
>     }
>
> loop, if one URL causes an exception, no more links will be extracted.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
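A minimal sketch of the change the comment argues for: move the try/catch inside the extraction loop so that one bad link is skipped rather than aborting the whole run. For illustration this uses java.util.regex and java.net.URL with a made-up pattern and class name; the real OutlinkExtractor.getOutlinks uses the Jakarta ORO matcher shown in the quoted loop.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for OutlinkExtractor, not Nutch's actual code.
public class OutlinkSketch {

    // Rough URL pattern for illustration only, not Nutch's real regex.
    private static final Pattern URL_PATTERN =
            Pattern.compile("\\b[a-zA-Z][a-zA-Z0-9+.-]*://[^\\s\"'<>]+");

    public static List<String> getOutlinks(String plainText) {
        List<String> outlinks = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            String candidate = matcher.group();
            try {
                // Per-link validation: a malformed URL throws here...
                new URL(candidate);
                outlinks.add(candidate);
            } catch (MalformedURLException e) {
                // ...and is skipped; the loop continues with the next match.
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        String text = "see http://nutch.apache.org and htttp://broken "
                + "and http://example.com/page";
        // "htttp://broken" has an unknown protocol, so new URL(...) throws;
        // the two valid links are still extracted.
        System.out.println(getOutlinks(text));
    }
}
```

With the try inside the loop, a single exception costs one candidate link instead of every link after it on the page.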
