[ 
http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358466 ] 

Earl Cahill commented on NUTCH-120:
-----------------------------------

I can't really explain what was happening, but for a time many valid links 
would throw an exception, and then it just stopped.  I don't think we really 
understand what is going on in the code.  Like, what actually causes an 
exception to get thrown?  I don't see the possibility of an infinite loop.

I for one still don't trust that links that throw an exception are really 
problematic, and I think that one such link shouldn't stop parsing.  I am 
guessing that failed links aren't recorded or generally reviewed, so I see this 
as a place where parsing and crawling could fail and be pretty hard to track 
down.  It just seems a little too unforgiving.

> one "bad" link on a page kills parsing
> --------------------------------------
>
>          Key: NUTCH-120
>          URL: http://issues.apache.org/jira/browse/NUTCH-120
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>  Environment: ubuntu 5.10
>     Reporter: Earl Cahill

>
> Since the try block in the getOutlinks method of 
> src/java/org/apache/nutch/parse/OutlinkExtractor.java wraps the whole
> while (matcher.contains(input, pattern)) {
> ...
> }
> loop, if one URL causes an exception, no more links will be extracted.
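The fix the quoted description implies is to move the try/catch inside the loop, so a single bad URL is skipped rather than aborting the whole extraction. A minimal sketch of that pattern follows; it is not the actual Nutch code, and it uses java.util.regex and java.net.URI for validation instead of the Jakarta ORO matcher that OutlinkExtractor actually uses:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {

    // Hypothetical extractor: the try/catch sits INSIDE the loop,
    // so one malformed URL cannot stop extraction of later links.
    public static List<String> getOutlinks(String text) {
        List<String> links = new ArrayList<>();
        Pattern pattern = Pattern.compile("https?://\\S+");
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            String candidate = matcher.group();
            try {
                // Validate each candidate individually; a bad one throws here.
                links.add(new URI(candidate).toString());
            } catch (URISyntaxException e) {
                // Skip only this link instead of aborting the whole loop.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "see http://example.com/a and http://bad^url "
                + "then http://example.org/b";
        // The bad link is dropped; the links before and after it survive.
        System.out.println(getOutlinks(page));
    }
}
```

With the try outside the loop, as in the reported code, the URISyntaxException on the second candidate would end the loop and the third link would be lost.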

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira