Impose a limit on the length of outlink target urls
---------------------------------------------------

                 Key: NUTCH-1314
                 URL: https://issues.apache.org/jira/browse/NUTCH-1314
             Project: Nutch
          Issue Type: Improvement
            Reporter: Ferdy Galema
         Attachments: NUTCH-1314.patch

In the past we have encountered situations where crawling specific broken sites 
resulted in ridiculously long urls that caused tasks to stall. The regex 
plugins (normalizing/filtering) processed single urls for hours, if not 
hanging indefinitely.

My suggestion is to limit the outlink target url length as early as possible. 
The limit is configurable, with a default of 3000. This should be long enough 
for most uses, yet strict enough to make sure the regex plugins do not choke 
on urls that are too long. Please see the attached patch for the Nutchgora 
implementation.
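
To illustrate the idea (this is not the attached patch), here is a minimal 
sketch of such a length check. The class name OutlinkLengthLimiter and the 
method limit are hypothetical, and the sketch works on plain strings rather 
than Nutch's own outlink type. Only the default of 3000 comes from this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: drops outlink targets whose url
// is longer than a configurable limit before any regex plugin sees them.
class OutlinkLengthLimiter {

    // Default limit as proposed in this issue.
    static final int DEFAULT_MAX_LENGTH = 3000;

    // Returns only the targets whose length does not exceed maxLength.
    // Null entries are dropped as well.
    static List<String> limit(List<String> targets, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : targets) {
            if (url != null && url.length() <= maxLength) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build an artificially long url, as a broken site might produce.
        StringBuilder longUrl = new StringBuilder("http://example.com/");
        while (longUrl.length() <= DEFAULT_MAX_LENGTH) {
            longUrl.append("a/");
        }

        List<String> targets = new ArrayList<>();
        targets.add("http://example.com/index.html");
        targets.add(longUrl.toString());

        // Only the short url survives the check.
        List<String> kept = limit(targets, DEFAULT_MAX_LENGTH);
        System.out.println(kept.size() + " of " + targets.size() + " outlinks kept");
    }
}
```

Doing the check before normalizing and filtering means the cost per url is a 
single length comparison, regardless of how the regex plugins would behave on 
the input.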

I'd like to hear what you think about this.
