[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846932#action_12846932 ]
Andrzej Bialecki commented on NUTCH-802:
-----------------------------------------

We already have a general way to control this and other aspects of URLs, namely URLFilters. I agree that this functionality could be useful, but in the form of a URLFilter (or by adding this control to e.g. urlfilter-basic or urlfilter-validator).

> Problems managing outlinks with large url length
> ------------------------------------------------
>
>                 Key: NUTCH-802
>                 URL: https://issues.apache.org/jira/browse/NUTCH-802
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Pablo Aragón
>            Assignee: Andrzej Bialecki
>         Attachments: ParseOutputFormat.patch
>
>
> Nutch can stall during the collection of outlinks if the URL of
> an outlink is too long.
> The maximum sizes of a URL for the main web servers are:
> * Apache: 4,000 bytes
> * Microsoft Internet Information Server (IIS): 16,384 bytes
> * Perl HTTP::Daemon: 8,000 bytes
> URLs longer than 4,000 bytes are problematic, so the limit should
> be set in the nutch-default.xml configuration file.
> I attached a patch.
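
For illustration, the filter-based approach suggested in the comment could look roughly like the sketch below. It is a standalone example, not the attached patch and not an actual Nutch plugin: the class name MaxLengthUrlFilter and its constructor parameter are hypothetical, and the class only mirrors the assumed URLFilter convention of returning the URL to keep it and null to reject it. The 4000-byte default is the threshold named in the issue description.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical, standalone sketch of a length-limiting URL filter.
 * It does not implement Nutch's URLFilter interface; it only mirrors the
 * assumed contract: return the URL to keep it, or null to reject it.
 */
final class MaxLengthUrlFilter {

    /** Threshold taken from the issue description (4000 bytes). */
    static final int DEFAULT_MAX_BYTES = 4000;

    private final int maxBytes;

    MaxLengthUrlFilter(int maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /**
     * Returns urlString unchanged if its UTF-8 encoding fits within
     * maxBytes, otherwise null. A null input is rejected as well.
     */
    String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        int lengthInBytes = urlString.getBytes(StandardCharsets.UTF_8).length;
        return lengthInBytes <= maxBytes ? urlString : null;
    }
}
```

In a real plugin the limit would presumably be read from the Nutch configuration rather than passed to a constructor, so that it can be set in nutch-default.xml as the reporter proposes.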