Julien Nioche created NUTCH-1666:
------------------------------------

             Summary: Optimisation for BasicURLNormalizer
                 Key: NUTCH-1666
                 URL: https://issues.apache.org/jira/browse/NUTCH-1666
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.7
            Reporter: Julien Nioche
            Priority: Minor
             Fix For: 1.8
         Attachments: NUTCH-1666.patch

The regular expressions in BasicURLNormalizer are quite costly. The attached 
patch skips that processing when a URL does not contain a sequence of 
interest (two slashes with zero, one or two dots in between, i.e. "//", 
"/./" or "/../").
This noticeably reduces the time spent in post-processing after parsing.
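
To illustrate the idea, here is a minimal sketch of such a pre-check. It is not 
the attached patch: the class name SlashDotPrecheck, its method names and the 
simplified expensiveNormalize step are all hypothetical, and it uses 
java.util.regex rather than whatever BasicURLNormalizer uses internally. It 
also assumes the check is applied to the path part only, since the "//" after 
the scheme would otherwise always match.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code from NUTCH-1666.patch.
final class SlashDotPrecheck {

    // Cheap guard: two slashes with zero, one or two dots in between,
    // i.e. "//", "/./" or "/../".
    private static final Pattern SEQUENCE_OF_INTEREST =
        Pattern.compile("/\\.{0,2}/");

    // Simplified stand-in for the costly rewriting rules (assumption):
    // a path segment followed by "/../".
    private static final Pattern PARENT_SEGMENT =
        Pattern.compile("/[^/]+/\\.\\./");

    static boolean hasSequenceOfInterest(String path) {
        return SEQUENCE_OF_INTEREST.matcher(path).find();
    }

    // Placeholder for the expensive part; the real normalizer does more.
    static String expensiveNormalize(String path) {
        String previous;
        do {
            previous = path;
            path = path.replace("/./", "/");
            path = PARENT_SEGMENT.matcher(path).replaceFirst("/");
        } while (!path.equals(previous));
        return path;
    }

    // Skip the expensive step entirely when the guard does not match.
    static String normalizePath(String path) {
        if (!hasSequenceOfInterest(path)) {
            return path;
        }
        return expensiveNormalize(path);
    }
}
```

Most paths seen during a crawl contain none of these sequences, so in this 
sketch they cost a single scan instead of several rewrite passes.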



--
This message was sent by Atlassian JIRA
(v6.1#6144)
