Sebastian Nagel created NUTCH-2335:
--------------------------------------

             Summary: Injector not to filter and normalize existing URLs in CrawlDb
                 Key: NUTCH-2335
                 URL: https://issues.apache.org/jira/browse/NUTCH-2335
             Project: Nutch
          Issue Type: Improvement
          Components: crawldb, injector
    Affects Versions: 1.12
            Reporter: Sebastian Nagel
             Fix For: 1.13


With NUTCH-1712, the behavior of the Injector changed for the case where new
URLs are added to an existing CrawlDb:
- before, only the injected URLs were filtered and normalized
- now, filters and normalizers are applied to all URLs, including those already
in the CrawlDb

The default should be, as before, not to filter or normalize existing URLs.
Filtering and normalizing may take a long time for large CrawlDbs and/or complex
URL filters. If the URL filter or normalizer rules have not changed, there is no
need to apply them anew every time new URLs are added. Of course, injected URLs
should still be filtered and normalized by default.
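
To illustrate the proposed default, here is a minimal, self-contained sketch.
It is not Nutch code: the class InjectSketch, the in-memory map standing in for
the CrawlDb, the toy NORMALIZE and ACCEPT rules, and the flag name
filterNormalizeAll are all assumptions made for this example. It also assumes
that an existing entry wins over an injected one with the same URL.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

public class InjectSketch {

    // Hypothetical normalizer: strips a trailing "#fragment" from the URL.
    static final Function<String, String> NORMALIZE = url -> {
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    };

    // Hypothetical filter: rejects URLs that contain a query string.
    static final Predicate<String> ACCEPT = url -> !url.contains("?");

    /**
     * Merges seed URLs into an in-memory stand-in for the CrawlDb.
     * Seeds are always normalized and filtered. Existing entries are
     * only normalized and filtered when filterNormalizeAll is true.
     */
    static Map<String, String> inject(Map<String, String> crawlDb,
                                      List<String> seeds,
                                      boolean filterNormalizeAll) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : crawlDb.entrySet()) {
            String url = entry.getKey();
            if (filterNormalizeAll) {
                url = NORMALIZE.apply(url);
                if (!ACCEPT.test(url)) {
                    continue; // existing entry dropped by the filter
                }
            }
            result.put(url, entry.getValue());
        }
        for (String seed : seeds) {
            String url = NORMALIZE.apply(seed);
            if (ACCEPT.test(url)) {
                // Assumption: an existing entry is kept, not overwritten.
                result.putIfAbsent(url, "injected");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://a.example/page?x=1", "fetched");
        crawlDb.put("http://a.example/", "fetched");
        List<String> seeds = Arrays.asList(
            "http://b.example/#top", "http://b.example/q?y=2");

        // Proposed default: existing entries pass through untouched.
        System.out.println(inject(crawlDb, seeds, false).keySet());
        // Opt-in: rules are applied to existing entries as well.
        System.out.println(inject(crawlDb, seeds, true).keySet());
    }
}
```

With the flag off, the cost of filtering and normalizing grows with the number
of injected URLs rather than with the size of the CrawlDb, which is the point
of the proposal.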



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
