Deduplicate anchors before indexing
-----------------------------------

                 Key: NUTCH-1037
                 URL: https://issues.apache.org/jira/browse/NUTCH-1037
             Project: Nutch
          Issue Type: Improvement
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.4, 2.0


Anchors are not deduplicated before indexing. This can result in a very large 
number of similar or identical anchors being indexed. Anchors should be 
deduplicated before indexing, at least case-insensitively.
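
A minimal sketch of what case-insensitive deduplication could look like, 
assuming the anchors for a document are available as a list of strings. The 
class AnchorDedup and the method dedupeIgnoreCase are hypothetical names used 
for illustration only; they are not existing Nutch code, and where this would 
be hooked into the indexing code is not decided in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Nutch code base.
final class AnchorDedup {

    // Returns the anchors with case-insensitive duplicates removed.
    // The first spelling encountered is kept and input order is preserved.
    static List<String> dedupeIgnoreCase(List<String> anchors) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String anchor : anchors) {
            if (anchor == null) {
                continue; // assumption: null anchors are simply skipped
            }
            // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
            String key = anchor.toLowerCase(Locale.ROOT);
            if (seen.add(key)) {
                unique.add(anchor);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Home", "home", "HOME", "About us");
        System.out.println(dedupeIgnoreCase(anchors)); // prints [Home, About us]
    }
}
```

Keeping the first spelling is an arbitrary choice made for this sketch. 
Comparing on the lowercased form only catches duplicates that differ in case; 
anchors that differ in whitespace or punctuation would still be indexed 
separately.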

Should this be implemented as a fix or as a new feature that needs to be 
configured?


        
