[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2683.
------------------------------------
    Resolution: Implemented

> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The deduplication job can keep the shortest URL as the "best" URL of a
> set of duplicates, marking all longer ones as duplicates. Search engines
> have recently started to penalize non-https pages by [giving https pages a
> higher rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
> and [marking http as
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should
> be able to prefer https:// over http:// URLs, although the latter are one
> character shorter. Of course, this should be configurable and should apply
> in addition to the existing preferences (length, score and fetch time) used
> to select the "best" URL among duplicates.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
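
The preference described in the issue can be illustrated with a minimal sketch. This is not the Nutch implementation: the function names `same_except_scheme` and `pick_best` and the `prefer_https` flag are hypothetical, and only the URL-length fallback is modelled (score and fetch time are left out).

```python
from urllib.parse import urlsplit


def same_except_scheme(a: str, b: str) -> bool:
    """True if one URL is http, the other https, and all other parts match."""
    ua, ub = urlsplit(a), urlsplit(b)
    # SplitResult is a tuple: (scheme, netloc, path, query, fragment).
    return {ua.scheme, ub.scheme} == {"http", "https"} and ua[1:] == ub[1:]


def pick_best(a: str, b: str, prefer_https: bool = True) -> str:
    """Return the URL to keep among two duplicates (hypothetical helper)."""
    if prefer_https and same_except_scheme(a, b):
        # Protocol preference wins, although the http URL is shorter.
        return a if urlsplit(a).scheme == "https" else b
    # Fallback assumed here: the shorter URL wins, ties go to the first one.
    return a if len(a) <= len(b) else b


if __name__ == "__main__":
    print(pick_best("http://example.com/a", "https://example.com/a"))
    print(pick_best("http://example.com/a", "https://example.com/a",
                    prefer_https=False))
```

With the flag enabled, the https:// variant is kept; with it disabled, the sketch falls back to the length rule and keeps the http:// URL. URLs that differ in more than the protocol are always compared by length.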