[ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2683.
------------------------------------
    Resolution: Implemented

> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The deduplication job can keep the shortest URL as the "best" URL of a 
> set of duplicates, marking all longer ones as duplicates. Recently, search 
> engines started to penalize non-https pages by [giving https pages a higher 
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
> and [marking http as 
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be 
> able to prefer https:// over http:// URLs, although the latter are 
> shorter by one character. Of course, this should be configurable and apply in 
> addition to the existing preferences (length, score and fetch time) used to 
> select the "best" URL among duplicates.

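The preference described above could be sketched roughly as follows. This is
only an illustration of the idea, not the code committed for this issue: the
class name HttpsPreferenceSketch, the method names, and the preferHttps flag
are all hypothetical, and the sketch assumes URLs are already normalized to
lower-case schemes. Score and fetch time are left out; the only fallback
modeled here is URL length.

```java
// Hypothetical sketch: not the DeduplicationJob implementation.
final class HttpsPreferenceSketch {

    /** Strips a leading "http://" or "https://"; returns null for any other scheme. */
    static String withoutScheme(String url) {
        if (url.startsWith("https://")) {
            return url.substring("https://".length());
        }
        if (url.startsWith("http://")) {
            return url.substring("http://".length());
        }
        return null;
    }

    /** True if one URL is http, the other https, and the remainder is identical. */
    static boolean differOnlyByScheme(String a, String b) {
        String restA = withoutScheme(a);
        String restB = withoutScheme(b);
        return restA != null && restA.equals(restB) && !a.equals(b);
    }

    /**
     * Picks the URL to keep out of two duplicates.
     * preferHttps stands in for the configurable option (assumed name).
     */
    static String prefer(String a, String b, boolean preferHttps) {
        if (preferHttps && differOnlyByScheme(a, b)) {
            return a.startsWith("https://") ? a : b;
        }
        // Fallback: the shorter URL wins; on a tie the first argument is kept.
        return b.length() < a.length() ? b : a;
    }

    public static void main(String[] args) {
        System.out.println(prefer("http://example.com/", "https://example.com/", true));
        System.out.println(prefer("http://example.com/", "https://example.com/", false));
    }
}
```

With the flag off, the http:// variant is kept because it is one character
shorter. With the flag on, the https:// variant is kept, but only when the two
URLs are otherwise identical; any other pair still falls back to length.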


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
