Re: http vs https duplicate fetches - host-urlnormalize?

Sebastian Nagel Sat, 05 Mar 2016 12:49:31 -0800

Hi Arthur,

This problem was recently discussed in
  https://issues.apache.org/jira/browse/NUTCH-2065
and is addressed by the urlnormalizer-protocol plugin:
  https://issues.apache.org/jira/browse/NUTCH-2190


Of course, you have to decide for every host
which protocol shall be used.
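
To illustrate what "deciding per host" means, here is a minimal Python
sketch of the idea. It is not the plugin's code or its configuration
format; the mapping, the host names and the function name are all
hypothetical. Hosts without an entry are left untouched, so https-only
sites are not blocked.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical per-host decision table (host -> protocol to use).
# These entries are illustrative only, not taken from any Nutch config.
PREFERRED_PROTOCOL = {
    "example.com": "http",
    "secure.example.org": "https",
}


def normalize_protocol(url, preferred=PREFERRED_PROTOCOL):
    """Rewrite the scheme of url to the one chosen for its host.

    URLs whose host has no entry, that use another scheme, or that carry
    an explicit port are returned unchanged.
    """
    parts = urlsplit(url)
    wanted = preferred.get(parts.hostname)
    if wanted is None or parts.scheme not in ("http", "https"):
        return url
    if parts.port is not None or parts.scheme == wanted:
        # An explicit port is tied to the original scheme; leave it alone.
        return url
    return urlunsplit(
        (wanted, parts.netloc, parts.path, parts.query, parts.fragment)
    )
```

With such a table, both the http and the https variant of a page on a
listed host map to a single URL, so the page is fetched only once.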

Cheers,
Sebastian


On 03/04/2016 08:50 PM, Arthur Yarwood wrote:
> I have recently discovered that my crawl had fetched a number of sites in
> duplicate: once over http, and again over https. In the same way that one
> can add a host to the host-urlnormalize file to avoid a similar issue with
> www.example.com vs example.com urls, is there a tactic to address http vs
> https?
>
> Ideally it would always favour http over https (for efficiency), but
> without discounting https entirely when an entire host is set up to always
> serve over https, i.e. I don't really want to block all https hosts via a
> regex-urlfilter.
>
> I have worked around it to some degree via specific regex-urlfilters, but
> it would be nice if there was a global option, rather than having to tweak
> config every time I discover duplicate content in my crawl.
>
