Re: problems extracting outlinks

Sebastian Nagel Wed, 09 Aug 2017 09:48:06 -0700

Hi Carlos,

Sorry, but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT and
the following call:


$ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
  https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

Could you tell us which Nutch version you are using and which plugins are
enabled?

Thanks,
Sebastian


On 08/09/2017 12:09 PM, Carlos Pérez Miguel wrote:
> Hi,
> 
> While crawling a site, I found that the crawl stopped earlier than expected
> because many of the URLs being downloaded were of the form:
> 
> http://www.domain.com/something/"http://www.domain.com";
> 
> After reading the HTML of the pages containing those outlinks, I found that
> the outlinks are not included in the source code, so I guess there may
> be something incorrect in the page content or in the parsing done by Nutch.
> How can I find out which it is? I am a little lost with this one.
> 
> In order to see the problem:
> 
> $ bin/nutch parsechecker
> https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
> 
> And within the results we can see this particular outlink:
>  outlink: toUrl:
> https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/
> "http://www.seguroscatalanaoccidente.com"; anchor:
> www.seguroscatalanaoccidente.com
> 
> Is there any way to solve or avoid this? Maybe with the regex-urlfilter
> file?
> 
> Thanks
> 
> Carlos Pérez Miguel
>
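
On the regex-urlfilter question raised in the quoted message: the following is only a rough sketch, not Nutch code, of how a reject rule for URLs containing a double quote might behave. It assumes the usual regex-urlfilter.txt conventions ('-' rejects, '+' accepts, the first matching rule decides, a URL matching no rule is dropped). The rule set and the url_filter helper are hypothetical and exist only for illustration.

```python
import re

# Hypothetical rules in the style of conf/regex-urlfilter.txt.
# Assumption: '-' rejects, '+' accepts, and the first matching rule wins.
RULES = [
    ("-", re.compile(r'"|%22')),  # reject URLs containing a (possibly encoded) double quote
    ("+", re.compile(r".")),      # accept anything else
]


def url_filter(url):
    """Return the URL if it is accepted, or None if it is rejected or unmatched."""
    for sign, pattern in RULES:
        # search() looks for the pattern anywhere in the URL
        if pattern.search(url):
            return url if sign == "+" else None
    return None


base = "https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/"
bad = base + '"http://www.seguroscatalanaoccidente.com"'
good = base + "vida-proteccio"

print(url_filter(bad))
print(url_filter(good))
```

In the real file, the equivalent reject line would have to appear before the final accept rule. Note that a filter like this would only hide the malformed outlinks; it would not explain where they come from.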
