Hi Carlos,

sorry, but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT and the call

$ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
    https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

Could you tell us which Nutch version is used and also which plugins are enabled?

Thanks,
Sebastian

On 08/09/2017 12:09 PM, Carlos Pérez Miguel wrote:
> Hi,
>
> While crawling a site, I found that the crawl stopped earlier than expected
> because lots of the URLs being downloaded were of the form:
>
> http://www.domain.com/something/"http://www.domain.com"
>
> After reading the HTML of the pages containing those outlinks I found that
> those outlinks are not included in the source code, so I guess there may
> be something incorrect in the page content or in the parsing done by Nutch.
> How can I find out which one it is? I am a little lost with this one.
>
> In order to see the problem:
>
> $ bin/nutch parsechecker
> https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
>
> And within the results we can see this particular outlink:
> outlink: toUrl:
> https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/
> "http://www.seguroscatalanaoccidente.com" anchor:
> www.seguroscatalanaoccidente.com
>
> Is there any way to solve or avoid this? Maybe with the regex-urlfilter
> file?
>
> Thanks
>
> Carlos Pérez Miguel