Re: Not finding links when using HTTPS (httpclient)

Markus Jelsma Fri, 07 Oct 2011 02:40:13 -0700

You're using parse-html, which should extract those relative outlinks just
fine. Using protocol-httpclient should not make a difference. But to rule it
out, can you parse the page from some other location using protocol-http
instead?
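
For illustration only, here is a minimal Python sketch of why the protocol
should not matter for relative outlinks: they are resolved against the page's
base URL after the content has been fetched. This is not Nutch's parse-html
code. The class name, the helper function, and the sample HTML are all
hypothetical, and the server URL is simply the placeholder from the quoted
message.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkCollector(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative hrefs become absolute using the page's own URL.
                self.outlinks.append(urljoin(self.base_url, value))


def extract_outlinks(html, base_url):
    """Hypothetical helper: parse the HTML and return absolute outlinks."""
    collector = OutlinkCollector(base_url)
    collector.feed(html)
    return collector.outlinks


# Made-up sample page containing two relative links.
SAMPLE_HTML = '<a href="docs/page1.html">One</a> <a href="/user/other">Two</a>'

http_links = extract_outlinks(SAMPLE_HTML, "http://server/user/library")
https_links = extract_outlinks(SAMPLE_HTML, "https://server/user/library")
```

In this sketch both calls find the same two links and differ only in scheme,
so if a parser reports zero outlinks, the content it was handed is worth
checking before the link resolution itself.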


Do you have any relevant non-default settings in your configuration?
 
> Dear all,
> 
> I've been trying to crawl and index an HTTPS intranet, but the generator
> keeps saying that there are 0 links to be fetched after authenticating and
> parsing the first page. It seems that there's something wrong with the
> parser when used with HTTPS (httpclient).
> 
> Here's the command that I'm using to reproduce the error:
> 
> bin/nutch org.apache.nutch.parse.ParserChecker http://server/user/library
> 
> cmd output:  http://pastebin.com/h5e7wAZ5
> 
> hadoop.log: http://pastebin.com/S7ieS2TT (you can see the page being
> fetched, and its contents, around line 300)
> 
> Any ideas/help will be appreciated,
> 
> Alfredas
