Bug: protocol.getProtocolOutput() returns empty content when the httpclient protocol is used...
Alfredas

On Fri, Oct 7, 2011 at 11:58 AM, Alfredas Chmieliauskas <[email protected]> wrote:

> I've copied the same page on non-https location and changed
> the protocol-httpclient to protocol-http. And the parser found 18 outlinks.
> So it seems that the problem is with the httpclient...
>
> Thanks Markus,
>
> A
>
>
> On Fri, Oct 7, 2011 at 11:36 AM, Markus Jelsma <[email protected]> wrote:
>
>> You're using parse-html, it should extract those relative outlinks just fine.
>> Using protocol-httpclient should not make things different. But to rule it
>> out, can you parse the page from some other location using protocol-http
>> instead?
>>
>> Do you have any relevant non-default settings on your config?
>>
>> > Dear all,
>> >
>> > I've been trying to crawl and index a https intranet, but the generator
>> > keeps saying that there are 0 links to be fetched after authenticating and
>> > parsing the first page. It seems that there's something wrong with the
>> > parser when used with https (httpclient).
>> >
>> > here's the command that I'm using to reproduce the error:
>> >
>> > bin/nutch org.apache.nutch.parse.ParserChecker http://server/user/library
>> >
>> > cmd output: http://pastebin.com/h5e7wAZ5
>> >
>> > hadoop.log: http://pastebin.com/S7ieS2TT (you can see the page is fetched
>> > and the contents around line 300)
>> >
>> > Any ideas/help will be appreciated,
>> >
>> > Alfredas

