Re: Not finding links when using HTTPS (httpclient)

Alfredas Chmieliauskas Fri, 07 Oct 2011 07:52:56 -0700

This looks like a bug: with the httpclient protocol plugin,
protocol.getProtocolOutput() returns empty content.
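
For anyone reproducing this: the switch between the two protocol plugins described in the quoted message below is normally made through the plugin.includes property in conf/nutch-site.xml. The fragment below is only a sketch. Everything in the value other than the protocol-* entry is an assumption based on a typical default, so copy the rest from your own nutch-default.xml.

```xml
<!-- conf/nutch-site.xml: sketch only; the non-protocol plugin names are assumed defaults -->
<property>
  <name>plugin.includes</name>
  <!-- Use protocol-httpclient here for the HTTPS/authenticated crawl,
       or protocol-http for the plain-HTTP comparison run. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Running ParserChecker against the same page with each setting shows whether the empty content follows the protocol plugin rather than the parser.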


Alfredas



On Fri, Oct 7, 2011 at 11:58 AM, Alfredas Chmieliauskas <[email protected]> wrote:

> I've copied the same page to a non-HTTPS location and switched from
> protocol-httpclient to protocol-http. The parser then found 18 outlinks,
> so it seems that the problem is with httpclient...
>
> Thanks Markus,
>
> A
>
>
> On Fri, Oct 7, 2011 at 11:36 AM, Markus Jelsma <[email protected]> wrote:
>
>> You're using parse-html, so it should extract those relative outlinks just
>> fine. Using protocol-httpclient should not make a difference. But to rule
>> it out, can you parse the page from some other location using
>> protocol-http instead?
>>
>> Do you have any relevant non-default settings in your config?
>>
>> > Dear all,
>> >
>> > I've been trying to crawl and index an HTTPS intranet, but the generator
>> > keeps saying that there are 0 links to be fetched after authenticating
>> > and parsing the first page. It seems that there's something wrong with
>> > the parser when used with HTTPS (httpclient).
>> >
>> > here's the command that I'm using to reproduce the error:
>> >
>> > bin/nutch org.apache.nutch.parse.ParserChecker http://server/user/library
>> >
>> > cmd output:  http://pastebin.com/h5e7wAZ5
>> >
>> > hadoop.log: http://pastebin.com/S7ieS2TT (you can see the page is
>> > fetched and the contents around line 300)
>> >
>> > Any ideas/help will be appreciated,
>> >
>> > Alfredas
>>
>
>
