patrick peck created NUTCH-2067:
-----------------------------------

             Summary: HttpFormAuthentication unable to decode login page when 
server responds with GZIP encoding
                 Key: NUTCH-2067
                 URL: https://issues.apache.org/jira/browse/NUTCH-2067
             Project: Nutch
          Issue Type: Bug
          Components: plugin, protocol
    Affects Versions: 1.10
            Reporter: patrick peck


The method 
org.apache.nutch.protocol.httpclient.HttpFormAuthentication#httpGetPageContent(),
which is used to download the login page when doing form authentication, does 
not take into account that the response body may be gzip-encoded. The server 
is free to respond that way because the Http.configureClient() method sets the 
Accept-Encoding header to "x-gzip, gzip, deflate".
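
To illustrate the kind of handling that is missing, here is a minimal sketch 
that picks a decoder based on the Content-Encoding response header before the 
body is read as text. It uses only java.util.zip and is not the actual Nutch 
code; the class and method names (ContentDecodingSketch, decode, readBody, 
gzip) are hypothetical, and UTF-8 is assumed for the page content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of Nutch: decodes a response body
// according to the value of its Content-Encoding header.
final class ContentDecodingSketch {

    // Wraps the raw stream in a decoder matching the encoding; a null or
    // unrecognized value (e.g. "identity") leaves the stream untouched.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw;
        }
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumes the zlib-wrapped form of deflate.
                return new InflaterInputStream(raw);
            default:
                return raw;
        }
    }

    // Reads the whole (decoded) body as text; UTF-8 is an assumption here.
    static String readBody(InputStream raw, String contentEncoding) throws IOException {
        try (InputStream in = decode(raw, contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Test helper: gzip-compresses a string the way a server might.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String page = "<form action=\"/login\"></form>";
        byte[] compressed = gzip(page);
        // Reading with the encoding taken into account recovers the page.
        System.out.println(readBody(new ByteArrayInputStream(compressed), "gzip"));
    }
}
```

Reading the same bytes without looking at Content-Encoding would hand the 
compressed data to the form parser, which matches the failure described above.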

It's also not possible to override the Accept-Encoding header, since the 
default takes effect anyway. To be more exact: if you add

    <additionalPostHeaders>
      <field name="Accept-Encoding" value="identity" />
    </additionalPostHeaders>

to the configuration, the HTTP client sends out the Accept-Encoding header 
twice, first with the value configured above and then with the default 
value.
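
For comparison, the behaviour one would expect from an override can be 
sketched as a merge in which a configured header replaces the default of the 
same name instead of being sent alongside it. This is only an illustration 
in plain Java, not how Nutch or the HTTP client builds its headers; 
HeaderMergeSketch and merge are hypothetical names.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Nutch code: merges default and configured
// headers so that each header name is sent at most once.
final class HeaderMergeSketch {

    // Header names are case-insensitive, so the merged map ignores case.
    // Configured values are applied last and therefore win over defaults.
    static Map<String, String> merge(Map<String, String> defaults,
                                     Map<String, String> configured) {
        Map<String, String> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        merged.putAll(defaults);
        merged.putAll(configured);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults =
            Collections.singletonMap("Accept-Encoding", "x-gzip, gzip, deflate");
        Map<String, String> configured =
            Collections.singletonMap("Accept-Encoding", "identity");
        // A single Accept-Encoding entry remains, carrying the configured value.
        System.out.println(merge(defaults, configured));
    }
}
```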



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
