[ https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508006#comment-16508006 ]

Sebastian Nagel commented on NUTCH-2549:
----------------------------------------

Hi [~gbouchar], a PR is open to fix all sub-tasks. I've also taken your 
evilserver.py (attached to NUTCH-2561) as inspiration for unit tests. When 
run against Nutch 1.14, these fail (as expected):
{noformat}
% grep -A1 -i testcase build/protocol-http/test/TEST-org.apache.nutch.protocol.http.TestBadServerResponses.txt
Testcase: testBadHttpServer took 0.257 sec
Testcase: testNoStatusLine took 0.091 sec
        Caused an ERROR
--
Testcase: testOverlongHeader took 0.48 sec
        FAILED
--
Testcase: testContentLengthNotANumber took 0.075 sec
        Caused an ERROR
--
Testcase: testHeaderSpellChecking took 0.065 sec
        Caused an ERROR
--
Testcase: testMultiLineHeader took 0.066 sec
Testcase: testHeaderWithColon took 0.098 sec
        Caused an ERROR
--
Testcase: testChunkedContent took 0.088 sec
        FAILED
--
Testcase: testRequestNotStartingWithSlash took 0.094 sec
        FAILED
--
Testcase: testIgnoreErrorInRedirectPayload took 0.065 sec
        Caused an ERROR
{noformat}

> protocol-http does not behave the same as browsers
> --------------------------------------------------
>
>                 Key: NUTCH-2549
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2549
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>         Attachments: NUTCH-2549.patch
>
>
> We identified the following issues in protocol-http (a plugin implementing 
> the HTTP protocol):
>  * It fails if a URL's path does not start with '/'.
>  ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers 
> correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries 
> to send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
>  * It advertises its requests as being HTTP/1.0, but sends an 
> _Accept-Encoding_ request header, which is defined only in HTTP/1.1. This 
> confuses some web servers.
>  ** Example: 
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
>  * If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket 
> as soon as they can.
>  ** Example: [http://www.webarcelona.net/es/blog?page=2]
>  * Some servers invalidly send an HTTP body directly without a status line or 
> headers. Browsers handle that, protocol-http doesn't.
>  ** Example: [https://app.unitymedia.de/]
>  * Some servers invalidly add colons after the HTTP status code in the status 
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
> found_ for instance). Browsers can handle that.
>  * Some servers invalidly send headers that span multiple lines. In that 
> case, browsers simply ignore the subsequent lines, but protocol-http throws 
> an error, thus preventing us from fetching the contents of the page.
>  * There is no limit on the size of the HTTP headers it reads. A bogus 
> server could send an infinite stream of different HTTP headers and cause the 
> fetcher to run out of memory, or send the same HTTP header repeatedly and 
> cause the fetcher to time out.
>  * The same goes for the HTTP status line: no check is made concerning its 
> size.
>  * While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
> Additionally (and that concerns protocol-httpclient as well), 
> when reading HTTP headers, for each header, the SpellCheckedMetadata class 
> computes a Levenshtein distance between it and every known header in the 
> HttpHeaders interface. Not only is that slow, non-standard, and inconsistent 
> with browsers' behavior, but it also causes bugs and prevents us from accessing 
> the real headers sent by the HTTP server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, whereas it is not.
>  
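To illustrate the first point of the description (a URL whose path does not start with '/'), here is a minimal Python sketch of how a request target could be normalized before building the request line. It is not the code from the PR; the function name `request_target` is hypothetical and only the standard library is used.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path-and-query part to put after GET in the request line.

    Hypothetical helper: an empty or relative path is rewritten so that
    the request target always starts with '/', as browsers do.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if not path.startswith("/"):
        path = "/" + path
    if parts.query:
        return path + "?" + parts.query
    return path


print(request_target("http://news.fx678.com?171"))
```

With the example URL from the description this yields `/?171`, so the request line would start with `GET /?171` rather than `GET ?171`.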

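The points about missing status lines, colons after the status code, multi-line headers and unbounded header size could be handled roughly as in the following Python sketch. It is only an illustration under stated assumptions, not the Nutch implementation: the names `parse_response_head`, `MAX_LINE_LENGTH` and `MAX_HEADER_LINES` and the limit values are made up, continuation lines are assumed to start with a space or tab, and the whole response is assumed to be already buffered (a real fetcher would enforce the limits while reading from the socket).

```python
import re

MAX_LINE_LENGTH = 8192  # hypothetical per-line limit, in bytes
MAX_HEADER_LINES = 200  # hypothetical limit on the number of header lines

# Accepts "HTTP/1.1 404 Not found" and the invalid "HTTP/1.1 404: Not found".
STATUS_LINE = re.compile(rb"^HTTP/(\d)\.(\d)\s+(\d{3}):?\s*(.*)$")


def parse_response_head(raw: bytes):
    """Return (status_code, headers, body) for a buffered HTTP response.

    If the data does not start with a status line, the whole input is
    treated as the body of an implicit 200 response.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    match = STATUS_LINE.match(lines[0][:MAX_LINE_LENGTH])
    if match is None:
        # No status line or headers at all: everything is body.
        return 200, {}, raw
    if len(lines) - 1 > MAX_HEADER_LINES:
        raise ValueError("too many header lines")

    headers = {}
    for line in lines[1:]:
        if len(line) > MAX_LINE_LENGTH:
            raise ValueError("header line too long")
        if line[:1] in (b" ", b"\t"):
            # Continuation of a multi-line header: ignore it.
            continue
        name, colon, value = line.partition(b":")
        if not colon:
            continue  # not a "name: value" pair, skip it
        # Header names are kept exactly as sent (no spell correction);
        # a repeated name overwrites the earlier value in this sketch.
        headers[name.strip().decode("latin-1")] = value.strip().decode("latin-1")
    return int(match.group(3)), headers, body
```

Because header names are stored as received, a *Client-Transfer-Encoding* header stays distinct from *Transfer-Encoding* in this sketch.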

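For the chunked-content point, the intended behavior (stop once the maximum content size is reached, without starting to parse a new chunk in the middle of the previous one) can be sketched in Python as follows. This is an assumption-laden illustration on a fully buffered body, not the actual fix; `read_chunked` and its `max_content` parameter are hypothetical names standing in for the limit returned by http.getMaxContent().

```python
def read_chunked(data: bytes, max_content: int) -> bytes:
    """Decode a chunked body, returning at most max_content bytes."""
    out = bytearray()
    pos = 0
    while len(out) < max_content:
        eol = data.find(b"\r\n", pos)
        if eol < 0:
            break  # truncated input: keep what has been read so far
        # The chunk size is hexadecimal; chunk extensions after ';' are ignored.
        size = int(data[pos:eol].split(b";")[0].strip().decode("ascii"), 16)
        if size == 0:
            break  # last chunk
        start = eol + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    # Each chunk is consumed completely before the limit is applied.
    return bytes(out[:max_content])


body = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(read_chunked(body, 7))
```

Here the second chunk is read in full and the result is then truncated to 7 bytes (`b"hello w"`), so a chunk-size line is never parsed from the middle of chunk data.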

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
