[ https://issues.apache.org/jira/browse/NUTCH-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509870#comment-16509870 ]
Hudson commented on NUTCH-2563:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See
[https://builds.apache.org/job/Nutch-trunk/3534/])
NUTCH-2563 HTTP header spellchecking issues ("Client-Transfer-Encoding"
(snagel:
[https://github.com/apache/nutch/commit/381e82ff0a891d899ac8541d6a30f0d12633d247])
* (edit) src/java/org/apache/nutch/metadata/HttpHeaders.java
* (edit) src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http/TestBadServerResponses.java
* (edit) src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
> HTTP header spellchecking issues
> --------------------------------
>
> Key: NUTCH-2563
> URL: https://issues.apache.org/jira/browse/NUTCH-2563
> Project: Nutch
> Issue Type: Sub-task
> Affects Versions: 1.14
> Reporter: Gerard Bouchar
> Priority: Major
> Fix For: 1.15
>
>
> When reading HTTP headers, the SpellCheckedMetadata class computes, for each
> header, the Levenshtein distance between it and every known header in the
> HttpHeaders interface. Not only is that slow, non-standard, and inconsistent
> with browser behavior, it also causes bugs and prevents us from accessing the
> real headers sent by the HTTP server.
> * Example: [http://www.taz.de/!443358/]. The server sends a
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects
> it to *Transfer-Encoding: chunked*. HttpResponse (in protocol-http) then
> tries to read the HTTP body as chunked, although it is not.
> I personally think that HTTP header spell checking is a bad idea and that
> this logic should be removed completely. If it is kept, the threshold divider
> (SpellCheckedMetadata.TRESHOLD_DIVIDER) should be higher (we internally set
> it to 5 as a temporary fix for this issue).
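
The failure described in the issue can be reproduced with a small standalone sketch. This is not the Nutch implementation: the class name HeaderSpellCheckSketch, the short list of known headers, the normalization (lowercase, letters only), and the rule "accept a candidate when its distance is below normalized length / divider" are all assumptions made for illustration. Under those assumptions, a divider of 3 rewrites Client-Transfer-Encoding to Transfer-Encoding, while a divider of 5 leaves it alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderSpellCheckSketch {

    // Hypothetical subset of known header names (assumption, not the HttpHeaders list).
    private static final String[] KNOWN = {
        "Transfer-Encoding", "Content-Encoding", "Content-Length", "Content-Type"
    };

    // Maps the normalized form of each known header to its canonical spelling.
    private static final Map<String, String> INDEX = new LinkedHashMap<>();
    static {
        for (String name : KNOWN) {
            INDEX.put(normalize(name), name);
        }
    }

    // Assumed normalization: keep letters only, lowercased.
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the canonical name of the first known header that is "close enough",
    // or the header unchanged. The threshold rule is an assumption for this sketch.
    static String correct(String header, int divider) {
        String searched = normalize(header);
        String exact = INDEX.get(searched);
        if (exact != null) {
            return exact;
        }
        int threshold = searched.length() / divider;
        for (Map.Entry<String, String> e : INDEX.entrySet()) {
            if (levenshtein(searched, e.getKey()) < threshold) {
                return e.getValue();
            }
        }
        return header;
    }

    public static void main(String[] args) {
        // 22 normalized letters, distance 6 to "transferencoding": 6 < 22 / 3 = 7
        System.out.println(correct("Client-Transfer-Encoding", 3)); // prints "Transfer-Encoding"
        // With divider 5 the threshold is 22 / 5 = 4, so no candidate is accepted
        System.out.println(correct("Client-Transfer-Encoding", 5)); // prints "Client-Transfer-Encoding"
    }
}
```

The sketch also shows why a larger divider is only a mitigation: any header that differs from a known one by a few letters is still rewritten, so the header actually sent by the server stays hidden from the caller.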
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)