[ https://issues.apache.org/jira/browse/NUTCH-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2563:
-----------------------------------
    Fix Version/s: 1.15

> HTTP header spellchecking issues
> --------------------------------
>
>                 Key: NUTCH-2563
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2563
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
>
> When reading HTTP headers, the SpellCheckedMetadata class computes, for
> each header, the Levenshtein distance between its name and every known
> header name in the HttpHeaders interface. Not only is that slow,
> non-standard, and inconsistent with how browsers behave, it also causes
> bugs and prevents us from accessing the real headers sent by the HTTP
> server.
> * Example: [http://www.taz.de/!443358/]. The server sends a
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata
> corrects it to *Transfer-Encoding: chunked*. HttpResponse (in
> protocol-http) then tries to read the HTTP body as chunked, although it
> is not.
> I personally think that HTTP header spell checking is a bad idea, and
> that this logic should be removed completely. If it is kept, the
> threshold divider (SpellCheckedMetadata.TRESHOLD_DIVIDER) should be
> higher, so that fewer edits are tolerated (we internally set it to 5 as
> a temporary fix for this issue).
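
The miscorrection described in the issue can be reproduced with a small standalone sketch. This is not the Nutch code: the class name, the three-entry KNOWN list, the letters-only normalization, the strict "distance < length / divider" test, and the divider value of 3 used for comparison are all assumptions made for illustration. Only the value 5 and the Client-Transfer-Encoding example come from the report.

```java
public class HeaderSpellcheckSketch {

    // Hypothetical subset of known header names, for illustration only.
    static final String[] KNOWN = {"Transfer-Encoding", "Content-Type", "Content-Length"};

    // Assumed normalization: keep letters only and lowercase them,
    // so "Transfer-Encoding" becomes "transferencoding".
    public static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Two-row dynamic-programming Levenshtein distance.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the first known header that matches exactly or lies within the
    // threshold; otherwise returns the name unchanged.
    // Assumed rule: threshold = normalized length / divider, strict comparison.
    public static String correct(String name, int divider) {
        String searched = normalize(name);
        int threshold = searched.length() / divider;
        for (String known : KNOWN) {
            String k = normalize(known);
            if (searched.equals(k) || levenshtein(searched, k) < threshold) {
                return known;
            }
        }
        return name;
    }

    public static void main(String[] args) {
        // 22 letters / 3 = 7; distance to "transferencoding" is 6, so it is "corrected".
        System.out.println(correct("Client-Transfer-Encoding", 3)); // Transfer-Encoding
        // 22 letters / 5 = 4; distance 6 is no longer below the threshold.
        System.out.println(correct("Client-Transfer-Encoding", 5)); // Client-Transfer-Encoding
    }
}
```

Under these assumptions a larger divider tolerates fewer edits, which is why raising it to 5 stops this particular miscorrection. Any header within the reduced threshold would still be rewritten, so the workaround narrows the problem without removing it.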