[issue27716] http.client truncates UTF-8 encoded headers

Cory Benfield Tue, 09 Aug 2016 04:34:37 -0700

New submission from Cory Benfield:

Originally reported as Requests issue #3485: 
https://github.com/kennethreitz/requests/issues/3485


On Python 3, http.client uses the email module to parse its HTTP headers. The 
email module, for better or worse, requires that it parse headers as *text*: 
that is, that they be decoded from bytes first and then parsed.

This doesn't work for UTF-8 encoded headers. For example, the URL 
`'http://pl.bab.la/slownik/angielski-polski/'` returns the following Link 
header, encoded as UTF-8: `Link: <http://www.babla.cn/英语-波兰语/>; 
rel="alternate"; hreflang="zh-Hans", 
<http://cs.bab.la/slovnik/anglicky-polsky/>; rel="alternate"; hreflang="cs", 
<http://da.bab.la/ordbog/engelsk-polsk/>; rel="alternate"; hreflang="da", 
<http://de.bab.la/woerterbuch/englisch-polnisch/>; rel="alternate"; 
hreflang="de", <http://www.babla.gr/αγγλικα-πολωνικα/>; rel="alternate"; 
hreflang="el", <http://en.bab.la/dictionary/english-polish/>; rel="alternate"; 
hreflang="en", <http://eo.bab.la/vortaro/angla-pola/>; rel="alternate"; 
hreflang="eo", <http://es.bab.la/diccionario/ingles-polaco/>; rel="alternate"; 
hreflang="es", <http://fi.bab.la/sanakirja/englanti-puola/>; rel="alternate"; 
hreflang="fi", <http://fr.bab.la/dictionnaire/anglais-polonais/>; 
rel="alternate"; hreflang="fr", <http://www.babla.in/अंग्रेज़ी-पोलिश/>; 
rel="alternate"; hreflang="hi", <http://hu.bab.la/szótár/angol-lengyel/>; 
rel="alternate"; hreflang="hu", 
<http://www.babla.co.id/bahasa-inggris-bahasa-polandia/>; rel="alternate"; 
hreflang="id", <http://it.bab.la/dizionario/inglese-polacco/>; rel="alternate"; 
hreflang="it", <http://ja.bab.la/辞書/英語-ポーランド語/>; rel="alternate"; 
hreflang="ja", <http://www.babla.kr/영어-폴란드어/>; rel="alternate"; hreflang="ko", 
<http://nl.bab.la/woordenboek/engels-pools/>; rel="alternate"; hreflang="nl", 
<http://www.babla.no/engelsk-polsk/>; rel="alternate"; hreflang="no", 
<http://pl.bab.la/slownik/angielski-polski/>; rel="alternate"; hreflang="pl", 
<http://pt.bab.la/dicionario/ingles-polones/>; rel="alternate"; hreflang="pt", 
<http://ro.bab.la/dictionar/engleza-poloneza/>; rel="alternate"; hreflang="ro", 
<http://www.babla.ru/английский-польский/>; rel="alternate"; hreflang="ru", 
<http://sv.bab.la/lexikon/engelsk-polsk/>; rel="alternate"; hreflang="sv", 
<http://sw.bab.la/kamusi/kiingereza-kipolishi/>; rel="alternate"; 
hreflang="sw", <http://www.babla.co.th/english-polish/>; rel="alternate"; 
hreflang="th", <http://tr.bab.la/sozluk/ingilizce-lehce/>; rel="alternate"; 
hreflang="tr", <http://www.babla.vn/tieng-anh-tieng-ba-lan/>; rel="alternate"; 
hreflang="vi"`.

When the header block is decoded as ISO-8859-1, this header gets truncated, 
and parsing of the rest of the header block stops at that point. This means 
that we never see the Content-Length header, so the HTTP client waits for the 
connection to close before it considers the body terminated.
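
The report does not spell out why decoding leads to truncation. One plausible 
mechanism, offered here as an assumption rather than a confirmed diagnosis, is 
that some UTF-8 byte sequences contain the byte 0x85, which ISO-8859-1 maps to 
U+0085 (NEXT LINE), a character that `str.splitlines()` treats as a line 
boundary. The sketch below uses a shortened form of the Link header above and 
only shows the difference between text and bytes line splitting; it does not 
exercise http.client or the email module.

```python
# The character 兰 in the zh-Hans URL encodes to the bytes E5 85 B0 in UTF-8.
raw = "Link: <http://www.babla.cn/英语-波兰语/>".encode("utf-8")

# Decoding as ISO-8859-1 never fails, but byte 0x85 becomes U+0085 (NEL).
text = raw.decode("iso-8859-1")
has_nel = "\x85" in text

# str.splitlines() treats U+0085 as a line boundary and cuts the header.
str_lines = text.splitlines()

# bytes.splitlines() only splits on \r, \n and \r\n, so the header survives.
bytes_lines = raw.splitlines()

print(has_nel, len(str_lines), len(bytes_lines))
```

If this is what happens, everything after the first such byte would be handed 
to a line-oriented text parser as a separate, malformed line.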

Really the only correct fix for this is for http.client to stop insisting that 
the headers be decoded before they are parsed, and instead to decode *after*. 
That way, at least, user code can recover the headers and handle them more 
sensibly.
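
A minimal sketch of the "parse first, decode after" idea follows. The helpers 
`parse_header_block` and `decode_value` are hypothetical names made up for 
illustration; they are not part of http.client, and the UTF-8-then-ISO-8859-1 
fallback is just one policy user code might choose once it has the raw bytes.

```python
def parse_header_block(raw):
    """Hypothetical helper: split a raw header block on CRLF only,
    keeping each value as bytes so the caller decides how to decode it."""
    headers = []
    for line in raw.split(b"\r\n"):
        if not line:
            break  # a blank line ends the header block
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # this sketch simply skips malformed lines
        headers.append((name.decode("ascii"), value.strip(b" \t")))
    return headers


def decode_value(value):
    """Hypothetical policy: try UTF-8, fall back to ISO-8859-1."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return value.decode("iso-8859-1")


raw = (
    'Link: <http://www.babla.cn/英语-波兰语/>; rel="alternate"\r\n'
    "Content-Length: 42\r\n"
    "\r\n"
).encode("utf-8")

parsed = parse_header_block(raw)
decoded = {name: decode_value(value) for name, value in parsed}
print(decoded["Content-Length"])
```

Because the block is split on the bytes `\r\n` before any decoding happens, 
the Content-Length header stays visible regardless of what the Link value 
contains.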

----------
components: Library (Lib)
messages: 272236
nosy: Lukasa
priority: normal
severity: normal
status: open
title: http.client truncates UTF-8 encoded headers
versions: Python 3.5

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue27716>
_______________________________________