[issue24363] httplib fails to handle semivalid HTTP headers

2015-06-03 Thread Michael Del Monte

Michael Del Monte added the comment:

Given that obs-fold is technically valid, then can I recommend reading the 
entire header first (reading to the first blank line) and then tokenizing the 
individual headers using a regular expression rather than line by line?  That 
would solve the problem more elegantly and easily than attempting to read lines 
on the fly and then unreading the nonconforming lines.

Here's my recommendation:

def readheaders(self):
    self.dict = {}
    self.unixfrom = ''
    self.headers = hlist = []
    self.status = ''
    # read entire header (read until first blank line)
    while True:
        line = self.fp.readline(_MAXLINE+1)
        if not line:
            self.status = 'EOF in headers'
            break
        if len(line) > _MAXLINE:
            raise LineTooLong("header line")
        hlist.append(line)
        if line in ('\n', '\r\n'):
            break
        if len(hlist) > _MAXHEADERS:
            raise HTTPException("got more than %d headers" % _MAXHEADERS)
    # reproduce and parse as string
    # (each line from readline() already ends with its own newline)
    header = "".join(hlist)
    self.headers = re.findall(r"[^ \n][^\n]+\n(?: +[^\n]+\n)*", header)
    firstline = True
    for line in self.headers:
        if firstline and line.startswith('From '):
            self.unixfrom = self.unixfrom + line
            continue
        firstline = False
        if ':' in line:
            k, v = line.split(':', 1)
            # fold continuation lines into a single space
            self.addheader(k, re.sub("\n +", " ", v.strip()))
        else:
            self.status = ('Non-header line where header expected'
                           if self.dict else 'No headers')


I think this handles everything you're trying to do.  I don't understand the 
unixfrom bit, but I think I have it right.

As for Cory's concern re: smuggling, _MAXLINE and _MAXHEADERS should help with 
that.  The regexp guarantees that every line plus continuation appears as a 
single header.

I use re.sub("\n +", " ", v.strip()) to clean the value and remove the 
continuation.
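
For illustration only, here is a minimal standalone sketch of the tokenizing step described above, run on a made-up header block. The helper name split_headers and the sample data are hypothetical and not part of httplib; the sketch skips nonconforming lines rather than recording a status.

```python
import re

# Same pattern as in the proposal: a line that does not start with a space,
# followed by any number of space-indented continuation (obs-fold) lines.
HEADER_RE = re.compile(r"[^ \n][^\n]+\n(?: +[^\n]+\n)*")


def split_headers(raw):
    """Return (name, value) pairs from a raw header block (hypothetical helper)."""
    pairs = []
    for chunk in HEADER_RE.findall(raw):
        if ':' not in chunk:
            continue  # nonconforming line: skipped in this sketch
        name, value = chunk.split(':', 1)
        # Collapse each folded continuation into a single space.
        pairs.append((name, re.sub(r"\n +", " ", value.strip())))
    return pairs


raw = (
    "Content-Type: text/html\n"
    "X-Folded: first part\n"
    "   second part\n"
    "garbage line without colon\n"
    "Transfer-Encoding: chunked\n"
    "\n"
)
print(split_headers(raw))
```

The folded X-Folded header comes back as one value, and the header after the garbage line is still seen.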

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24363
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24363] httplib fails to handle semivalid HTTP headers

2015-06-03 Thread Michael Del Monte

Michael Del Monte added the comment:

... or perhaps

if ':' in line and line[0] != ':':

to avoid the colon-as-first-char bug that plagued this library earlier, though 
the only ill-effect of leaving it alone would be a header with a blank key; not 
the end of the world.
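
A small sketch of that check in isolation, assuming a hypothetical helper named split_header_line (not an httplib function). It also shows the blank key that results when the guard is left out.

```python
def split_header_line(line):
    """Return (key, value), or None when the line cannot be a header.

    Hypothetical helper for illustration; not part of httplib.
    """
    if ':' in line and line[0] != ':':
        key, value = line.split(':', 1)
        return key, value.strip()
    return None


# Without the guard, a leading colon yields an empty key.
unguarded_key = ": oops\n".split(':', 1)[0]

print(repr(unguarded_key))
print(split_header_line(": oops\n"))
print(split_header_line("Host: example.com\n"))
```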




[issue24363] httplib fails to handle semivalid HTTP headers

2015-06-02 Thread Michael Del Monte

Michael Del Monte added the comment:

I don't want to speak out of school and you guys certainly know what you're 
doing, but it seems a shame to go through these gyrations -- lookahead plus 
unreading lines -- only to preserve the ability to parse email headers, when 
HTTP really does follow a different spec.  My suggestion would be to examine 
the header and decide, if it's HTTP, to just ignore nonconforming lines; and if 
it's email, then the problem is already solved (as email doesn't have encoding 
rules that would cause problems later).  

My fear would be that you'll eventually get that nonconforming line with 
leading whitespace, which will lead right back to the same error.




[issue24363] httplib fails to handle semivalid HTTP headers

2015-06-02 Thread Michael Del Monte

New submission from Michael Del Monte:

Initially reported at https://github.com/kennethreitz/requests/issues/2622

Closely related to http://bugs.python.org/issue19996

An HTTP response with an invalid header line that contains non-blank characters 
but *no* colon (contrast http://bugs.python.org/issue19996 in which it 
contained a colon as the first character) causes the same behavior.

httplib.HTTPMessage.readheaders() oddly does not appear even to attempt to 
follow RFC 2616, which requires the header to terminate with a blank line.  The 
invalid header line, which admittedly also breaks RFC 2616, is at least 
non-blank and should not terminate the header.  Yet readheaders() takes it as 
an indicator that the header is over and then fails properly to process the 
rest of the response.

The problem is exacerbated by a chunked encoding, which will not be properly 
received if the encoding header is not seen because readheaders() terminates 
early.  An example (why are banks always the miscreants here?) is:

p = requests.get("http://www.merrickbank.com/")

My recommended fix would be to insert these lines at httplib:327

# continue reading headers on non-blank lines
elif len(line.strip()):
    continue
# break only on blank lines


This would cause readheaders() to terminate only on a non-blank non-header 
non-comment line, in accordance with RFC 2616.
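
To make the effect on chunked responses concrete, here is a simplified stand-in for the header loop. It is a sketch, not httplib's code: the function read_headers, its stop_on_bad_line parameter, and the sample response are all hypothetical.

```python
import io


def read_headers(fp, stop_on_bad_line):
    """Collect 'name: value' lines from fp (simplified stand-in, not httplib)."""
    headers = {}
    while True:
        line = fp.readline()
        if not line or not line.strip():
            break  # EOF or a blank line ends the header block
        if ':' in line:
            name, value = line.split(':', 1)
            headers[name.strip().lower()] = value.strip()
        elif stop_on_bad_line:
            break  # the early termination described in the report
        # otherwise: skip the nonconforming line and keep reading
    return headers


RESPONSE_HEAD = (
    "Content-Type: text/html\r\n"
    "this line has no colon\r\n"
    "Transfer-Encoding: chunked\r\n"
    "\r\n"
)

strict = read_headers(io.StringIO(RESPONSE_HEAD), stop_on_bad_line=True)
lenient = read_headers(io.StringIO(RESPONSE_HEAD), stop_on_bad_line=False)
print('transfer-encoding' in strict)
print('transfer-encoding' in lenient)
```

With early termination the Transfer-Encoding header is never seen, so a caller would not know the body is chunked.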

--
components: Library (Lib)
messages: 244672
nosy: mgdelmonte
priority: normal
severity: normal
status: open
title: httplib fails to handle semivalid HTTP headers
type: behavior
versions: Python 2.7




[issue24363] httplib fails to handle semivalid HTTP headers

2015-06-02 Thread Michael Del Monte

Michael Del Monte added the comment:

Thanks.  Also I meant to have said, ...to terminate only on a *blank* 
non-header non-comment line, in accordance with RFC 2616 (and 7230).

I note that the RFCs require CRLF to terminate but in my experience you can get 
all manner of blank lines, so accepting len(line.strip())==0 is going to 
accommodate servers that give CRCR or LFLF or (and I have seen this) LFCRLF.
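
A quick sketch of that blank-line test against a few terminators; the helper name is_blank and the sample list are made up for illustration.

```python
def is_blank(line):
    """True when the line holds only whitespace (CR, LF, spaces, tabs)."""
    return len(line.strip()) == 0


# Terminators a lenient reader might see. readline() splits on LF, so a
# sequence like LF CR LF would arrive as "\n" followed by "\r\n".
samples = ["\r\n", "\n", "\r", "\r\r", " \t\r\n", "X-Header: value\r\n"]
results = [is_blank(s) for s in samples]
print(results)
```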




[issue22760] re.sub does only first 16 replacements if re.S is used

2014-10-29 Thread Michael Del Monte

New submission from Michael Del Monte:

Easily reproduced:

re.sub('x', 'a', 'x'*20, re.S)

returns 'aaaaaaaaaaaaaaaaxxxx'
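
One likely explanation, sketched below: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument lands in the count slot, and re.S has the integer value 16. The variable names are illustrative only.

```python
import re

text = 'x' * 20

# re.S (re.DOTALL) is an integer flag with value 16.
flag_value = int(re.S)

# What the positional call amounts to: count=16, no flags.
as_count = re.sub('x', 'a', text, count=flag_value)

# Passing the flag by keyword replaces every match.
as_flags = re.sub('x', 'a', text, flags=re.S)

print(flag_value)
print(as_count)
print(as_flags)
```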

--
components: Regular Expressions
messages: 230216
nosy: ezio.melotti, mgdelmonte, mrabarnett
priority: normal
severity: normal
status: open
title: re.sub does only first 16 replacements if re.S is used
type: behavior
versions: Python 2.7

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22760
___