[issue23144] html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text

Ezio Melotti Sat, 07 Mar 2015 16:08:26 -0800

Ezio Melotti added the comment:

> I still think it would be worthwhile adding close() calls to
> the examples in the documentation (Doc/library/html.parser.rst).


If I add context manager support to HTMLParser I can update the examples to use 
it, but otherwise I don't think it's worth changing them now.
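
For reference, a minimal sketch of how an explicit close() can already be guaranteed today with the standard library's contextlib.closing(), which calls close() when the block exits. This is not the context manager support discussed above. TextCollector is a hypothetical subclass made up for the illustration, and the sketch assumes a Python version that includes the fix for this issue.

```python
from contextlib import closing
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that records every handle_data() call."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
# closing() calls collector.close() on exit, which flushes any text
# the parser was still holding back.
with closing(collector) as parser:
    parser.feed("<p>fish &amp; chips</p>")
    # Trailing text with an unterminated '&' stays buffered until close().
    parser.feed("AT&T")

text = "".join(collector.chunks)
```

Without the close() call, the trailing "AT&T" would stay in the parser's buffer and never reach handle_data().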

> BTW I haven’t tested this, and maybe it is not a concern, but even with
> this patch it looks like the parser will buffer unlimited data and
> output nothing until close() if each string it is fed ends with an 
> ampersand (and otherwise contains only plain text, no tags etc).

This is true, but I don't think it's a realistic case.
For this to be a problem you would need:
1) Someone feeding the parser with arbitrary chunks.  Text files are usually 
fed to the parser whole, or line by line -- arbitrary chunks are uncommon.
2) A file that contains a lot of entities.  In most documents charrefs are not 
very common, so the chances that a chunk will split one in the middle are 
low.  The chances that several consecutive charrefs are split in the middle are 
even lower.
3) A file that is very big.  Even if the whole file is buffered until a call to 
close(), it shouldn't be a concern, since most files are relatively small.  
It is true that this has quadratic complexity, but I would expect the 
parsing to complete in a reasonable time for average sizes.
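
The buffering case described in the quoted comment can be reproduced with a short sketch like the one below. It assumes a Python version that includes the fix for this issue. DataRecorder and the chunk contents are made up for the illustration. Each chunk is plain text ending in an ampersand, so the parser cannot tell whether a charref is about to follow.

```python
from html.parser import HTMLParser


class DataRecorder(HTMLParser):
    """Hypothetical helper that records every handle_data() call."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.received = []

    def handle_data(self, data):
        self.received.append(data)


parser = DataRecorder()
chunks = ["a &", " b &", " c &"]  # plain text, each chunk ends with '&'

for chunk in chunks:
    parser.feed(chunk)

# Nothing has been reported yet: every chunk might end in a split charref.
reported_before_close = list(parser.received)

parser.close()  # forces the buffered text through handle_data()
reported_after_close = "".join(parser.received)
```

Before close() the parser has reported nothing. After close() the whole text arrives at once, which is why the amount of buffered data is bounded only by the input size in this pattern.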

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue23144>
_______________________________________