[issue25239] HTMLParser handle_starttag replaces entity references in attribute value even without semicolon

2015-09-26 Thread Serhiy Storchaka

Changes by Serhiy Storchaka:


--
nosy: +ezio.melotti

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25239] HTMLParser handle_starttag replaces entity references in attribute value even without semicolon

2015-09-26 Thread Sean Liu

New submission from Sean Liu:

In the documentation of HTMLParser.handle_starttag, it states "All entity references 
from html.entities are replaced in the attribute values." However, it also 
replaces an ampersand followed by an entity name even when the 
semicolon is missing.

For example, <a href="t=buy&currency=usd">foo</a> will produce "t=buy¤cy=usd" 
as the value of the href attribute, because "curren" is the entity name for the 
currency sign.
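
A minimal reproduction sketch follows. It is not the attached parserentity.py; the AttrCollector class and parse_attrs helper are hypothetical names, and the markup assumes the example anchor carried the href value "t=buy&currency=usd". The printed href may differ between Python versions, depending on whether the interpreter applies the attribute-specific rule.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, with character
        # references in the values already replaced by the parser.
        self.attrs_seen.append((tag, attrs))


def parse_attrs(markup):
    """Feed markup to the parser and return the recorded start tags."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs_seen


if __name__ == "__main__":
    # Assumed input: an unescaped "&" followed by "currency", whose
    # prefix "curren" is a named character reference.
    print(parse_attrs('<a href="t=buy&currency=usd">foo</a>'))
```

If the href comes back containing the currency sign, the interpreter shows the behavior reported here.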

--
components: Library (Lib)
files: parserentity.py
messages: 251654
nosy: Sean Liu
priority: normal
severity: normal
status: open
title: HTMLParser handle_starttag replaces entity references in attribute value 
even without semicolon
type: behavior
versions: Python 3.4
Added file: http://bugs.python.org/file40588/parserentity.py




[issue25239] HTMLParser handle_starttag replaces entity references in attribute value even without semicolon

2015-09-26 Thread Ezio Melotti

Ezio Melotti added the comment:

This seems indeed to be a bug.  The relevant bit seems to be at 
http://www.w3.org/TR/html5/syntax.html#consume-a-character-reference :

"""
If the character reference is being consumed as part of an attribute, and the 
last character matched is not a ";" (U+003B) character, and the next character 
is either a "=" (U+003D) character or an alphanumeric ASCII character, then, 
for historical reasons, all the characters that were matched after the U+0026 
AMPERSAND character (&) must be unconsumed, and nothing is returned. However, 
if this next character is in fact a "=" (U+003D) character, then this is a 
parse error, because some legacy user agents will misinterpret the markup in 
those cases.
"""

Off the top of my head, this paragraph is not implemented in HTMLParser (and it 
should be).
Also note that <a href="t=buy&currency=usd">foo</a> is not valid HTML and 
the & should have been escaped as &amp;.
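
The attribute rule quoted from the specification can be sketched roughly as below. This is only an illustration, not the HTMLParser implementation or a proposed patch: unescape_attr is a hypothetical helper, it handles named references only (numeric ones are left untouched), and it relies on html.entities.html5, whose keys include legacy names both with and without the trailing semicolon.

```python
import re
from html.entities import html5

# "&" + a run of ASCII alphanumerics + an optional ";".
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def unescape_attr(value):
    """Hypothetical sketch: replace named references in an attribute value,
    leaving a semicolon-less match alone when it is followed by "=" or an
    ASCII alphanumeric character."""

    def replace(match):
        name, semi = match.group(1), match.group(2)
        if semi and name + ";" in html5:
            # Properly terminated reference: always replaced.
            return html5[name + ";"]
        # Otherwise look for the longest legacy (semicolon-less) name
        # that is a prefix of the matched run.
        for end in range(len(name), 0, -1):
            if name[:end] in html5:
                if end < len(name):
                    # Next character is alphanumeric: unconsume everything.
                    return match.group(0)
                if value[match.end():match.end() + 1] == "=":
                    # Next character is "=": also left as-is.
                    return match.group(0)
                return html5[name[:end]]
        return match.group(0)

    return _NAMED_REF.sub(replace, value)


if __name__ == "__main__":
    for sample in ("t=buy&currency=usd",
                   "t=buy&amp;currency=usd",
                   "t=buy&curren;cy=usd"):
        print(repr(sample), "->", repr(unescape_attr(sample)))
```

Under this sketch the unescaped "&currency" survives unchanged, while the correctly escaped "&amp;currency" yields a literal "&currency".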

--
assignee:  -> ezio.melotti
stage:  -> test needed
versions: +Python 2.7, Python 3.5, Python 3.6
