[issue41748] HTMLParser: parsing error

2021-01-03 Thread karl

karl  added the comment:

Ezio,

TL;DR: I tested this in browsers and added two tests for the issue. 
   Should I create a PR just for the tests?

https://github.com/python/cpython/blame/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/test/test_htmlparser.py#L479-L485


A: comma without spaces
---


Tests for browsers:
data:text/html,text

Serializations:
* Firefox, Gecko (86.0a1 (2020-12-28) (64-bit)) 
* Edge, Blink (Version 89.0.752.0 (Version officielle) Canary (64 bits))
* Safari, WebKit (Release 117 (Safari 14.1, WebKit 16611.1.7.2))

Same serialization in these 3 rendering engines
text


Adding:

def test_comma_between_unquoted_attributes(self):
    # bpo 41748
    self._run_check('',
                    [('starttag', 'div', [('class', 'bar,baz=asd')])])


❯ ./python.exe -m test -v test_htmlparser

…
test_comma_between_unquoted_attributes 
(test.test_htmlparser.HTMLParserTestCase) ... ok
…

Ran 47 tests in 0.168s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 369 ms
Tests result: SUCCESS


So this is working as expected for the first test.
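
For readers who want to reproduce case A outside the CPython test suite, here is a minimal sketch. The markup string is an assumption chosen to match the expected events in the test above, and EventCollector and parse_events are hypothetical helper names, not part of html.parser.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start tags and text so the parse result can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(source):
    """Feed one complete document and return the collected events."""
    parser = EventCollector()
    parser.feed(source)
    parser.close()
    return parser.events


# Assumed input: an unquoted value containing a comma, with no whitespace.
events = parse_events("<div class=bar,baz=asd>")
print(events)
```

With no whitespace before the comma, the whole run up to ">" should end up in a single unquoted attribute value.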


B: comma with spaces
---


Tests for browsers:
data:text/html,text

Serializations:
* Firefox, Gecko (86.0a1 (2020-12-28) (64-bit)) 
* Edge, Blink (Version 89.0.752.0 (Version officielle) Canary (64 bits))
* Safari, WebKit (Release 117 (Safari 14.1, WebKit 16611.1.7.2))

Same serialization in these 3 rendering engines
text


Adding
def test_comma_with_space_between_unquoted_attributes(self):
    # bpo 41748
    self._run_check('',
                    [('starttag', 'div', [
                        ('class', 'bar'),
                        (',baz', 'asd')])])


❯ ./python.exe -m test -v test_htmlparser


This is failing.

==
FAIL: test_comma_with_space_between_unquoted_attributes (test.test_htmlparser.HTMLParserTestCase)
--
Traceback (most recent call last):
  File "/Users/karl/code/cpython/Lib/test/test_htmlparser.py", line 493, in test_comma_with_space_between_unquoted_attributes
    self._run_check('',
  File "/Users/karl/code/cpython/Lib/test/test_htmlparser.py", line 95, in _run_check
    self.fail("received events did not match expected events" +
AssertionError: received events did not match expected events
Source:
''
Expected:
[('starttag', 'div', [('class', 'bar'), (',baz', 'asd')])]
Received:
[('data', '')]

--
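
A standalone way to observe case B, as a sketch only. The markup string is an assumption derived from the expected events above, and AttrCollector is a hypothetical helper. On an interpreter that still has the comma handling described in this thread, the tag is expected to surface as plain data instead of a start tag.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect start tags (with attributes) and any text that falls through."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))


# Assumed input: whitespace before the comma, so per the HTML5 tokenizer
# ",baz" should begin a new attribute name.
parser = AttrCollector()
parser.feed("<div class=bar ,baz=asd>")
parser.close()  # flush anything the parser was still buffering
print(parser.events)
```

Comparing the printed events against the Expected and Received lists in the failure above shows which behavior a given interpreter has.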


I started to look into the code of parser.py, which I'm not (yet) familiar with.

https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/html/parser.py#L42-L52

Do you have a suggestion to fix it?

--
nosy: +karlcow

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41748] HTMLParser: parsing error

2020-09-09 Thread Ezio Melotti


Ezio Melotti  added the comment:

The html.parser follows the HTML 5 specs as closely as possible.  There are a 
few corner cases where it behaves slightly differently but it's only while 
dealing with invalid markup, and the differences should be trivial and 
generally not worth the extra complexity to deal with them.

In this case, if I recall correctly, the way the comma is handled is just a 
left-over from the previous version of the parser, that predates the HTML 5 
specs.  In tags like  there was an ambiguous situation and 
parsing it  was deemed a reasonable interpretation, so 
the comma was treated as an attribute separator (and there should be test cases 
for this).

This likely caused the issue reported by the OP, and I think it should be 
fixed, even if technically it's a change in behavior and will break some of the 
tests.

If I'm reading the specs[0] correctly:
*  should be parsed as , and
*  should be parsed as , where 
',baz' is the attribute name


> Also, there is no warning about security in the html.parser documentation?

I'm not aware of any specific security issues, since html.parser just 
implements the parser described by the HTML 5 specs.  If there are any security 
issues caused by divergences from the specs, they should be fixed.  I'm not 
sure why a warning would be needed.

> Is this module mature and maintained enough to be considered as reliable?

Even though it hasn't been updated to the latest version of the specs (5.2 at 
the time of writing), it has been updated to implement the parsing rules 
described by the HTML 5 specs.  I don't know if the parsing rules changed 
between 5.0 and 5.2.

> Or should we warn users about possible issues on corner cases, and point to 
> BeautifulSoup for a more mature HTML parser?

BeautifulSoup is built on top of html.parser (and can also use other parsers, 
like lxml).  BS uses the underlying parsers to parse the HTML, then builds the 
tree and provides, among other things, functions to search and edit it.
When I upgraded html.parser to HTML 5 I worked with the author of BeautifulSoup 
(Leonard Richardson), to make sure that my changes were compatible with BS. We 
also discussed some corner cases he found and other feature requests and 
issues he had with the old version of the parser.  That said, a link to BS 
would be a nice addition, since it's a great library.


[0] starting from 
https://www.w3.org/TR/html52/syntax.html#tokenizer-before-attribute-name-state

--
nosy: +ezio.melotti
stage:  -> test needed




[issue41748] HTMLParser: parsing error

2020-09-09 Thread STINNER Victor


STINNER Victor  added the comment:

Also, there is no warning about security in the html.parser documentation? Is 
this module mature and maintained enough to be considered as reliable? Or 
should we warn users about possible issues on corner cases, and point to 
BeautifulSoup for a more mature HTML parser?

https://docs.python.org/dev/library/html.parser.html

--




[issue41748] HTMLParser: parsing error

2020-09-09 Thread STINNER Victor


STINNER Victor  added the comment:

HTMLParser.check_for_whole_start_tag() uses locatestarttagend_tolerant regular 
expression to find the end of the start tag. This regex cuts the string at the 
first comma (","), but not if the comma is the first character of an attribute 
name

* '' => '' => ' BUG

The regex is quite complex:

locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
""", re.VERBOSE)
endendtag = re.compile('>')

The problem is this part of the regex:

#(?:\s*,)*   # possibly followed by a comma

The comma is not seen as part of the attribute name.
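
To make the effect of that clause concrete, here is a small sketch that compiles the pattern quoted above and a variant without the comma clause, then reports where each match stops. The sample tag is an assumed input, and the "relaxed" variant is only a hypothetical experiment, not necessarily the change that should be made.

```python
import re

# Copy of the pattern quoted in this message.
LEGACY_SOURCE = r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
"""
COMMA_CLAUSE = r"(?:\s*,)*"

legacy = re.compile(LEGACY_SOURCE, re.VERBOSE)
# Hypothetical variant: the same pattern with the comma clause removed.
relaxed = re.compile(LEGACY_SOURCE.replace(COMMA_CLAUSE, ""), re.VERBOSE)

sample = "<div class=bar ,baz=asd>"  # assumed input
for label, pattern in (("legacy", legacy), ("relaxed", relaxed)):
    end = pattern.match(sample).end()
    # Show the matched prefix and the character the parser would look at next.
    print(label, repr(sample[:end]), "next char:", repr(sample[end:end + 1]))
```

With the comma clause, the match consumes " ," after the value and then stops, because the lookbehind for the next attribute name sees "," instead of whitespace. Without the clause, the match runs up to the closing ">".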

--




[issue41748] HTMLParser: parsing error

2020-09-09 Thread Ademar Nowasky Junior


Ademar Nowasky Junior  added the comment:

Yes, I understand that in the same way. Both are valid attr names. 

Maybe it's worth noting that JavaScript has no problem handling this.

--




[issue41748] HTMLParser: parsing error

2020-09-09 Thread STINNER Victor


STINNER Victor  added the comment:

HTML 5.2 specification says
https://www.w3.org/TR/html52/syntax.html#elements-attributes

"Attribute names must consist of one or more characters other than the space 
characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), 
U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) 
characters, the control characters, and any characters that are not defined by 
Unicode."

It is confirmed in the "12.2 Parsing HTML documents" section of the "HTML 
Living Standard":
"""
12.2.5.33 Attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
U+002F SOLIDUS (/)
U+003E GREATER-THAN SIGN (>)
EOF
    Reconsume in the after attribute name state.
U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.
ASCII upper alpha
    Append the lowercase version of the current input character (add 0x0020 to
    the character's code point) to the current attribute's name.
U+0000 NULL
    This is an unexpected-null-character parse error. Append a U+FFFD
    REPLACEMENT CHARACTER character to the current attribute's name.
U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003C LESS-THAN SIGN (<)
    This is an unexpected-character-in-attribute-name parse error. Treat it as
    per the "anything else" entry below.
Anything else
    Append the current input character to the current attribute's name.
"""
https://html.spec.whatwg.org/multipage/parsing.html#attribute-name-state

I understand that "," *is* a legit character in a HTML attribute name. So "a," 
and ",a" *are* valid HTML attribute names. Do I understand correctly?
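
As a sanity check of that reading, the attribute name state quoted above can be sketched as a small scanner. This is only an illustration of that one state, not a full tokenizer; scan_attribute_name is a hypothetical helper and parse errors are not reported.

```python
# Characters that end the attribute name (reconsumed in the next state).
TERMINATORS = {"\t", "\n", "\f", " ", "/", ">"}


def scan_attribute_name(text, pos=0):
    """Return (name, next_pos) following the quoted attribute name state."""
    name = []
    while pos < len(text):  # running out of input plays the role of EOF
        ch = text[pos]
        if ch in TERMINATORS or ch == "=":
            break
        if "A" <= ch <= "Z":
            name.append(ch.lower())   # ASCII upper alpha is lowercased
        elif ch == "\x00":
            name.append("\ufffd")     # NULL becomes REPLACEMENT CHARACTER
        else:
            name.append(ch)           # "anything else", including ","
        pos += 1
    return "".join(name), pos


for sample in ("a,=1", ",a=1", "CLASS>"):
    print(repr(sample), "->", scan_attribute_name(sample))
```

A comma never appears among the terminating characters, so it falls under "anything else" and is appended to the name wherever it occurs.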

--
nosy: +vstinner




[issue41748] HTMLParser: parsing error

2020-09-09 Thread Ademar Nowasky Junior


Change by Ademar Nowasky Junior :


--
type: security -> crash




[issue41748] HTMLParser: parsing error

2020-09-08 Thread Ademar Nowasky Junior


Change by Ademar Nowasky Junior :


--
type: crash -> security




[issue41748] HTMLParser: parsing error

2020-09-08 Thread Ademar Nowasky Junior


New submission from Ademar Nowasky Junior :

HTML tags that have an attribute name starting with a comma character aren't 
parsed, and they break future calls to feed(). 

The problem occurs when such an attribute is the second one or later in the HTML 
tag. It doesn't seem to occur when it's the first attribute.

#POC:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

parser = MyHTMLParser()

#This is ok
parser.feed('')

#This breaks
parser.feed('')

#Future calls to feed() will not work
parser.feed('')
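
A variant of the same idea that records what each feed() call produced, as a rough sketch. The three markup strings are assumptions (a trailing comma in an attribute name, a leading comma, then an unrelated tag), and TagLogger is a hypothetical helper.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Remember each start tag together with its attribute names."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, [name for name, _ in attrs]))


parser = TagLogger()
chunks = [
    '<a id="x" b,="">',   # assumed: comma at the end of an attribute name
    '<a id="x" ,b="">',   # assumed: comma at the start of an attribute name
    '<img src="y">',      # assumed: later, unrelated tag
]
for chunk in chunks:
    parser.feed(chunk)
    # Print the running total so a stalled parser is easy to spot.
    print(repr(chunk), "->", len(parser.tags), "start tag(s) seen so far")
parser.close()
```

If the count stops increasing after the second chunk, the parser is treating that tag as incomplete and buffering everything that follows.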

--
components: Library (Lib)
messages: 376607
nosy: nowasky.jr
priority: normal
severity: normal
status: open
title: HTMLParser: parsing error
type: crash
versions: Python 3.8
