New submission from flying sheep:
hi, i have an idea on how to make an internal change to html.parser.HTMLParser,
which would expose a token generator interface.
after that, we would be able to do e.g. list(HTMLParser().tokenize(data)) or
even
    parser = HTMLParser()
    for chunk in pipe_in_html():
        yield from parser.tokenize(chunk)
---
the changes affect exclusively HTMLParser’s methods and would unfortunately
require a behavior change to most (internal) parse_* methods. the changes go as
follows:
1. the tokenize(data=None, end=False) method is added. it mainly contains
goahead’s body, with a prepended snippet that appends the passed data to
rawdata, and with all handle_* calls changed to "yield token, data".
2. all parse_* methods which returned an int and called one handle_* method are
changed to return an (int, token) tuple (so that tokenize can yield the tokens)
3. goahead is changed to a skeleton implementation that traverses the tokens
produced by tokenize, so its own behavior does not change.
the only visible differences would be the changed behavior of the parse_*
methods and the addition of the tokenize method: the tokens are discarded if
goahead, feed, or close are called. (this can of course be changed if advisable)
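
to make the intended interface concrete, here is a rough sketch that emulates tokenize() on top of the public handle_* hooks. it is not the proposed patch: the proposal changes goahead and the parse_* methods internally, while this version only collects tokens in a list during feed(). the class name TokenizingParser and the _tokens attribute are made up for illustration, and only four token kinds are covered.

```python
from html.parser import HTMLParser


class TokenizingParser(HTMLParser):
    """Hypothetical sketch: emulates the proposed tokenize() via the public API."""

    def __init__(self):
        super().__init__()
        self._tokens = []  # tokens collected since the last tokenize() call

    # Token names mirror the handle_* method names, as in the proposal.
    def handle_starttag(self, tag, attrs):
        self._tokens.append(("starttag", (tag, attrs)))

    def handle_endtag(self, tag):
        self._tokens.append(("endtag", tag))

    def handle_data(self, data):
        self._tokens.append(("data", data))

    def handle_comment(self, data):
        self._tokens.append(("comment", data))

    def tokenize(self, data=None, end=False):
        # Same signature as in the proposal: optionally append data,
        # optionally finish parsing, then yield the resulting tokens.
        if data is not None:
            self.feed(data)
        if end:
            self.close()
        tokens, self._tokens = self._tokens, []
        yield from tokens


if __name__ == "__main__":
    for token, data in TokenizingParser().tokenize("<p>hi</p>", end=True):
        print(token, data)
```

unlike the real proposal, this sketch parses a whole chunk before yielding anything, so it is lazy only per chunk, not per token.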
---
since this is my first contribution, i’m unsure whether i should already attach
the patch, not knowing if the changes to the internal parse_* methods are
acceptable at all. what do you say?
PS: the tokens are named like the handle_* methods, and the current goahead
implementation basically calls getattr(self, 'handle_' + token)(data) for each
(token, data) tuple. This can be changed to a token: method dict or a classic
“switch” elif stack.
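
the two dispatch variants mentioned in the PS could look roughly like this. the function names dispatch and dispatch_with_table and the Recorder class are made up for illustration, and the sketch assumes every handler takes a single data argument (handle_starttag takes two, so its data would need unpacking).

```python
from html.parser import HTMLParser


def dispatch(parser, tokens):
    """Call parser.handle_<token>(data) for each (token, data) pair."""
    for token, data in tokens:
        getattr(parser, "handle_" + token)(data)


def dispatch_with_table(parser, tokens):
    """Same dispatch, but through an explicit token -> method dict."""
    table = {
        "data": parser.handle_data,
        "comment": parser.handle_comment,
        "endtag": parser.handle_endtag,
    }
    for token, data in tokens:
        table[token](data)


class Recorder(HTMLParser):
    """Hypothetical subclass that records which handlers were called."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_data(self, data):
        self.seen.append(("data", data))

    def handle_comment(self, data):
        self.seen.append(("comment", data))
```

the getattr variant picks up any handle_* method automatically, while the dict variant fails with a KeyError on unknown token names instead of an AttributeError.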
----------
messages: 184096
nosy: flying sheep
priority: normal
severity: normal
status: open
title: Generator-based HTMLParser
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue17410>
_______________________________________