Thomas Broyer wrote:
> 2007/6/5, Sam Ruby:
>> I'd like to repackage the sanitizer, highlighter, and serializer filters
>> (like the optional tag filter) as standalone filters. These could be
>> used independent of the builder or serializer used. Each filter can
>> become an option of parse.
>
> Wrt what you started to do in ruby, I'd like to go a little further:
> refactor the HTMLParser API so that the tokenizer passed is an
> instance rather than a class (and then add convenience functions such
> as parse and parseFragment). This would allow passing an instance of a
> HTMLTokenizer or a treewalker (useful to convert a tree from DOM to
> ElementTree for example), proxied by an HTMLSanitizer and/or the
> OptionalTagFilter (might be useful for tests) and/or highlighter, etc.
I've been trying to keep the ruby and python versions roughly in sync...
> HTMLParser and HTMLTokenizer would need to be refactored a bit so that
> HTMLParser doesn't assume the tokenizer is an HTMLTokenizer (e.g. doesn't
> access tokenizer.stream directly but uses methods such as
> tokenizer.position()).
Yes.
> I haven't thought about it much so I don't know if it's doable, but it
> would be cool (there might be a problem with the HTMLParser setting
> the tokenizer's contentModelFlag, when the tokenizer is not an
> HTMLTokenizer).
The easiest way to proceed is to:
1) have the _parse method check for hasattr(stream, 'contentModelFlag'),
and if so, use the stream as the tokenizer; otherwise construct
a tokenizer for the stream.
2) define a base class for filters that has a constructor which
accepts a stream, and default implementations of contentModelFlag
and position which simply proxy/forward the calls onto the stream.
At the moment, HTMLSanitizer is a subclass of HTMLTokenizer, so the
forwarding is not required, but changing this to a mechanism based on
forwarding allows one to construct a pipe with an arbitrary number of
filters in any order that you like.
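A rough sketch of what the two steps above might look like in Python. All of the names here (Filter, FakeTokenizer, DropComments, as_tokenizer) are hypothetical and only for illustration; none of them is existing html5lib API, and FakeTokenizer merely stands in for HTMLTokenizer.

```python
class Filter(object):
    """Hypothetical base class: wraps a token source and forwards to it."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        # Default behaviour: pass every token through unchanged.
        return iter(self.source)

    def position(self):
        # Forward to the wrapped tokenizer or filter.
        return self.source.position()

    @property
    def contentModelFlag(self):
        return self.source.contentModelFlag

    @contentModelFlag.setter
    def contentModelFlag(self, value):
        # The parser sets the flag on the outermost filter; it is passed
        # down the chain until it reaches the real tokenizer.
        self.source.contentModelFlag = value


class FakeTokenizer(object):
    """Stand-in for HTMLTokenizer (assumption, for this sketch only)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.contentModelFlag = "PCDATA"
        self._pos = 0

    def __iter__(self):
        for token in self.tokens:
            self._pos += 1
            yield token

    def position(self):
        return self._pos


class DropComments(Filter):
    """Example filter: removes comment tokens from the stream."""

    def __iter__(self):
        for token in self.source:
            if token["type"] != "Comment":
                yield token


def as_tokenizer(stream, tokenizer_factory):
    # Step 1: anything exposing contentModelFlag is treated as a tokenizer;
    # anything else gets a tokenizer constructed around it.
    if hasattr(stream, "contentModelFlag"):
        return stream
    return tokenizer_factory(stream)
```

Because every filter only talks to its source through iteration, position() and contentModelFlag, filters can be stacked in any order, e.g. DropComments(Filter(FakeTokenizer(tokens))).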
I'd also like to see all filters moved into a "filters" directory/module.
>> I'd like the input stream to actually stream for the common use cases of
>> windows-1252 or utf-8 input.
>
> Wrt r674, why not use a codecs.StreamReader?
>
> Also, re "unreading" chars, maybe we could follow a pattern similar to
> Java-IO's mark()/reset().
> In Twintsam, I used PeekChar/PeekChars and ReadChar/ReadChars methods
> instead of "unreading" chars
> <http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs>
> There's also a SkipChars method, identical to ReadChars but
> "optimized" because it does not collect chars into a buffer.
>
> This could be used for encoding detection too (especially for
> non-seekable streams): use the internal buffer (queue) for bytes read
> from rawStream when detecting the encoding; then use it for chars read
> from the decoded stream when parsing.
>
> Finally, HTMLInputStream.reset() should call rawStream.seek(0), which
> means it would be only available if the stream is seekable (if you
> want to reparse a non-seekable stream, it's up to you to buffer it).
>
> Any thoughts?
All sounds good, but I'd like to get rid of the HTMLInputStream reset
method entirely.
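For reference, the peek/read/skip pattern Thomas describes could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the CharReader class and its method names are hypothetical, it is not the Twintsam code and not the current HTMLInputStream, and it reads from any iterable of already-decoded string chunks.

```python
from collections import deque


class CharReader(object):
    """Hypothetical reader with an internal queue instead of 'unread'."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)   # source of decoded text chunks
        self._queue = deque()         # chars fetched but not yet consumed

    def _fill(self, count):
        # Pull chunks until 'count' chars are queued or the source ends.
        while len(self._queue) < count:
            try:
                chunk = next(self._chunks)
            except StopIteration:
                break
            self._queue.extend(chunk)

    def peek_chars(self, count):
        # Look ahead without consuming anything.
        self._fill(count)
        return "".join(list(self._queue)[:count])

    def read_chars(self, count):
        # Consume and return up to 'count' chars.
        self._fill(count)
        n = min(count, len(self._queue))
        return "".join(self._queue.popleft() for _ in range(n))

    def skip_chars(self, count):
        # Like read_chars, but without building a result string.
        self._fill(count)
        n = min(count, len(self._queue))
        for _ in range(n):
            self._queue.popleft()
        return n
```

Nothing here needs seek() or reset(): the source is only ever read forwards, so it would also work for non-seekable streams.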
The parse.py and parse.rb programs should accept a '-' as a filename,
and interpret that as meaning stdin.
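Something along these lines would do for parse.py. The open_input helper is a hypothetical name, not something that exists in parse.py today, and the optional stdin argument is only there to make the sketch easy to exercise.

```python
import sys


def open_input(filename, stdin=None):
    """Return a binary file object for 'filename'; '-' means stdin."""
    if filename == "-":
        if stdin is not None:
            return stdin
        # On Python 3 the raw bytes live on sys.stdin.buffer; fall back
        # to sys.stdin itself where no such attribute exists.
        return getattr(sys.stdin, "buffer", sys.stdin)
    return open(filename, "rb")
```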
Is there a reason why twintsam isn't simply placed in html5lib? I'd
love to see a set of html5 implementations all sharing a common test
suite base (augmented by language specific additions, where
appropriate). C# would be a reasonable next language to tackle.
- Sam Ruby