Re: Minor html5lib refactoring

Sam Ruby Thu, 07 Jun 2007 05:06:52 -0700

Thomas Broyer wrote:
> 2007/6/5, Sam Ruby:
>> I'd like to repackage the sanitizer, highlighter, and serializer filters
>> (like the optional tag filter) as standalone filters.  These could be
>> used independent of the builder or serializer used.  Each filter can
>> become an option of parse.
> 
> Wrt what you started to do in ruby, I'd like to go a little further:
> refactor the HTMLParser API so that the tokenizer passed is an
> instance rather than a class (and then add convenience functions such
> as parse and parseFragment). This would allow passing an instance of a
> HTMLTokenizer or a treewalker (useful to convert a tree from DOM to
> ElementTree for example), proxied by an HTMLSanitizer and/or the
> OptionalTagFilter (might be useful for tests) and/or highlighter, etc.


I've been trying to keep the ruby and python versions roughly in sync...

> HTMLParser and HTMLTokenizer would need to be refactored a bit so that
> HTMLParser doesn't assume the tokenizer is an HTMLTokenizer (e.g. doesn't
> access tokenizer.stream directly but use methods such as
> tokenizer.position()).

Yes.

> I haven't thought about it much so I don't know if it's doable, but it
> would be cool (there might be a problem with the HTMLParser setting
> the tokenizer's contentModelFlag, when the tokenizer is not an
> HTMLTokenizer).

The easiest way to proceed is to:
   1) have the _parse method check for hasattr(stream, 'contentModelFlag'),
      and if so, use the stream as the tokenizer; otherwise construct
      a tokenizer for the stream.
   2) define a base class for filters that has a constructor which
      accepts a stream, and default implementations of contentModelFlag
      and position which simply proxy/forward the calls onto the stream.

At the moment, HTMLSanitizer is a subclass of HTMLTokenizer, so the 
forwarding is not required, but changing this to a mechanism based on 
forwarding allows one to construct a pipe with an arbitrary number of 
filters in any order that you like.
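
A rough Python sketch of the two steps above, under stated assumptions: Filter, DropComments, FakeTokenizer and as_tokenizer are hypothetical names invented for illustration, not existing html5lib classes, and tokens are assumed to be plain dicts with a "type" key.

```python
class Filter:
    """Hypothetical base class: wraps any token source and forwards to it."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        # Default behaviour: pass every token through unchanged.
        return iter(self.source)

    @property
    def contentModelFlag(self):
        # Forward reads to the wrapped source.
        return self.source.contentModelFlag

    @contentModelFlag.setter
    def contentModelFlag(self, value):
        # Forward writes too, so the parser can set the flag through
        # any number of stacked filters.
        self.source.contentModelFlag = value

    def position(self):
        return self.source.position()


class DropComments(Filter):
    """Example filter (assumption): removes comment tokens."""

    def __iter__(self):
        for token in self.source:
            if token["type"] != "Comment":
                yield token


class FakeTokenizer:
    """Stand-in for HTMLTokenizer, only so this sketch is self-contained."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.contentModelFlag = "PCDATA"

    def __iter__(self):
        return iter(self.tokens)

    def position(self):
        return (1, 0)


def as_tokenizer(stream):
    # Step 1: anything exposing contentModelFlag is used as the tokenizer;
    # anything else gets a tokenizer constructed around it.
    if hasattr(stream, "contentModelFlag"):
        return stream
    return FakeTokenizer(stream)
```

With this shape, Filter(DropComments(tokenizer)) and DropComments(Filter(tokenizer)) are both valid pipes, and setting contentModelFlag on the outermost filter reaches the real tokenizer.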

I'd also like to see all filters moved into a "filters" directory/module.

>> I'd like the input stream to actually stream for the common use cases of
>> windows-1252 or utf-8 input.
> 
> Wrt r674, why not use a codecs.StreamReader?
> 
> Also, re "unreading" chars, maybe we could follow a pattern similar to
> Java-IO's mark()/reset().
> In Twintsam, I used PeekChar/PeekChars and ReadChar/ReadChars methods
> instead of "unreading" chars
> <http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs>
> There's also a SkipChars method, identical to ReadChars but
> "optimized" because it does not collect chars into a buffer.
> 
> This could be used for encoding detection too (especially for
> non-seekable streams): use the internal buffer (queue) for bytes read
> from rawStream when detecting the encoding; then use it for chars read
> from the decoded stream when parsing.
> 
> Finally, HTMLInputStream.reset() should call rawStream.seek(0), which
> means it would be only available if the stream is seekable (if you
> want to reparse a non-seekable stream, it's up to you to buffer it).
> 
> Any thoughts?

All sounds good, but I'd like to get rid of the HTMLInputStream reset 
method entirely.
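
For concreteness, here is a minimal sketch of the peek/read/skip idea layered on a codecs.StreamReader. PeekableReader and its method names are assumptions made up for this example (loosely modelled on the Twintsam method names quoted above); this is not the current HTMLInputStream API, and it has no reset method.

```python
import codecs
import io
from collections import deque
from itertools import islice


class PeekableReader:
    """Hypothetical wrapper: decoded chars are queued so they can be
    peeked at instead of being read and then "unread"."""

    def __init__(self, raw_stream, encoding="utf-8", chunk_size=64):
        # codecs.getreader returns the StreamReader class for the codec.
        self.reader = codecs.getreader(encoding)(raw_stream, errors="replace")
        self.chunk_size = chunk_size
        self.queue = deque()

    def _fill(self, count):
        # Decode more input until the queue holds `count` chars or EOF.
        while len(self.queue) < count:
            chunk = self.reader.read(self.chunk_size)
            if not chunk:
                break
            self.queue.extend(chunk)

    def peek_chars(self, count=1):
        # Look ahead without consuming anything.
        self._fill(count)
        return "".join(islice(self.queue, count))

    def read_chars(self, count=1):
        # Consume and return up to `count` chars.
        self._fill(count)
        available = min(count, len(self.queue))
        return "".join(self.queue.popleft() for _ in range(available))

    def skip_chars(self, count=1):
        # Like read_chars, but without building a result string.
        self._fill(count)
        for _ in range(min(count, len(self.queue))):
            self.queue.popleft()


# Usage on an in-memory byte stream:
reader = PeekableReader(io.BytesIO("<p>caf\u00e9</p>".encode("utf-8")))
first_three = reader.peek_chars(3)  # looks ahead, consumes nothing
```

Because only a forward-moving queue is kept, this works on non-seekable input; a caller that wants to reparse would have to buffer the raw bytes itself.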

The parse.py and parse.rb programs should accept a '-' as a filename, 
and interpret that as meaning stdin.
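
On the Python side this could be as small as the sketch below. The helper name open_input is an assumption for illustration, not something that exists in parse.py today; the stdin parameter exists only to make the function easy to exercise.

```python
import io
import sys


def open_input(filename, stdin=None):
    """Return a binary file object for filename; '-' means standard input."""
    if filename == "-":
        if stdin is None:
            stdin = sys.stdin
        # Text-mode stdin exposes the underlying byte stream as .buffer;
        # fall back to the object itself if it is already binary.
        return getattr(stdin, "buffer", stdin)
    return open(filename, "rb")


# Demonstration with an in-memory stand-in for stdin:
demo = open_input("-", stdin=io.BytesIO(b"<p>hi"))
```

Returning bytes rather than text leaves encoding detection to the input stream, as it is for named files.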

Is there a reason why twintsam isn't simply placed in html5lib?  I'd 
love to see a set of html5 implementations all sharing a common test 
suite base (augmented by language specific additions, where 
appropriate).  C# would be a reasonable next language to tackle.

- Sam Ruby
