Re: Caja-based HtmlParser and Parser Overhaul (issue157161)

John Hjelmstad Fri, 04 Dec 2009 17:08:30 -0800

I'm confused. Gate crashing typically means the party's worth going to! :)

So as to prevent this topic from veering too off course, I proffer the
following overview of the CL for anyone interested to review. I understand
Paul and Louis are on board. All comments welcome.


The changes are arrayed as follows:
1. Refactoring. Previously a large amount of generic (i.e., applicable to
any parser) HTML parser testing was stuck in the nekohtml-package tests. To
the maximum extent possible, this has been pulled into parse-package classes:
  - AbstractParsingTestBase includes helper methods for any parsing- or
serialization-based test. Pulled from AbstractParserAndSerializerTest.
  - AbstractParserAndSerializerTest contains several "common"
parse/serialize tests that apply no matter the concrete impl.
  - AbstractSocialMarkupHtmlParserTest pulls the social-markup test from
Neko into a base class.
  - "Actual" tests are trivial subclasses of the abstract tests, providing a
GadgetHtmlParser instance.
  - Tests converted to JUnit 4 as a side note.
Subtleties:
  - Neko-based tests override a few base parse/serialize tests due to Neko
oddities. All test files have been moved to base or nekohtml subdir to
follow suit.
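
For anyone who wants the shape of the test layout at a glance, here is a rough sketch. It is not the code in the CL: SketchParser, the check methods and both subclasses are made-up names, and plain methods stand in for what would be JUnit 4 @Test methods so that the snippet runs without any dependencies.

```java
// Hypothetical stand-in for the parser under test; not the real GadgetHtmlParser API.
interface SketchParser {
  String parseAndSerialize(String html);
}

// Shared checks live in the abstract base. In a JUnit 4 suite these would be @Test methods.
abstract class AbstractSketchParserChecks {
  // Each concrete test class only has to supply a parser instance.
  protected abstract SketchParser makeParser();

  void checkTextSurvivesRoundTrip() {
    String out = makeParser().parseAndSerialize("<p>hello</p>");
    if (!out.contains("hello")) throw new AssertionError("text lost: " + out);
  }

  void checkTagCasePreserved() {
    String out = makeParser().parseAndSerialize("<p>hello</p>");
    if (!out.contains("<p>")) throw new AssertionError("tag case changed: " + out);
  }
}

// "Actual" test: a trivial subclass providing the implementation.
class EchoParserChecks extends AbstractSketchParserChecks {
  @Override protected SketchParser makeParser() {
    return html -> html;
  }
}

// An implementation with an oddity overrides just the affected base check.
class UpperCasingParserChecks extends AbstractSketchParserChecks {
  @Override protected SketchParser makeParser() {
    return html -> html.replace("<p>", "<P>").replace("</p>", "</P>");
  }

  @Override void checkTagCasePreserved() {
    String out = makeParser().parseAndSerialize("<p>hello</p>");
    if (!out.contains("<P>")) throw new AssertionError("expected upper-cased tag: " + out);
  }
}

class TestLayoutSketch {
  public static void main(String[] args) {
    AbstractSketchParserChecks[] suites = {
      new EchoParserChecks(), new UpperCasingParserChecks()
    };
    for (AbstractSketchParserChecks suite : suites) {
      suite.checkTextSurvivesRoundTrip();
      suite.checkTagCasePreserved();
    }
    System.out.println("all checks passed");
  }
}
```

The point of the layout is that a new parser gets the whole common suite by adding one small subclass, and implementation quirks stay visible as explicit overrides.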

2. GadgetHtmlParser normalization implemented.
  - GadgetHtmlParser.normalizeFragment() removed - logic now inlined into
parseDom().
    + Rationale: IMO (open to discussion) the abstract parseDomImpl() API is
unnecessary/does too much. Pretty much all gadget HTML is treated as tag
soup and cleaned up. Having a base method whose contract is to give back
unmodified tag soup thus seems right to me, with a single implementation of
the normalization logic.
  - GadgetHtmlParser.parseDom() implements a large chunk of document
normalization logic. It takes tag soup as input and returns a valid HTML
document with a single top-level HTML element, in turn with two children:
head and body.
    + Multiple <head> nodes consolidated together. Likewise body.
    + Elements above first <head> -> end up in head.
    + Elements above first <body> -> end up in body.
    + Elements after <body> -> end up in body unless inside a <head> node.
    + <style> nodes pulled to <head> in relative order - only HTML-compliant
place for them, and no possibility that there will be conflicts (no
displayable elements in <head>).
  - OpenSocial template parsing MAY be done as a post-processing pass on
<script> nodes. Text found therein is treated as OS (X|HT)ML.
Subtleties:
  - Lots. @see parseDom() impl especially.
  - NekoSimplifiedHtmlParser still impl's separate logic for parseDomImpl
and parseFragmentImpl. I didn't dive into the difference and whether we
could actually get rid of parseDomImpl in this round.
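
To make the normalization rules above concrete, here is a minimal sketch using the standard org.w3c.dom API. It is my own illustration, not the parseDom() code in the CL. Assumptions: NormalizeSketch and all of its methods are hypothetical names; the "tag soup" arrives as the children of a wrapper element; tag names are lower-case; and a stray top-level node goes to head only if it precedes the first <head> and no <body> has been seen yet, otherwise to body.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not the Shindig GadgetHtmlParser code.
class NormalizeSketch {
  private static DocumentBuilder builder() {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Stand-in for a lenient HTML parser: reads well-formed XML whose root wraps the soup.
  static Element parseSoup(String xml) {
    try {
      return builder().parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Builds a new document of the form html > (head, body) from the soup's top-level nodes.
  static Document normalize(Element soupRoot) {
    Document out = builder().newDocument();
    Element html = out.createElement("html");
    Element head = out.createElement("head");
    Element body = out.createElement("body");
    out.appendChild(html);
    html.appendChild(head);
    html.appendChild(body);

    List<Node> top = snapshot(soupRoot.getChildNodes());
    int firstHead = -1;
    for (int i = 0; i < top.size() && firstHead < 0; i++) {
      if (isNamed(top.get(i), "head")) firstHead = i;
    }

    boolean sawBody = false;
    for (int i = 0; i < top.size(); i++) {
      Node n = top.get(i);
      if (isNamed(n, "head")) {
        copyChildren(n, head, out);   // multiple <head> nodes are consolidated
      } else if (isNamed(n, "body")) {
        sawBody = true;
        copyChildren(n, body, out);   // likewise for <body>
      } else {
        // Stray node: head if it precedes the first <head> and any <body>, else body.
        Element target = (!sawBody && i < firstHead) ? head : body;
        target.appendChild(out.importNode(n, true));
      }
    }

    // Pull <style> elements out of body into head, keeping their relative order.
    for (Node style : snapshot(body.getElementsByTagName("style"))) {
      head.appendChild(style);
    }
    return out;
  }

  // Comma-separated names of the element children, for inspection.
  static String childNames(Element parent) {
    List<String> names = new ArrayList<>();
    for (Node c : snapshot(parent.getChildNodes())) {
      if (c.getNodeType() == Node.ELEMENT_NODE) names.add(c.getNodeName());
    }
    return String.join(",", names);
  }

  private static boolean isNamed(Node n, String name) {
    return n.getNodeType() == Node.ELEMENT_NODE && name.equalsIgnoreCase(n.getNodeName());
  }

  // NodeLists are live, so copy them before moving nodes around.
  private static List<Node> snapshot(NodeList list) {
    List<Node> copy = new ArrayList<>();
    for (int i = 0; i < list.getLength(); i++) copy.add(list.item(i));
    return copy;
  }

  private static void copyChildren(Node from, Element to, Document out) {
    for (Node c : snapshot(from.getChildNodes())) to.appendChild(out.importNode(c, true));
  }

  public static void main(String[] args) {
    Element soup = parseSoup("<soup><title>t</title><head><meta/></head><p/>"
        + "<body><style>a{}</style><div/></body><body><span/></body></soup>");
    Element html = normalize(soup).getDocumentElement();
    System.out.println("head: " + childNames((Element) html.getFirstChild()));
    System.out.println("body: " + childNames((Element) html.getLastChild()));
  }
}
```

The real implementation has more corner cases (see the subtleties above); the sketch only covers the consolidation, placement and <style> rules as I have read them.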

3. CajaHtmlParser implementation.
  - Depends on Caja r3889 (pom.xml updated to reflect this).
  - Unfortunately, parseDomImpl() does top-level <html> node synthesis to
ensure document.getDocumentElement() returns it. This is for
NekoSimplified/Caja dual compatibility w/ GadgetHtmlParser base logic. As
noted, I'd prefer to move this synthesis code into
GadgetHtmlParser.parseDom() if possible.
  - Pretty straightforward past that. Defers to Caja's parser for fragment
processing. That's about it.
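
The <html> synthesis step amounts to something like the following sketch. It is not the CajaHtmlParser code: HtmlRootSketch and synthesizeHtmlRoot are hypothetical names, and the sketch assumes the document has no document element yet and owns the fragment produced by the fragment parser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

// Hypothetical illustration only; not the CajaHtmlParser implementation.
class HtmlRootSketch {
  static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Assumes doc has no document element yet and that fragment was created by doc.
  static Document synthesizeHtmlRoot(Document doc, DocumentFragment fragment) {
    Element html = doc.createElement("html");
    html.appendChild(fragment);  // appending a fragment moves all of its children
    doc.appendChild(html);       // html becomes doc.getDocumentElement()
    return doc;
  }

  public static void main(String[] args) {
    Document doc = newDocument();
    DocumentFragment frag = doc.createDocumentFragment();
    frag.appendChild(doc.createElement("p"));
    frag.appendChild(doc.createElement("div"));
    Document out = synthesizeHtmlRoot(doc, frag);
    System.out.println(out.getDocumentElement().getNodeName());
  }
}
```

Doing this once in the base class rather than in each parser implementation is the motivation for moving it into GadgetHtmlParser.parseDom().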

Misc: setValijaMode(true) removed from CajaContentRewriter, since it's now
default in the relevant Caja version.

-j-

On Fri, Dec 4, 2009 at 4:41 PM, Dan Shepherd <[email protected]> wrote:

> Indeed :) sorry for gate crashing!
>
> On 5 Dec 2009 00:35, "John Hjelmstad" <[email protected]> wrote:
>
> At all the best Shindigs, people show up fashionably late.
>
> On Fri, Dec 4, 2009 at 4:32 PM, Dan Shepherd <[email protected]> wrote:
>
> > Call this a shining? shouldn't you folks be out partying ;)
> >
> > On 5 Dec 2009 00:28, "Kevin Brown...
>
