Re: Content Rewriter Modularization: Design/Change

John Hjelmstad Tue, 12 Aug 2008 19:45:44 -0700

Interesting idea, and sounds fine to me. Concretely, this lets me sidestep
SHINDIG-500 for a little while, which is nice (though I'd _really_ like to
see the API cleanup go in! :)), in favor of migrating the existing rewriter
to a tree-based approach. Turns out I've been working on #1 and #2
independently anyway. I'll post a patch soon. Thanks!
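
For concreteness, here is a rough sketch of the kind of DOM-based rewrite
that step 2 of Louis's plan (quoted below) describes, namely moving CSS into
HEAD. It is only an illustration under some loud assumptions: it uses the
JDK's org.w3c.dom API on well-formed XHTML rather than Caja's DOM, and
StyleHoistingRewriter and its methods are hypothetical names, not existing
Shindig or Caja classes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only: not an actual Shindig or Caja class.
final class StyleHoistingRewriter {

  // Parses well-formed XHTML into a DOM (assumption: input is valid XML).
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Content is not well-formed XHTML", e);
    }
  }

  // Moves every <style> element that is not already a direct child of
  // <head> to the end of <head>. Returns the number of elements moved.
  static int hoistStyles(Document doc) {
    Node head = doc.getElementsByTagName("head").item(0);
    if (head == null) {
      return 0; // Nothing to hoist into.
    }
    NodeList styles = doc.getElementsByTagName("style");
    // The NodeList is live, so take a snapshot before moving nodes around.
    List<Node> toMove = new ArrayList<>();
    for (int i = 0; i < styles.getLength(); i++) {
      Node style = styles.item(i);
      if (style.getParentNode() != head) {
        toMove.add(style);
      }
    }
    for (Node style : toMove) {
      head.appendChild(style); // appendChild detaches the node from its old parent.
    }
    return toMove.size();
  }

  // Serializes the DOM back to a string, as the render phase would.
  static String serialize(Document doc) {
    try {
      Transformer t = TransformerFactory.newInstance().newTransformer();
      t.setOutputProperty(OutputKeys.METHOD, "xml");
      t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      t.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("Could not serialize DOM", e);
    }
  }
}
```

The point is just that the rewrite becomes a few tree operations between one
parse and one serialize, with no lexer state to track. A real version would
sit behind whatever tree interface we settle on.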


John

On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:

> Can we prove this out incrementally, bottom-up? In general I think using DOM
> is the right thing to do from a rewriting standpoint. So here's how I
> propose we proceed:
>
> 1. If the Caja DOM is a little awkward, wrap it; if not, let's just use it
> as is. We can always resolve this later.
> 2. Change the existing content rewriters to use the DOM instead of a lexer;
> this should be pretty easy. Maybe add some fancier rewriting, like moving
> CSS into HEAD.
> 3. Do some perf testing; look into the memory overhead of DOM
> transformation, etc.
> 4. Alter GadgetSpecs to retain the DOM when they are cached.
> 5. Alter the gadget rendering phase to serialize the content of the DOM to
> output.
> 6. Annotate the DOM at parse time to make render-time user-pref
> substitutions faster; this should be easy enough too...
>
> This should be enough to prove out the pipeline end-to-end and identify any
> major perf niggles. Once this is done we can look into how to inject a
> rewriter pipeline into the parsing phase and the rendering phase.
>
> -Louis
>
>
>
> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
> > Re-responding in order to apply the last few exchanges to
> > google-caja-discuss@ (@gmail vs. @google membership issues).
> >
> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > While beginning to refactor the rewriter APIs I've discovered that there
> > > unfortunately is one semantic difference inherent to moving getContent()
> > > and setContent() methods into the Gadget object (replacing
> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
> > > rewritten content.
> > >
> > > I've written a discussion of this in issue SHINDIG-500, which tracks this
> > > implementation sub-task:
> > > https://issues.apache.org/jira/browse/SHINDIG-500
> > >
> > > To summarize:
> > > 1. Is this change acceptable for the time being?
> > > 2. I suggest that we can, at a later date, move fetching of gadget specs
> > > into GadgetServer while injecting a Gadget(Spec) cache there as well,
> > > offering finer-tuned control over caching characteristics.
> > >
> > > Thanks,
> > > John
> > >
> > >
> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >
> > >> I understand these concerns, and should be clear that I have no plans to
> > >> introduce this sort of RPC anywhere, despite my personal interest in
> > >> experimenting with the idea (agreed that we don't have time for it at the
> > >> moment). It certainly wouldn't go into Shindig itself, as any such call
> > >> would be hidden behind an interface anyway.
> > >>
> > >> Putting the RPC hypothetical aside, I still feel that there's value to
> > >> implementing HTML parsing in terms of an interface:
> > >> * Clearer separation of concerns/boundary between projects.
> > >>   - Corollary simplicity in testing.
> > >> * Clearer API for content manipulation (that doesn't require knowledge of
> > >>   Caja).
> > >>
> > >> I could be convinced otherwise, but at this point the code involved seems
> > >> of manageable size, so still worth doing. Thoughts?
> > >>
> > >> John
> > >>
> > >>
> > >>
> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> I agree with Louis -- that's just not practical. Every rewriting operation
> > >>> must work in real time. Caja's existing html parser is adequate for our
> > >>> needs, and we shouldn't go out of our way to tolerate every oddity of random
> > >>> web browsers (especially as it simply wouldn't work unless you farmed it out
> > >>> to *every* browser). Any new code needs to be grounded in practical, current
> > >>> needs, not theoretical options. We can always change code later if we find a
> > >>> real need for something like that. We have real work to do in the meantime.
> > >>>
> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
> > >>>
> > >>> > John,
> > >>> >
> > >>> > From a practicality standpoint I'm a little nervous about this plan to
> > >>> > make RPC calls out of a Java process to a native process to fetch a parse
> > >>> > tree for transformations that have to occur in real time. I don't think the
> > >>> > motivating factor here is to accept all inputs that browsers can. Gadget
> > >>> > developers will tailor their markup to the platform as they have done
> > >>> > already. I would greatly prefer us to pick one 'good' parser and stick
> > >>> > with it for all the manageability and consumability benefits that come
> > >>> > with that decision. Perhaps I'm missing something here?
> > >>> >
> > >>> > -Louis
> > >>> >
> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >>> >
> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
> > >>> > >
> > >>> > > > [+google-caja-discuss]
> > >>> > > >
> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
> > >>> > > > >
> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >>> > > > >> > This proposal effectively enables the renderer to become a multi-pass
> > >>> > > > >> > compiler for gadget content (essentially, arbitrary web content). Such a
> > >>> > > > >> > compiler can provide several benefits: static optimization of gadget content
> > >>> > > > >> > (auto-proxying of images, whitespace/comment removal, consolidation of CSS
> > >>> > > > >> > blocks), security benefits (Caja et al), new functionality (annotation of
> > >>> > > > >> > content for stats, document analysis, container-specific features), etc. To
> > >>> > > > >> > my knowledge no such infrastructure exists today (with the possible
> > >>> > > > >> > exception of Caja itself, which I'd like to dovetail with this work).
> > >>> > > > >>
> > >>> > > > >> Caja clearly provides a large chunk of the code you'd need for this.
> > >>> > > > >> I'd like to hear how we'd manage to avoid duplication between the two
> > >>> > > > >> projects.
> > >>> > > > >>
> > >>> > > > >> A generalised framework for manipulating content sounds like a great
> > >>> > > > >> idea, but probably should not live in either of the two projects (Caja
> > >>> > > > >> and Shindig) but rather should be shared by both of them, I suspect.
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > I agree on both counts. As I mentioned, the piece of this idea that I expect
> > >>> > > > > to change the most is the parse tree, and Caja's .parser.html and
> > >>> > > > > .parser.css packages contain much of what I've thrown in here as a base.
> > >>> > > > >
> > >>> > > > > My key requirements are:
> > >>> > > > > * Lightweight framework.
> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good work done
> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> > >>> > > > > * Easy to manipulate structure.
> > >>> > > >
> > >>> > > > I'm not sure what the value of parser modularity is? If the resulting
> > >>> > > > tree is different, then that's a problem for people processing the
> > >>> > > > tree. And if it is not, then why do we care?
> > >>> > >
> > >>> > >
> > >>> > > IMO the value of parser modularity is that the lenient parsers native to
> > >>> > > browsers can be used in place of those that might not accept all inputs. One
> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into a server
> > >>> > > that runs parallel to Shindig and provides a "local RPC" service for parsing
> > >>> > > semi-structured HTML. The resulting tree for WebKit's parser might be
> > >>> > > different than that for an XHTML parser, Gecko's parser, etc, but if the
> > >>> > > algorithm implemented atop it is rule-based rather than strict-structure
> > >>> > > based that should be fine, no?
> > >>> > >
> > >>> > >
> > >>> > > >
> > >>> > > >
> > >>> > > > >
> > >>> > > > > I'd love to see both projects share the same base syntax tree
> > >>> > > > > representations. I considered .parser.html(.DomTree) and .parser.css for
> > >>> > > > > these, but at the moment these appeared to be a little more tied to Caja's
> > >>> > > > > lexer/parser implementation than I preferred (though I admit
> > >>> > > > > AbstractParseTreeNode contains most of what's needed).
> > >>> > > > >
> > >>> > > > > To be sure, I don't see this as an end-all-be-all transformation system in
> > >>> > > > > any way. I'd just like to put *something* reasonable in place that we can
> > >>> > > > > play with, provide some benefit, and enhance into a truly sophisticated
> > >>> > > > > vision of document rewriting.
> > >>> > > > >
> > >>> > > > >
> > >>> > > > >>
> > >>> > > > >>
> > >>> > > > >> >  c. Add Gadget.getParsedContent().
> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
> > >>> > > > >> > Contents.
> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating when any
> > >>> > > > >> > change is made, and emits an error if setContent() has been called in the
> > >>> > > > >> > interim.
> > >>> > > > >>
> > >>> > > > >> In Caja we have been moving towards immutable trees...
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > Interested to hear more about this. The whole idea is for the gadget's tree
> > >>> > > > > representation to be modifiable. Doing that with immutable trees to me
> > >>> > > > > suggests that a rewriter would have to create a completely new tree and set
> > >>> > > > > it as a representation of new content. That's convenient as far as the
> > >>> > > > > Gadget's maintenance of String<->Tree representations is concerned... but
> > >>> > > > > seems pretty heavyweight for many types of edits: in-situ modifications of
> > >>> > > > > text, content reordering, etc. That's particularly so in a single-threaded
> > >>> > > > > (viz rewriting) environment.
> > >>> > > >
> > >>> > > > Never having been entirely sold on the concept, I'll let those on the
> > >>> > > > Caja team who advocate immutability explain why.
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
