Re: Content Rewriter Modularization: Design/Change

John Hjelmstad Tue, 12 Aug 2008 17:58:07 -0700

Re-responding in order to apply the last few exchanges to
google-caja-discuss@ (@gmail vs. @google membership issues).


On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> Hello,
>
> While beginning to refactor the rewriter APIs I've discovered that there
> unfortunately is one semantic difference inherent to moving getContent() and
> setContent() methods into the Gadget object (replacing
> View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
> rewritten content.
>
> I've written a discussion of this in issue SHINDIG-500, which tracks this
> implementation sub-task: https://issues.apache.org/jira/browse/SHINDIG-500
>
> To summarize:
> 1. Is this change acceptable for the time being?
> 2. I suggest that we can, at a later date, move fetching of gadget specs
> into GadgetServer while injecting a Gadget(Spec) cache there as well,
> offering finer-tuned control over caching characteristics.
>
> Thanks,
> John
>
>
> On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
>> I understand these concerns, and should be clear that I don't have any
>> plans to introduce this sort of RPC anywhere (despite my personal interest
>> in experimenting with the idea; agreed that we don't have time for it at
>> the moment) - certainly not in Shindig itself, as any such call would be
>> hidden behind an interface anyway.
>>
>> Putting the RPC hypothetical aside, I still feel that there's value to
>> implementing HTML parsing in terms of an interface:
>> * Clearer separation of concerns/boundary between projects.
>>   - Corollary simplicity in testing.
>> * Clearer API for content manipulation (that doesn't require knowledge of
>> Caja).
>>
>> I could be convinced otherwise, but at this point the code involved seems
>> of manageable size, so still worth doing. Thoughts?
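
To make the "parsing behind an interface" point concrete, here is a minimal
sketch. It is not code from Shindig or Caja: GadgetHtmlParser, ParsedNode and
ImageCounter are hypothetical names invented for illustration, and the node
type is deliberately tiny.

```java
import java.util.List;

/** Hypothetical parser boundary: rewriters depend on this, not on Caja types. */
interface GadgetHtmlParser {
  List<ParsedNode> parse(String html);
}

/** Hypothetical flat node: an element (tagName set) or text (tagName null). */
final class ParsedNode {
  final String tagName;
  final String text;

  ParsedNode(String tagName, String text) {
    this.tagName = tagName;
    this.text = text;
  }
}

/** Example consumer written purely against the interface. */
final class ImageCounter {
  private final GadgetHtmlParser parser;

  ImageCounter(GadgetHtmlParser parser) {
    this.parser = parser;
  }

  /** Counts img elements in whatever node list the injected parser returns. */
  int countImages(String html) {
    int count = 0;
    for (ParsedNode node : parser.parse(html)) {
      // equalsIgnoreCase(null) is false, so text nodes are skipped safely.
      if ("img".equalsIgnoreCase(node.tagName)) {
        count++;
      }
    }
    return count;
  }
}
```

A test can then hand ImageCounter a canned parser that returns a fixed node
list, so no real HTML parser is needed to exercise the rewriting logic. That
is the testing simplification the bullet above refers to.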
>>
>> John
>>
>>
>>
>> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>>
>>> I agree with Louis -- that's just not practical. Every rewriting operation
>>> must work in real time. Caja's existing html parser is adequate for our
>>> needs, and we shouldn't go out of our way to tolerate every oddity of
>>> random web browsers (especially as it simply wouldn't work unless you
>>> farmed it out to *every* browser). Any new code needs to be grounded in
>>> practical, current needs, not theoretical options. We can always change
>>> code later if we find a real need for something like that. We have real
>>> work to do in the meantime.
>>>
>>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
>>>
>>> > John,
>>> >
>>> > From a practicality standpoint I'm a little nervous about this plan to
>>> > make RPC calls out of a Java process to a native process to fetch a
>>> > parse tree for transformations that have to occur in real time. I don't
>>> > think the motivating factor here is to accept all inputs that browsers
>>> > can. Gadget developers will tailor their markup to the platform as they
>>> > have done already. I would greatly prefer us to pick one 'good' parser
>>> > and stick with it for all the manageability and consumability benefits
>>> > that come with that decision. Perhaps I'm missing something here?
>>> >
>>> > -Louis
>>> >
>>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>>> >
>>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
>>> > >
>>> > > > [+google-caja-discuss]
>>> > > >
>>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
>>> > > > >
>>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>>> > > > >> > This proposal effectively enables the renderer to become a
>>> > > > >> > multi-pass compiler for gadget content (essentially, arbitrary
>>> > > > >> > web content). Such a compiler can provide several benefits:
>>> > > > >> > static optimization of gadget content (auto-proxying of images,
>>> > > > >> > whitespace/comment removal, consolidation of CSS blocks),
>>> > > > >> > security benefits (caja et al), new functionality (annotation
>>> > > > >> > of content for stats, document analysis, container-specific
>>> > > > >> > features), etc. To my knowledge no such infrastructure exists
>>> > > > >> > today (with the possible exception of Caja itself, which I'd
>>> > > > >> > like to dovetail with this work).
>>> > > > >>
>>> > > > >> Caja clearly provides a large chunk of the code you'd need for
>>> > > > >> this. I'd like to hear how we'd manage to avoid duplication
>>> > > > >> between the two projects.
>>> > > > >>
>>> > > > >> A generalised framework for manipulating content sounds like a
>>> > > > >> great idea, but probably should not live in either of the two
>>> > > > >> projects (Caja and Shindig) but rather should be shared by both
>>> > > > >> of them, I suspect.
>>> > > > >
>>> > > > >
>>> > > > > I agree on both counts. As I mentioned, the piece of this idea
>>> > > > > that I expect to change the most is the parse tree, and Caja's
>>> > > > > .parser.html and .parser.css packages contain much of what I've
>>> > > > > thrown in here as a base.
>>> > > > >
>>> > > > > My key requirements are:
>>> > > > > * Lightweight framework.
>>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
>>> > > > >   work done by WebKit or Gecko.. CSS/JS can come direct from Caja
>>> > > > >   I'd bet)
>>> > > > > * Automatic maintenance of DOM<->String conversion.
>>> > > > > * Easy to manipulate structure.
>>> > > >
>>> > > > I'm not sure what the value of parser modularity is? If the
>>> > > > resulting tree is different, then that's a problem for people
>>> > > > processing the tree. And if it is not, then why do we care?
>>> > >
>>> > >
>>> > > IMO the value of parser modularity is that the lenient parsers native
>>> > > to browsers can be used in place of those that might not accept all
>>> > > inputs. One could (and I'd like to) adapt WebKit or Gecko's parsing
>>> > > code into a server that runs parallel to Shindig and provides a
>>> > > "local RPC" service for parsing semi-structured HTML. The resulting
>>> > > tree for WebKit's parser might be different than that for an XHTML
>>> > > parser, Gecko's parser, etc, but if the algorithm implemented atop it
>>> > > is rule-based rather than strict-structure based that should be fine,
>>> > > no?
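
A rough illustration of the rule-based point, under stated assumptions:
RuleNode and ProxyImagesRule are hypothetical names, the "/proxy?url=" prefix
is made up, and the tree is a toy stand-in for whatever a real parser would
emit. The rule matches img nodes wherever they appear, so it does not care
whether a given parser inserted extra wrapper elements.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy tree node; src is only meaningful for img elements. */
final class RuleNode {
  final String tag;
  String src;
  final List<RuleNode> children = new ArrayList<>();

  RuleNode(String tag) {
    this.tag = tag;
  }

  /** Appends a child and returns this node so trees can be built inline. */
  RuleNode add(RuleNode child) {
    children.add(child);
    return this;
  }
}

/** Rule-based rewrite: visit every node, act on any img at any depth. */
final class ProxyImagesRule {
  /** Prefixes each img src with proxyPrefix; returns how many were rewritten. */
  static int apply(RuleNode node, String proxyPrefix) {
    int rewritten = 0;
    if ("img".equals(node.tag) && node.src != null) {
      node.src = proxyPrefix + node.src;
      rewritten++;
    }
    for (RuleNode child : node.children) {
      rewritten += apply(child, proxyPrefix);
    }
    return rewritten;
  }
}
```

Given table > tr > td > img from one parser and table > tbody > tr > td > img
from another, the rule rewrites the image in both. A rewriter that instead
indexed into a fixed path of children would break on one of the two shapes.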
>>> > >
>>> > >
>>> > > >
>>> > > >
>>> > > > >
>>> > > > > I'd love to see both projects share the same base syntax tree
>>> > > > > representations. I considered .parser.html(.DomTree) and
>>> > > > > .parser.css for these, but at the moment these appeared to be a
>>> > > > > little more tied to Caja's lexer/parser implementation than I
>>> > > > > preferred (though I admit AbstractParseTreeNode contains most of
>>> > > > > what's needed).
>>> > > > >
>>> > > > > To be sure, I don't see this as an end-all-be-all transformation
>>> > > > > system in any way. I'd just like to put *something* reasonable in
>>> > > > > place that we can play with, provide some benefit, and enhance
>>> > > > > into a truly sophisticated vision of document rewriting.
>>> > > > >
>>> > > > >
>>> > > > >>
>>> > > > >>
>>> > > > >> >  c. Add Gadget.getParsedContent().
>>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
>>> > > > >> >       manipulate Gadget Contents.
>>> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating
>>> > > > >> >        when any change is made, and emits an error if
>>> > > > >> >        setContent() has been called in the interim.
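
One way the callback-and-invalidate behaviour in item (c) could work, as a
sketch only. SketchGadget and SketchTree are hypothetical stand-ins for Gadget
and GadgetContentParseTree, the version counter is an assumed mechanism not
stated in the proposal, and the "tree" is just a list of words rather than a
real parse tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Gadget: owns the string and the tree view. */
final class SketchGadget {
  private String content;
  private int version;        // assumed mechanism: bumped by every setContent()
  private SketchTree tree;    // created lazily by getParsedContent()
  private boolean treeDirty;  // true when the tree is newer than the string

  SketchGadget(String content) {
    this.content = content;
  }

  String getContent() {
    if (treeDirty) {          // re-serialize only when the tree was edited
      content = tree.render();
      treeDirty = false;
    }
    return content;
  }

  void setContent(String newContent) {
    content = newContent;
    version++;                // any tree handed out earlier is now stale
    tree = null;
    treeDirty = false;
  }

  SketchTree getParsedContent() {
    if (tree == null) {
      tree = new SketchTree(this, version, getContent());
    }
    return tree;
  }

  /** Callback from the tree; rejects edits made through a stale tree. */
  void onTreeChanged(int treeVersion) {
    if (treeVersion != version) {
      throw new IllegalStateException("setContent() was called after this tree was obtained");
    }
    treeDirty = true;
  }
}

/** Hypothetical stand-in for the mutable parse tree (a word list here). */
final class SketchTree {
  private final SketchGadget owner;
  private final int version;
  private final List<String> words = new ArrayList<>();

  SketchTree(SketchGadget owner, int version, String content) {
    this.owner = owner;
    this.version = version;
    for (String word : content.split(" ")) {
      words.add(word);
    }
  }

  void replaceWord(int index, String word) {
    owner.onTreeChanged(version);  // throws before mutating if the tree is stale
    words.set(index, word);
  }

  String render() {
    return String.join(" ", words);
  }
}
```

With this shape, editing the tree and then calling getContent() yields the
re-serialized text, while editing a tree obtained before a setContent() call
fails with IllegalStateException instead of silently overwriting newer content.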
>>> > > > >>
>>> > > > >> In Caja we have been moving towards immutable trees...
>>> > > > >
>>> > > > >
>>> > > > > Interested to hear more about this. The whole idea is for the
>>> > > > > gadget's tree representation to be modifiable. Doing that with
>>> > > > > immutable trees to me suggests that a rewriter would have to
>>> > > > > create a completely new tree and set it as a representation of
>>> > > > > new content. That's convenient as far as the Gadget's maintenance
>>> > > > > of String<->Tree representations is concerned... but seems pretty
>>> > > > > heavyweight for many types of edits: in-situ modifications of
>>> > > > > text, content reordering, etc. That's particularly so in a
>>> > > > > single-threaded (viz rewriting) environment.
>>> > > >
>>> > > > Never having been entirely sold on the concept, I'll let those on
>>> > > > the Caja team who advocate immutability explain why.
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
