Interesting idea, and sounds fine to me. Concretely, this lets me sidestep SHINDIG-500 for a little while, which is nice (though I'd _really_ like to see the API cleanup go in! :)), in favor of migrating the existing rewriter to a tree-based approach. Turns out I've been working on #1 and #2 independently anyway. I'll post a patch soon. Thanks!
John

On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
> Can we prove this out incrementally, bottom-up? In general I think using DOM
> is the right thing to do from a rewriting standpoint. So here's how I
> propose we proceed:
>
> 1. If the Caja dom is a little awkward, wrap it; if not, let's just use it
>    as is. We can always resolve this later.
> 2. Change the existing content rewriters to use the DOM instead of a lexer,
>    should be pretty easy. Maybe add some fancier rewriting like moving CSS
>    into HEAD.
> 3. Do some perf testing, look into memory overhead of dom transformation
>    etc.
> 4. Alter GadgetSpecs to retain the dom when they are cached.
> 5. Alter the gadget rendering phase to serialize the content of the dom to
>    output.
> 6. Annotate the dom at parse time to make render time user-pref
>    substitutions faster, this should be easy enough too...
>
> This should be enough to prove out the pipeline end-to-end and identify any
> major perf niggles. Once this is done we can look into how to inject a
> rewriter pipeline into the parsing phase and the rendering phase.
>
> -Louis
>
> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
> > Re-responding in order to apply the last few exchanges to
> > google-caja-discuss@ (@gmail vs. @google membership issues).
> >
> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > While beginning to refactor the rewriter APIs I've discovered that there
> > > unfortunately is one semantic difference inherent to moving getContent()
> > > and setContent() methods into the Gadget object (replacing
> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
> > > rewritten content.
> > >
> > > I've written a discussion of this in issue SHINDIG-500, which tracks this
> > > implementation sub-task:
> > > https://issues.apache.org/jira/browse/SHINDIG-500
> > >
> > > To summarize:
> > > 1. Is this change acceptable for the time being?
> > > 2. I suggest that we can, at a later date, move fetching of gadget specs
> > >    into GadgetServer while injecting a Gadget(Spec) cache there as well,
> > >    offering finer-tuned control over caching characteristics.
> > >
> > > Thanks,
> > > John
> > >
> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >
> > >> I understand these concerns, and should be clear that I don't (despite my
> > >> personal interest in experimenting with the idea, agreed that we don't
> > >> have time for it at the moment) have any plans to introduce this sort of
> > >> RPC anywhere - certainly not in Shindig itself, as any such call would be
> > >> hidden behind an interface anyway.
> > >>
> > >> Putting the RPC hypothetical aside, I still feel that there's value to
> > >> implementing HTML parsing in terms of an interface:
> > >> * Clearer separation of concerns/boundary between projects.
> > >>   - Corollary simplicity in testing.
> > >> * Clearer API for content manipulation (that doesn't require knowledge
> > >>   of Caja).
> > >>
> > >> I could be convinced otherwise, but at this point the code involved seems
> > >> of manageable size, so still worth doing. Thoughts?
> > >>
> > >> John
> > >>
> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> I agree with Louis -- that's just not practical. Every rewriting
> > >>> operation must work in real time. Caja's existing html parser is
> > >>> adequate for our needs, and we shouldn't go out of our way to tolerate
> > >>> every oddity of random web browsers (especially as it simply wouldn't
> > >>> work unless you farmed it out to *every* browser). Any new code needs to
> > >>> be grounded in practical, current needs, not theoretical options. We can
> > >>> always change code later if we find a real need for something like that.
> > >>> We have real work to do in the meantime.
> > >>>
> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
> > >>>
> > >>> > John,
> > >>> >
> > >>> > From a practicality standpoint I'm a little nervous about this plan to
> > >>> > make RPC calls out of a Java process to a native process to fetch a
> > >>> > parse tree for transformations that have to occur realtime. I don't
> > >>> > think the motivating factor here is to accept all inputs that browsers
> > >>> > can. Gadget developers will tailor their markup to the platform as
> > >>> > they have done already. I would greatly prefer us to pick one 'good'
> > >>> > parser and stick with it for all the manageability and consumability
> > >>> > benefits that come with that decision. Perhaps I'm missing something
> > >>> > here?
> > >>> >
> > >>> > -Louis
> > >>> >
> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >>> >
> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
> > >>> > >
> > >>> > > > [+google-caja-discuss]
> > >>> > > >
> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
> > >>> > > > >
> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > >>> > > > >> > This proposal effectively enables the renderer to become a
> > >>> > > > >> > multi-pass compiler for gadget content (essentially,
> > >>> > > > >> > arbitrary web content). Such a compiler can provide several
> > >>> > > > >> > benefits: static optimization of gadget content
> > >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
> > >>> > > > >> > consolidation of CSS blocks), security benefits (caja et
> > >>> > > > >> > al), new functionality (annotation of content for stats,
> > >>> > > > >> > document analysis, container-specific features), etc. To my
> > >>> > > > >> > knowledge no such infrastructure exists today (with the
> > >>> > > > >> > possible exception of Caja itself, which I'd like to
> > >>> > > > >> > dovetail with this work).
> > >>> > > > >>
> > >>> > > > >> Caja clearly provides a large chunk of the code you'd need for
> > >>> > > > >> this. I'd like to hear how we'd manage to avoid duplication
> > >>> > > > >> between the two projects.
> > >>> > > > >>
> > >>> > > > >> A generalised framework for manipulating content sounds like a
> > >>> > > > >> great idea, but probably should not live in either of the two
> > >>> > > > >> projects (Caja and Shindig) but rather should be shared by
> > >>> > > > >> both of them, I suspect.
> > >>> > > > >
> > >>> > > > > I agree on both counts. As I mentioned, the piece of this idea
> > >>> > > > > that I expect to change the most is the parse tree, and Caja's
> > >>> > > > > .parser.html and .parser.css packages contain much of what I've
> > >>> > > > > thrown in here as a base.
> > >>> > > > >
> > >>> > > > > My key requirements are:
> > >>> > > > > * Lightweight framework.
> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
> > >>> > > > >   work done by WebKit or Gecko.. CSS/JS can come direct from
> > >>> > > > >   Caja I'd bet)
> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> > >>> > > > > * Easy to manipulate structure.
> > >>> > > >
> > >>> > > > I'm not sure what the value of parser modularity is? If the
> > >>> > > > resulting tree is different, then that's a problem for people
> > >>> > > > processing the tree. And if it is not, then why do we care?
> > >>> > >
> > >>> > > IMO the value of parser modularity is that the lenient parsers native
> > >>> > > to browsers can be used in place of those that might not accept all
> > >>> > > inputs. One could (and I'd like to) adapt WebKit or Gecko's parsing
> > >>> > > code into a server that runs parallel to Shindig and provides a
> > >>> > > "local RPC" service for parsing semi-structured HTML. The resulting
> > >>> > > tree for WebKit's parser might be different than that for an XHTML
> > >>> > > parser, Gecko's parser, etc, but if the algorithm implemented atop it
> > >>> > > is rule-based rather than strict-structure based that should be
> > >>> > > fine, no?
> > >>> > >
> > >>> > > > > I'd love to see both projects share the same base syntax tree
> > >>> > > > > representations. I considered .parser.html(.DomTree) and
> > >>> > > > > .parser.css for these, but at the moment these appeared to be a
> > >>> > > > > little more tied to Caja's lexer/parser implementation than I
> > >>> > > > > preferred (though I admit AbstractParseTreeNode contains most
> > >>> > > > > of what's needed).
> > >>> > > > >
> > >>> > > > > To be sure, I don't see this as an end-all-be-all transformation
> > >>> > > > > system in any way. I'd just like to put *something* reasonable
> > >>> > > > > in place that we can play with, provide some benefit, and
> > >>> > > > > enhance into a truly sophisticated vision of document rewriting.
> > >>> > > > >
> > >>> > > > >> > c. Add Gadget.getParsedContent().
> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
> > >>> > > > >> >       manipulate Gadget Contents.
> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
> > >>> > > > >> >        indicating when any change is made, and emits an
> > >>> > > > >> >        error if setContent() has been called in the interim.
> > >>> > > > >>
> > >>> > > > >> In Caja we have been moving towards immutable trees...
> > >>> > > > >
> > >>> > > > > Interested to hear more about this. The whole idea is for the
> > >>> > > > > gadget's tree representation to be modifiable. Doing that with
> > >>> > > > > immutable trees to me suggests that a rewriter would have to
> > >>> > > > > create a completely new tree and set it as a representation of
> > >>> > > > > new content. That's convenient as far as the Gadget's
> > >>> > > > > maintenance of String<->Tree representations is concerned...
> > >>> > > > > but seems pretty heavyweight for many types of edits: in-situ
> > >>> > > > > modifications of text, content reordering, etc. That's
> > >>> > > > > particularly so in a single-threaded (viz rewriting)
> > >>> > > > > environment.
> > >>> > > >
> > >>> > > > Never having been entirely sold on the concept, I'll let those on
> > >>> > > > the Caja team who advocate immutability explain why.
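
For reference, here is a rough Java sketch of the getParsedContent() / setContent() contract from item (c) of the quoted proposal, with the parser kept behind an interface. Everything in it is hypothetical: ContentParser, LineParser, ParseTree and GadgetContent are illustration-only names (the proposal calls the tree GadgetContentParseTree), and the "parser" just splits on newlines instead of parsing HTML. It only tries to show the bookkeeping: the tree reports each change back to its owner, the string form is re-serialized lazily, and a tree handed out before a later setContent() refuses further edits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical parser boundary: rewriters depend on this, not on Caja types. */
interface ContentParser {
  List<String> parse(String content);
}

/** Toy parser for illustration only: one node per line, no real HTML parsing. */
final class LineParser implements ContentParser {
  @Override
  public List<String> parse(String content) {
    return new ArrayList<>(Arrays.asList(content.split("\n", -1)));
  }
}

/** Mutable view over parsed content that reports every change to its owner. */
final class ParseTree {
  private final GadgetContent owner;
  private final List<String> nodes;
  boolean stale; // set by the owner once setContent() replaces the string

  ParseTree(GadgetContent owner, List<String> nodes) {
    this.owner = owner;
    this.nodes = nodes;
  }

  /** In-situ text edit. */
  void setText(int index, String text) {
    checkFresh();
    nodes.set(index, text);
    owner.treeChanged();
  }

  /** Content reordering: removes the node at 'from' and reinserts it at 'to'. */
  void move(int from, int to) {
    checkFresh();
    nodes.add(to, nodes.remove(from));
    owner.treeChanged();
  }

  String serialize() {
    return String.join("\n", nodes);
  }

  private void checkFresh() {
    if (stale) {
      throw new IllegalStateException("setContent() was called after this tree was handed out");
    }
  }
}

/** Stand-in for the content-holding part of Gadget; keeps String and tree in sync. */
final class GadgetContent {
  private final ContentParser parser;
  private String content;
  private ParseTree tree;
  private boolean stringOutOfDate;

  GadgetContent(ContentParser parser, String content) {
    this.parser = parser;
    this.content = content;
  }

  /** Re-serializes lazily, only if the tree changed since the last call. */
  String getContent() {
    if (stringOutOfDate) {
      content = tree.serialize();
      stringOutOfDate = false;
    }
    return content;
  }

  /** Replaces the string form and invalidates any tree handed out earlier. */
  void setContent(String newContent) {
    if (tree != null) {
      tree.stale = true;
      tree = null;
    }
    content = newContent;
    stringOutOfDate = false;
  }

  /** Parses on first use; later calls return the same mutable tree. */
  ParseTree getParsedContent() {
    if (tree == null) {
      tree = new ParseTree(this, parser.parse(getContent()));
    }
    return tree;
  }

  /** Callback from the tree: the cached string no longer matches it. */
  void treeChanged() {
    stringOutOfDate = true;
  }
}
```

A rewriter would call getParsedContent(), edit the tree with setText() or move(), and the render step would read getContent(). Swapping LineParser for a Caja-backed (or any other) ContentParser would not touch the rewriters, which is the separation argued for earlier in the thread. Whether mutation should be allowed at all is of course exactly the open immutability question.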

