*Summary*
I would like to refactor Shindig Java's content rewriter to allow multiple
independent rewriter modules to be registered with a server and apply to a
request. Each individual CL should be of reasonably modest size, so while
I'd like to get coding, I wanted first to describe the overall structure I'm
working toward in order to A) solicit feedback, B) give other implementations
a more holistic view of the approach, and C) ensure people don't feel a lot
of code was introduced without discussion/documentation.
The enhancements I would like to add lead to a scenario in which:
- The Gadget object maintains, and can be manipulated through, a parse tree
(HTML/CSS/JS) of its Content in addition to a String representation.
- A parse tree is generated by HTML/CSSParser objects.
- A ContentRewriter implements one method: rewrite(...) (the method is given
the full request context and returns caching metadata)
- Multiple ContentRewriters can be activated by a given rendering server.
- Rewriting can either be forced (no developer opt-in) or
signaled/configured by a <Require> directive.
This work involves *no* change to the semantics of the current content
rewriter, which simply becomes the default rewriter "module" provided by
Shindig.
My ask from the community is for any and all feedback on this idea, in
particular discussion on the overall approach (as I'm keen to get to
coding). I admit this proposal is lengthy and reasonably ambitious, so
thanks for reading (if you do! :)).
*Motivation/Rationale*
This proposal effectively enables the renderer to become a multi-pass
compiler for gadget content (essentially, arbitrary web content). Such a
compiler can provide several benefits: static optimization of gadget content
(auto-proxying of images, whitespace/comment removal, consolidation of CSS
blocks), security benefits (caja et al), new functionality (annotation of
content for stats, document analysis, container-specific features), etc. To
my knowledge no such infrastructure exists today (with the possible
exception of Caja itself, which I'd like to dovetail with this work).
While any of the benefits here could be achieved through direct injection of
code into the rendering code path, doing so both A) ties the new semantics
to sundry details of the current implementation and B) makes testing
difficult due to a lack of a clear boundary between the new rewriter and the
renderer. Together these raise the barrier to entry for those wishing to add
new rewriting capabilities, and make the code harder to maintain.
Rewriting is conceptually simple: modify web content (HTML/JS/CSS) given
some context (query params, gadget metadata). This proposal seeks to reflect
that in a clear API.
*Details*
I would like to make the following changes to Shindig [Java impl], in the
order specified. Each numbered point corresponds with one or more CLs of
reasonably modest size. By breaking up changes into manageable parts,
testing and verifying existing behavior along the way, I hope to avoid the
overhead associated with introduction of a new branch, though I'm open to
the idea if needed.
1. Make the Gadget object's Content that of the active View.
Discussion: Code manipulating a Gadget should have ready access to the
Content that applies to the request. Setters should apply to the Gadget
object rather than View, since Gadget is the intermediary representation for
all processing state, while View is metadata that feeds into that. A
corollary benefit to this change is that new GadgetSpecFactory instances can
be provided that only fetch a spec and don't reimplement rewriting (or,
don't have to subclass BasicGadgetSpecFactory).
a. Inject ContainerConfig into GadgetServer.
b. Construct Gadget using ContainerConfig object.
c. Rename Gadget.getView(ContainerConfig) to Gadget.getCurrentView().
d. Add Gadget.getContent(), Gadget.getRewrittenContent(), and
Gadget.setRewrittenContent(), while removing same from View.
e. Migrate rewriter functionality out of BasicGadgetSpecFactory.getSpec()
and into GadgetServer.createGadgetFromSpec().
2. Define an abstract gadget Content parse tree and the HTMLParser that
generates it.
Discussion: Most truly interesting content manipulation is done to a parse
tree, as the Content Rewriter does with Caja's today. Such a parse tree
should be a first-class concept. It's essentially DOM-lite, but not DOM
itself for A) simplicity and B) ability to extend to gadgets-specific
concepts. It should attach to the Gadget object so that it can be kept in
sync with the equivalent String representation of content. It should also be
abstract to separate it from unnecessary dependencies and, more importantly,
to allow competing (ever-improving) parser implementations to be
transparently shifted in as they're available. This change purely introduces
new concepts and doesn't integrate with processing code.
a. interface GadgetContentParseTree
i. GadgetElement getBaseElement() - returns parsed base element of
Gadget contents
b. interface GadgetElement
i. List<GadgetElement> getChildren() - returns ordered list of children
ii. boolean isText() - returns true if element contains only text
content (in the DOM sense, contains only 1 child of type TEXT)
iii. String getText() - returns text content if (isText()), otherwise
null
iv. void setText(content) - sets text content if (isText()), otherwise
[does nothing || replaces all child text nodes with one text node at the end
with content]
v. List<GadgetAttribute> getAttributes() - returns unmodifiable list of
tag's attributes
vi. String getAttribute(name) - Returns attribute with the given key, or
null if not present
vii. void setAttribute(name, val) - Sets attribute with the given n/v
viii. GadgetCssParseTree getCss() - Returns parsed CSS for contents of a
<style> block, or for a style="" attribute
c. interface GadgetCssParseTree
i. boolean isInline() - returns true if content was parsed from inline
style block
ii. List<GadgetCssRule> getRules() - Returns an ordered list of CssRules
if (!isInline()), null otherwise
iii. List<GadgetCssAttribute> getAttributes() - Returns an ordered list
of CssAttributes if (isInline()), null otherwise
d. interface GadgetCssRule
i. List<String> getSelectors() - returns ordered list of selectors for
this CSS rule
ii. List<GadgetCssAttribute> getAttributes() - returns ordered list of
CssAttributes for the rule
e. interface GadgetCssAttribute
i. String getKey() - returns key of the attribute
ii. String getValue() - returns value of the attribute
f. interface GadgetContentParser
i. GadgetContentParseTree parse(content) - Parses content into a parse
tree
g. interface GadgetCssParser
i. List<GadgetCssRule> parse(content) - Parses CSS rules from a string
ii. List<GadgetCssAttribute> parseInline(content) - Parses inline style
block contents
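To make the shape of point 2 a little more concrete, here is a minimal Java
sketch of the HTML-side interfaces plus a toy in-memory element. It is
illustrative only: the GadgetAttribute accessors (getName/getValue) and the
SimpleElement class are assumptions made for this example, getCss() is left
out for brevity, and a "text element" is modeled as an element holding a
single string rather than a single TEXT child. setText() takes the "does
nothing" option for non-text elements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

interface GadgetAttribute {
  String getName();   // assumed accessor, not specified in the proposal
  String getValue();  // assumed accessor, not specified in the proposal
}

interface GadgetElement {
  List<GadgetElement> getChildren();
  boolean isText();
  String getText();
  void setText(String content);
  List<GadgetAttribute> getAttributes();
  String getAttribute(String name);
  void setAttribute(String name, String value);
}

interface GadgetContentParseTree {
  GadgetElement getBaseElement();
}

interface GadgetContentParser {
  GadgetContentParseTree parse(String content);
}

/** Hypothetical in-memory element, present only to exercise the interface. */
class SimpleElement implements GadgetElement {
  private final List<GadgetElement> children = new ArrayList<>();
  private final Map<String, String> attrs = new LinkedHashMap<>();
  private String text;  // non-null only for text elements

  SimpleElement(String text) { this.text = text; }

  public List<GadgetElement> getChildren() { return children; }
  public boolean isText() { return text != null; }
  public String getText() { return text; }

  public void setText(String content) {
    // Non-text elements ignore the call (one of the two options listed above).
    if (isText()) { text = content; }
  }

  public List<GadgetAttribute> getAttributes() {
    List<GadgetAttribute> out = new ArrayList<>();
    for (Map.Entry<String, String> e : attrs.entrySet()) {
      final String n = e.getKey();
      final String v = e.getValue();
      out.add(new GadgetAttribute() {
        public String getName() { return n; }
        public String getValue() { return v; }
      });
    }
    return Collections.unmodifiableList(out);
  }

  public String getAttribute(String name) { return attrs.get(name); }
  public void setAttribute(String name, String value) { attrs.put(name, value); }
}
```

A rewriter working against these interfaces would typically walk
getChildren() from getBaseElement() and call setAttribute() or setText() on
the nodes it cares about.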
3. Implement GadgetContentParser and GadgetCssParser using Caja (existing
impl).
Discussion: The existing content rewriter uses Caja's HTML and CSS lexers
for processing gadget contents. This change hides this functionality behind
the newly defined interfaces. Shindig should provide an out-of-the-box
default for each interface, and these are likely candidates. The Caja-based
CSS parser will probably be sufficient in perpetuity given CSS's much more
limited acceptable syntax, but I'd expect (and would like) other
GadgetContentParser implementations to surface that are based on Mozilla or
WebKit's much more lenient HTML parsers, ensuring that the server "sees"
what a browser would.
a. Implement CajaContentParser using Caja's lexer.
b. Implement CajaCssParser using Caja's lexer.
4. Improve and augment the Gadget object to maintain two forms of its
rendering state.
Discussion: A Gadget should be able to maintain its current state in two
ways: parse tree and String. Manipulation of either should result in changes
to the other. Without doubt, it will be optimal to ensure the fewest number
of conversions between the two, but behavior should be consistent in any
case. We should also clean up the API by removing the "rewrittenContent"
APIs, since rewritten content is just the current content having been
replaced. One important note is that the new rewriter programming model
supports only one form of rewriting at a time: parse tree or string. That
is, you can't manipulate the parse tree object after setting new content as
a String unless you retrieve a brand-new parse tree. This also underscores
that rewriters must run serially, which avoids a host of threading issues
anyway (and is consistent with current behavior).
a. Remove Gadget.getRewrittenContent() and Gadget.setRewrittenContent(),
while adding Gadget.setContent(). Newly-set content is considered
"rewritten" if it changed.
i. Update the rest of the implementation to use the new method (points
a. and a.i. reflected in separate CL).
b. Add GadgetContentParser and GadgetCssParser to Gadget during
construction (each of these injected by Guice into GadgetServer).
c. Add Gadget.getParsedContent().
i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
Contents.
ii. Mutable tree calls back to the Gadget object indicating when any
change is made, and emits an error if setContent() has been called in the
interim.
iii. setContent() calls back to Gadget if new content has been set.
iv. getContent() serializes GadgetContentParseTree if modified, and vice
versa.
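The synchronization rules in point 4 can be sketched roughly as below. This
is not the proposed implementation: SketchGadget and SketchParseTree are
hypothetical names, the "parser" just splits on spaces, and the generation
counter and the choice of IllegalStateException for the stale-tree error are
assumptions made to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, drastically simplified tree: a flat list of text chunks. */
class SketchParseTree {
  private final SketchGadget owner;
  private final int generation;
  private final List<String> chunks;

  SketchParseTree(SketchGadget owner, int generation, List<String> chunks) {
    this.owner = owner;
    this.generation = generation;
    this.chunks = chunks;
  }

  void setChunk(int index, String value) {
    owner.checkFresh(generation);  // errors if setContent() ran in the interim
    chunks.set(index, value);
    owner.markTreeModified();      // calls back to the Gadget on every change
  }

  String serialize() { return String.join(" ", chunks); }
}

class SketchGadget {
  private String content;
  private SketchParseTree tree;  // null until requested, or after setContent()
  private boolean treeModified;
  private boolean rewritten;
  private int generation;

  SketchGadget(String content) { this.content = content; }

  /** Serializes the tree first if it has pending modifications. */
  String getContent() {
    if (treeModified) {
      content = tree.serialize();
      treeModified = false;
    }
    return content;
  }

  /** Replaces the content; counts as "rewritten" only if it changed. */
  void setContent(String newContent) {
    if (!newContent.equals(getContent())) { rewritten = true; }
    content = newContent;
    tree = null;
    treeModified = false;
    generation++;  // invalidates any tree handed out earlier
  }

  /** Lazily parses the current content (stand-in parser: split on spaces). */
  SketchParseTree getParsedContent() {
    if (tree == null) {
      List<String> chunks = new ArrayList<>(Arrays.asList(getContent().split(" ")));
      tree = new SketchParseTree(this, generation, chunks);
    }
    return tree;
  }

  void checkFresh(int treeGeneration) {
    if (treeGeneration != generation) {
      throw new IllegalStateException("content was replaced; fetch a new parse tree");
    }
  }

  void markTreeModified() { treeModified = true; rewritten = true; }
  boolean wasRewritten() { return rewritten; }
}
```

The point of the lazy conversions is that a chain of tree-based rewriters
pays for one parse and one serialization in total, while a rewriter that
calls setContent() forces the next tree user to re-parse.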
5. Simplify ContentRewriter interface to include only one method:
RewriteResults rewrite(GadgetContext, Gadget).
Discussion: The idea here is to simplify the rewriting API to its
essentials: given some context, manipulate a Gadget in some way. Then, tell
the server the properties of what was done: did rewriting happen? If so, did
it modify caching characteristics of the gadget? And so on. The
infrastructure takes care of the rest. The implementation of RewriteResults
in particular is designed to allow new "signals" to be introduced in a
backward-compatible way should any become relevant.
a. Remove existing ContentRewriter interface methods, replacing with:
RewriteResults rewrite(GadgetContext, Gadget).
b. Add GadgetContext.getOptions() returning protocol-agnostic versions of
inbound request options.
i. Separate out Options from HttpRequest as precursor.
c. Define RewriteResults.
i. boolean wasRewritten() -- returns true if some modification occurred
ii. cache control stuff may be added in the future as needed
d. Update DefaultContentRewriter to use the new interface
(DefaultContentRewriter still uses inline Caja impl).
e. Update ContentRewriter calling code (in GadgetServer,
AbstractHttpCache, et al) to reference the new interface.
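As a rough illustration of point 5, the single-method interface and a trivial
rewriter might look like the sketch below. GadgetContext and Gadget here are
bare stand-ins for the real Shindig classes, getOptions() is modeled as a
plain String map, and both the RewriteResults.of() factory and the "debug"
option are assumptions invented for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Stand-in for GadgetContext; options are modeled as a plain map. */
class GadgetContext {
  private final Map<String, String> options;
  GadgetContext(Map<String, String> options) { this.options = new HashMap<>(options); }
  Map<String, String> getOptions() { return Collections.unmodifiableMap(options); }
}

/** Stand-in for Gadget, holding only its current content. */
class Gadget {
  private String content;
  Gadget(String content) { this.content = content; }
  String getContent() { return content; }
  void setContent(String content) { this.content = content; }
}

/** New signals (e.g. cache control) could be added here without breaking callers. */
final class RewriteResults {
  private final boolean rewritten;
  private RewriteResults(boolean rewritten) { this.rewritten = rewritten; }
  static RewriteResults of(boolean rewritten) { return new RewriteResults(rewritten); }
  boolean wasRewritten() { return rewritten; }
}

interface ContentRewriter {
  RewriteResults rewrite(GadgetContext context, Gadget gadget);
}

/** Example rewriter: collapses runs of whitespace unless debug=1 is set. */
class WhitespaceCollapsingRewriter implements ContentRewriter {
  public RewriteResults rewrite(GadgetContext context, Gadget gadget) {
    if ("1".equals(context.getOptions().get("debug"))) {  // hypothetical option
      return RewriteResults.of(false);
    }
    String before = gadget.getContent();
    String after = before.replaceAll("\\s+", " ").trim();
    if (after.equals(before)) {
      return RewriteResults.of(false);
    }
    gadget.setContent(after);
    return RewriteResults.of(true);
  }
}
```

The calling code only needs to look at the returned RewriteResults; it never
has to know how a given rewriter did its work.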
6. Update DefaultContentRewriter to use new Gadget syntax tree APIs.
Discussion: The default content rewriter should utilize the new
infrastructure put in place for robust HTML and CSS parsing, both to avoid
duplicating code and to demonstrate the new APIs. Rather than implementing
atop an HTML tokenizer, it manipulates the DOM-lite directly.
a. Update DefaultContentRewriter and associated apparatus (LinkRewriter,
HtmlRewriter et al) to directly manipulate the parse tree.
b. Update all tests to ensure no change in semantics has occurred.
7. Provide helper base classes for ContentRewriters of various types.
Discussion: Several patterns are likely to emerge with rewriters, notably
regarding when they apply. These classes provide helpers for new rewriters,
help clean up DefaultContentRewriter's ContentRewriterFeature handling, and
simplify testing somewhat.
a. Define FeatureKeyedContentRewriter.
i. Passes in Feature params to augmented rewrite(...) method.
ii. boolean requires_feature supplied to constructor, indicating whether
a <Require> or <Optional> block must be present for the rewriter to be
active.
b. ParamKeyedContentRewriter
i. Activates rewriter based on request param info
c. Make DefaultContentRewriter a subclass of FeatureKeyedContentRewriter
with requires_feature=true.
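One way the feature-keyed helper in point 7 could be structured is sketched
below. All supporting types are minimal stand-ins; Gadget.getFeatureParams(),
the "example-feature" name, the "banner" param, and BannerRewriter are
hypothetical and exist only to show the gating logic. The requires_feature
flag is spelled requiresFeature to follow Java naming.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class GadgetContext { }  // stand-in; request details are not needed here

/** Stand-in Gadget: content plus declared features and their params. */
class Gadget {
  private String content;
  private final Map<String, Map<String, String>> features = new HashMap<>();
  Gadget(String content) { this.content = content; }
  String getContent() { return content; }
  void setContent(String content) { this.content = content; }
  void addFeature(String name, Map<String, String> params) { features.put(name, params); }
  /** Returns the feature's params, or null if the gadget did not declare it. */
  Map<String, String> getFeatureParams(String name) { return features.get(name); }
}

class RewriteResults {
  private final boolean rewritten;
  RewriteResults(boolean rewritten) { this.rewritten = rewritten; }
  boolean wasRewritten() { return rewritten; }
}

interface ContentRewriter {
  RewriteResults rewrite(GadgetContext context, Gadget gadget);
}

abstract class FeatureKeyedContentRewriter implements ContentRewriter {
  private final String featureName;
  private final boolean requiresFeature;

  FeatureKeyedContentRewriter(String featureName, boolean requiresFeature) {
    this.featureName = featureName;
    this.requiresFeature = requiresFeature;
  }

  public final RewriteResults rewrite(GadgetContext context, Gadget gadget) {
    Map<String, String> params = gadget.getFeatureParams(featureName);
    if (params == null) {
      if (requiresFeature) {
        return new RewriteResults(false);  // gadget did not opt in: skip
      }
      params = Collections.emptyMap();     // forced rewriting, no params
    }
    return rewrite(context, gadget, params);
  }

  /** Augmented rewrite(...) that also receives the feature's params. */
  protected abstract RewriteResults rewrite(
      GadgetContext context, Gadget gadget, Map<String, String> featureParams);
}

/** Hypothetical subclass: prepends a banner taken from the feature params. */
class BannerRewriter extends FeatureKeyedContentRewriter {
  BannerRewriter(boolean requiresFeature) { super("example-feature", requiresFeature); }

  @Override
  protected RewriteResults rewrite(
      GadgetContext context, Gadget gadget, Map<String, String> featureParams) {
    String banner = featureParams.getOrDefault("banner", "<!-- rewritten -->");
    gadget.setContent(banner + gadget.getContent());
    return new RewriteResults(true);
  }
}
```

Keeping the opt-in check in the base class means each subclass can be tested
by calling the three-argument rewrite(...) directly with a params map.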
8. Modularize ContentRewriter with ContentRewriterRegistry.
Discussion: At this point rewriters can be written in isolation from one
another and independently tested, verified, and experimented with. A registry
makes this infrastructure a general extension mechanism for Shindig
installations.
The registry tells the server what rewriters are supported, and each is
applied to requests that come in. It's essentially a way to generate a
composition of rewriters, and is itself injected into the server via Guice.
a. interface ContentRewriterRegistry
i. List<ContentRewriter> getRewriters() - mutable list of
ContentRewriters.
b. Implement DefaultContentRewriterRegistry, which registers only
DefaultContentRewriter.
c. Inject DefaultContentRewriterRegistry via Guice.
d. Switch out use of single ContentRewriter with ContentRewriterRegistry in
GadgetServer code.
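A minimal sketch of the registry and the serial application of its rewriters
follows. Guice wiring is deliberately left out, the supporting types are
stand-ins, and SimpleContentRewriterRegistry and RewriterChain.applyAll() are
hypothetical names rather than part of the proposal.

```java
import java.util.ArrayList;
import java.util.List;

class GadgetContext { }  // stand-in

class Gadget {           // stand-in holding only the current content
  private String content;
  Gadget(String content) { this.content = content; }
  String getContent() { return content; }
  void setContent(String content) { this.content = content; }
}

class RewriteResults {
  private final boolean rewritten;
  RewriteResults(boolean rewritten) { this.rewritten = rewritten; }
  boolean wasRewritten() { return rewritten; }
}

interface ContentRewriter {
  RewriteResults rewrite(GadgetContext context, Gadget gadget);
}

interface ContentRewriterRegistry {
  /** Mutable, ordered list of rewriters. */
  List<ContentRewriter> getRewriters();
}

/** Hypothetical registry that starts empty; a default one would pre-register. */
class SimpleContentRewriterRegistry implements ContentRewriterRegistry {
  private final List<ContentRewriter> rewriters = new ArrayList<>();
  public List<ContentRewriter> getRewriters() { return rewriters; }
}

class RewriterChain {
  /**
   * Applies every registered rewriter in order, one at a time, and reports
   * whether any of them changed the gadget.
   */
  static boolean applyAll(
      ContentRewriterRegistry registry, GadgetContext context, Gadget gadget) {
    boolean anyRewritten = false;
    for (ContentRewriter rewriter : registry.getRewriters()) {
      anyRewritten |= rewriter.rewrite(context, gadget).wasRewritten();
    }
    return anyRewritten;
  }
}
```

Because the list is ordered and applied serially, registration order is part
of the contract: a rewriter sees the output of every rewriter before it.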
If you've made it this far, thanks for reading. As mentioned above, I'm
interested in any and all feedback. I'll be tracking my progress, barring
any fundamental issues, in JIRA as well. In any case I fully expect some
details of this to change (particularly the abstract parse tree
representation), but think this impl will serve as a good base to build on
for cool new functionality.
Thanks!
John