Lossless HTML template expansion

Kragen Sitaker Mon, 18 Apr 2005 00:38:15 -0700

I wrote this incomplete essay (design document?) in January.  It
describes an unimplemented technique for making server-side web
applications more transparent, inspired by TinyTemplate, TAL, Nevow,
and HTML::Template.  I wanted to finish it up before sending it out to
the world, but it's been three months now, so I thought I'd better
send it out soon.


Lossless template expansion
===========================

So suppose we tag certain elements in HTML with an attribute that
marks them for replacement with a value: t:src="title", or what have
you.  Then we can write ...<head><title t:src="title">Sample
title</title></head><body><h1 t:src="title">Sample title</h1>... and
have our template be a well-formed, valid XHTML document, with "Sample
title" standing in for the actual value of the "title" variable.
Cool, huh?  This is the idea behind Tiny Template.  (The t: stands for
an XML namespace.)

Now suppose that we leave these attributes in the rendered HTML,
rather than stripping them out upon rendering (as Tiny Template does).
Now our formatted HTML document contains all the semantic information
about the variables that was in the original template --- in fact, you
can use the formatted HTML document as a template, and it's
semantically equal to the original template.  The content and the
formatting have become completely orthogonal and the operation of
combining them has become lossless!
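
The round trip can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions, not Tiny Template's
implementation: the namespace URI, the function names render and
extract, and the restriction of values to plain strings are all
invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the essay does not name one.
T_NS = "http://example.com/ns/template"
ET.register_namespace("t", T_NS)   # keep the "t:" prefix when serializing
T_SRC = "{%s}src" % T_NS

def render(template, context):
    """Replace the content of each t:src element, leaving t:src in place."""
    root = ET.fromstring(template)
    for elem in list(root.iter()):
        name = elem.get(T_SRC)
        if name in context:
            elem[:] = []                # drop the sample markup
            elem.text = context[name]   # simplification: plain strings only
    return ET.tostring(root, encoding="unicode")

def extract(page):
    """Recover {variable: value} from a rendered page or from a template."""
    root = ET.fromstring(page)
    return {e.get(T_SRC): e.text or "" for e in root.iter()
            if e.get(T_SRC) is not None}

TEMPLATE = (
    '<html xmlns:t="%s"><head><title t:src="title">Sample title</title>'
    '</head><body><h1 t:src="title">Sample title</h1></body></html>' % T_NS
)

page = render(TEMPLATE, {"title": "Walnuts"})
# The rendered page still carries t:src, so it works as a template too.
again = render(page, {"title": "Sample title"})
```

Here extract(page) gives back the value that was poured in, and
rendering the rendered page with the sample value reproduces what the
original template would have produced.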

The sacrifice is that some validators will complain that there are
attributes from some foreign namespace in your HTML, but web browsers
are guaranteed not to care.

Sequences and attributes
------------------------

Extending this to cover the necessities of a practical HTML templating
language is a little tricky.  Tiny Template allows the code generating
the template replacement values to generate different kinds of values
that have different effects; by default, you generate a
content-replacement value, but there are attribute-replacement values
and repetition values as well.  In order to be able to extract these
values from the rendered template, the information about which
attribute or attributes are being replaced, and what is being
repeated, must live in the template itself.  I suggest the following
solutions:

- for sequences:
  - as in HTML::Template, a page context is a set of name-value pairs, where 
    each value is either a string or a sequence of page contexts;
    sequences of strings are not allowed.
  - an element that corresponds to a sequence of page contexts is
    repeated once per page context, and is tagged t:for="varname" rather
    than t:src="varname".  Elements within it are applied to the page
    contexts within the sequence rather than the context the sequence
    belongs to.
  - in the interest of readability, whitespace following that element
    is also repeated.
- for attributes:
  - define a t:dest attribute that names the attribute being replaced,
    e.g. <a href="http://example.com/" t:src="url" t:dest="href">See
    other.</a>.
  - define t:src2, t:dest2, and so on up through 4 or so, to handle
    multiple replacements in the same attribute, e.g. <a
    t:src="linktitle" t:src2="url" t:dest2="href"
    href="http://example.com/">Example.com</a>.

So now we have t:src, t:dest, their numbered mates, and t:for.  So you
can write, for example:
    <table><tr><th>Name</th><th>Phone</th></tr>
    <tr t:for="person"><td t:src="name">John Smith</td><td
    t:src="phone">555-1212</td></tr>
    <tr t:for="person"><td t:src="name">John Smith</td><td
    t:src="phone">555-1212</td></tr>
    </table>

And have it render nicely.
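
As a rough sketch of how such a renderer might treat t:for and
t:dest, here is one possible reading in Python.  The namespace URI
and the function name fill are assumptions, values are plain strings,
and the numbered t:src2/t:dest2 attributes are left out.  Adjacent
elements tagged with the same t:for are treated as one run whose
first element is the prototype.

```python
import copy
import xml.etree.ElementTree as ET

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI
ET.register_namespace("t", T_NS)
T_SRC = "{%s}src" % T_NS
T_DEST = "{%s}dest" % T_NS
T_FOR = "{%s}for" % T_NS

def fill(elem, context):
    """Apply t:src/t:dest to elem, then expand t:for among its children."""
    name = elem.get(T_SRC)
    if name in context:
        dest = elem.get(T_DEST)
        if dest:
            elem.set(dest, context[name])   # attribute replacement
        else:
            elem[:] = []
            elem.text = context[name]       # content replacement
    children, previous = [], None
    for child in list(elem):
        seq = child.get(T_FOR)
        if seq not in context:
            fill(child, context)
            children.append(child)
            previous = None
        elif seq != previous:
            # The first element of a run is the prototype: one copy per
            # page context, each filled from that context.  deepcopy keeps
            # the tail, so whitespace after the element is repeated too.
            for item in context[seq]:
                row = copy.deepcopy(child)
                fill(row, item)
                children.append(row)
            previous = seq
        # else: a further sample element of the same run; it is dropped
    elem[:] = children

TEMPLATE = """<table xmlns:t="%s"><tr><th>Name</th><th>Phone</th></tr>
<tr t:for="person"><td t:src="name">John Smith</td><td
t:src="phone">555-1212</td></tr>
<tr t:for="person"><td t:src="name">John Smith</td><td
t:src="phone">555-1212</td></tr>
</table>""" % T_NS

PEOPLE = {"person": [{"name": "Ada", "phone": "555-0100"},
                     {"name": "Grace", "phone": "555-0101"},
                     {"name": "Linus", "phone": "555-0102"}]}

table = ET.fromstring(TEMPLATE)
fill(table, PEOPLE)
print(ET.tostring(table, encoding="unicode"))
```

Filling the already-filled table with the same context leaves it
unchanged.  An empty sequence would remove every row, and with it the
prototype.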

Now we have a relatively full *lossless* HTML templating system, as
long as none of your sequences are empty; it's only missing
conditionals, and conditionals are incompatible with losslessness
anyway.  (You can achieve the effect with an empty sequence.)

Round-trip editing
------------------

Losslessness means that you could, for example, download the rendered
page, edit it, and re-upload it, and have the guy at the other end
understand the structure well enough to update both the template and
some back-end database so that the page, when next rendered, will look
like the version you uploaded; so you can do WYSIWYG editing of
database contents with, say, Dreamweaver.
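
The receiving end of that round trip has to turn the uploaded page
back into data.  A minimal sketch of that step, assuming a
hypothetical namespace URI, a made-up function name extract, and
values flattened to plain text:

```python
import xml.etree.ElementTree as ET

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI
T_SRC = "{%s}src" % T_NS
T_DEST = "{%s}dest" % T_NS
T_FOR = "{%s}for" % T_NS

def extract(elem, context=None):
    """Rebuild the page context that would render to elem."""
    context = {} if context is None else context
    name = elem.get(T_SRC)
    if name is not None:
        dest = elem.get(T_DEST)
        if dest:
            context[name] = elem.get(dest, "")
        else:
            # Simplification: a document-fragment value is flattened to text.
            context[name] = "".join(elem.itertext())
            return context
    for child in elem:
        seq = child.get(T_FOR)
        if seq is None:
            extract(child, context)
        else:
            # Each repeated element contributes one nested page context.
            context.setdefault(seq, []).append(extract(child))
    return context

EDITED = """<div xmlns:t="%s"><h1 t:src="title">Staff phone list</h1>
<table><tr t:for="person"><td t:src="name">Ada</td><td
t:src="phone">555-0100</td></tr>
<tr t:for="person"><td t:src="name">Grace</td><td
t:src="phone">555-0101</td></tr></table>
<a href="http://example.org/" t:src="url" t:dest="href">See other.</a>
</div>""" % T_NS

context = extract(ET.fromstring(EDITED))
```

The resulting dictionary is what a back end would write to its
database; everything else in the uploaded page is template.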

DHTML user interface
--------------------

You could imagine DHTML that helps with the editing: 
- click on a named field to edit its value
- each sequence item also has
  - a "+" symbol which duplicates it when clicked
  - a "-" symbol which deletes it when clicked
  - "<" and ">" symbols which rearrange the sequence
- clicking on any HTML element brings up an edit box for its contents,
  whether it's database-backed or not; this lets you edit your
  template as easily as you can edit the data poured into it.  The
  edit box includes a button to click to move the editing focus to its
  parent element.
- a button to save the edited version as a new page, rather than
  updating the original (as a bookmarklet, this could also work with
  arbitrary HTML pages from other sources entirely)

HTML as canonical format
------------------------

Losslessness also means there doesn't need to be a back-end database
at all; you can simply store the "rendered" form of everything as
HTML, and parse the HTML when you want to extract the actual data.
To make this practical, we need to solve five major problems:
- list/detail views: embedding data from one page into another
- editing a template that applies to many pages (each representing a
  database record of some kind)
- querying the collection of data
- handling syntactic problems (like ill-formed XML, mismatched
  sequence items), semantically dangerous operations (like deleting
  a field), and semantic errors (like a field with no value).
- an expression language for reports

I will eventually get around to explaining how to solve these.

Document fragments
------------------

Notice that the values of all variables so far are either sequences,
or X(HT)ML document fragments, rather than strings.  A document
fragment is like an XML document, except that it doesn't contain PIs,
XMLPIs, or the tags around the root element.

URLs as variable names
----------------------

You can use a URL in place of a variable name wherever a variable name
is required: in t:src and t:for, so far.  The value of the URL is a
(possibly cached) resource GETted from it, either as an XML
document-as-document-fragment if possible, or as a string of text.

The value of a URL, without a fragment identifier (you know, #sec1),
being used as a variable in t:src is the entire resource; if there is
a fragment identifier that identifies some particular element, the
value of the URL is the entire element, including its enclosing tags.

If the URL is being used with t:for instead of t:src, the value of a
URL with no fragment identifier is a sequence containing a single page
context, containing the contents of that page.  (In the context of
t:for, the value of a URL with a fragment identifier is irrelevant
because it's not useful; except, as explained later, when the fragment
id is actually a variable name.)  This makes it possible to put old
content wine in new template bottles.

I'm not sure what to do if someone edits the content of a variable
from another document.  Should we try to propagate the edit (and if
so, with what credentials?) or should we just discard the change,
possibly with some warning?

Variable names as fragment identifiers in URLs
----------------------------------------------

You can use a variable name inside of the document as a fragment
identifier in the URL.  If we were being pure, foo.html#walnuts should
refer to the element whose id is walnuts, not whose t:src is walnuts,
since there might be several elements with the same t:src.  But we're
not being pure.

So if we have "<span id='a' t:src='b'>wiggle</span>" in c.html, then
(for t:src) the value of c.html#a is "<span id='a'>wiggle</span>", and
the value of c.html#b is "wiggle".
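
A sketch of that lookup for t:src, in Python.  The namespace URI and
the names resolve and inner_xml are assumptions, and so is the rule
that an element id wins over a variable name when both match.

```python
import copy
import xml.etree.ElementTree as ET

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI
ET.register_namespace("t", T_NS)
T_SRC = "{%s}src" % T_NS

def inner_xml(elem):
    """Serialize the content of elem without its enclosing tags."""
    return (elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in elem)

def resolve(document, fragid):
    """Value of document#fragid when used in t:src, or None if not found."""
    root = ET.fromstring(document)
    for elem in root.iter():
        if elem.get("id") == fragid:
            whole = copy.deepcopy(elem)
            whole.attrib.pop(T_SRC, None)   # as in the example above
            whole.tail = None
            return ET.tostring(whole, encoding="unicode")
    for elem in root.iter():
        if elem.get(T_SRC) == fragid:
            return inner_xml(elem)          # first match only
    return None

C_HTML = '<p xmlns:t="%s"><span id="a" t:src="b">wiggle</span></p>' % T_NS

by_id = resolve(C_HTML, "a")     # the whole element, tags included
by_name = resolve(C_HTML, "b")   # only the content
```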

For t:for, normally the fragment identifier identifies another t:for
in the source documents, and its value is the sequence of page
contexts in the rendered output of that original t:for.  (I don't
think it makes sense to refer to a t:src variable in a t:for, whether
that variable is in the same page or not.)  This makes it possible to
reformat small bits of other documents.

XPath expressions as fragment identifiers (strike this?  What's the point?)
-----------------------------------------

You can use an XPath expression preceded by a slash as the fragment
identifier instead, to facilitate access to data that isn't actually
stored in a defined field.  An XPath expression evaluates to a set of
nodes, rather than a particular node; I don't know that there's a
really aesthetically pleasing way to handle sets of non-unity
cardinality in this case in general.  Perhaps the nodes could be
concatenated when used in t:src, but wrapped appropriately for a
t:for.  By and large I don't think this is useful because it's an
insufficiently effective way of extracting interesting parts from other
people's documents, and an insufficiently simple way of extracting
interesting parts from your own documents.

Template overriding
-------------------

To edit a template that applies to many pages, we have a couple of
possibilities.  We could either have some kind of global
search-and-replace system that lets you select exactly which pages you
want to edit the templates of, whenever you change a template, or we
could make it possible to have many pages refer to a template that's
stored in a single other page.  I choose the second choice.

The low-level mechanism for this is another attribute, t:template,
whose value is a URL from which to get the template for the element
it's attached to.  So for this fragment:

  John <b t:src="surname" t:template="#position" id="crap">Smith</b>
  <font color="#f70" t:src="position">President</font>

we emit instead:

  John <font color="#f70" t:src="surname" t:template="#position">Smith</font>
  <font color="#f70" t:src="position">President</font>

Note that this is not lossless in that it loses some information in
the original template, namely that "Smith" was originally in bold, but
it is lossless in that the round-trip modification is idempotent.

If the URL for the template cannot be fetched, it is relatively
harmless to continue to use the template it would be replacing --- in
the above example, the bold.  (Although since that's an intra-document
anchor, the fetch could only fail if the variable name were
misspelled.)  After an edit round-trip, the markup in the template
will match the markup from the external template, so it serves as a
sort of cache in case the external template can't be fetched.

I'm not sure whether this cache should be updated automatically
whenever you view a page; it would certainly be useful to have a batch
job to do this.

When you save an edited page as a new page, if you haven't changed the
templates, the root element of the new page contains a t:template
reference back to the original page.  When you save an edited page as
itself, the edits may conflict with some referenced template; I think
that, in this case, the t:template reference should be removed, but
there should be an easy way to put it back.

I also think that you should be able to apply an effective t:template
attribute to a page as a URL argument.  In particular, I think there
should be a debugging template that can be applied to any page.

If the referenced template has a t:dest attribute, it is used, as are
any non-template attributes.  That is, this:

  <a t:src="xref" class="xref" t:dest="href" 
     href="http://example.com">q.v.</a> ...
  <span class="xref" t:src="xrefn"
     t:template="#xref">http://r2d2h2g2.example.org</span>

should render as this:

  <a t:src="xref" class="xref" t:dest="href" 
     href="http://example.com">q.v.</a> ...
  <a t:src="xrefn" class="xref" t:dest="href" 
     href="http://r2d2h2g2.example.org"
     t:template="#xref">q.v.</a>

and not as this:

  <a t:src="xref" class="xref" t:dest="href" 
     href="http://example.com">q.v.</a> ...
  <a t:src="xrefn" class="xref"
     href="http://example.com"
     t:template="#xref">http://r2d2h2g2.example.org</a>
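
Both examples in this section (the surname and the cross-reference)
can be reproduced with a short Python sketch.  It handles only
intra-document "#variable" references and plain-string values, and
the namespace URI and function names are assumptions.  Dropping the
id attribute follows the first example's output rather than any
stated rule.

```python
import copy
import xml.etree.ElementTree as ET

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI
ET.register_namespace("t", T_NS)
T_SRC = "{%s}src" % T_NS
T_DEST = "{%s}dest" % T_NS
T_FOR = "{%s}for" % T_NS
T_TEMPLATE = "{%s}template" % T_NS
OWN = (T_SRC, T_FOR, T_TEMPLATE)   # attributes the referring element keeps

def find_template(root, ref):
    """First element naming variable ref[1:]; only local '#name' refs."""
    if not ref.startswith("#"):
        return None                 # fetching external URLs is out of scope
    for elem in root.iter():
        if elem.get(T_TEMPLATE) is None and ref[1:] in (
                elem.get(T_SRC), elem.get(T_FOR)):
            return elem
    return None

def apply_templates(root):
    for elem in list(root.iter()):
        source = find_template(root, elem.get(T_TEMPLATE, ""))
        if source is None:
            continue   # no template found: the existing markup is the cache
        # Current value: an attribute if the element already has t:dest,
        # otherwise its text (simplification: plain strings only).
        old_dest = elem.get(T_DEST)
        value = elem.get(old_dest, "") if old_dest else (elem.text or "")
        own = {k: v for k, v in elem.attrib.items() if k in OWN}
        elem.tag = source.tag
        elem.attrib.clear()
        elem.attrib.update((k, v) for k, v in source.attrib.items()
                           if k not in OWN and k != "id")
        elem.attrib.update(own)
        new_dest = source.get(T_DEST)
        if new_dest:
            # The template replaces an attribute, so its content is copied
            # and the value goes into that attribute.
            elem.text = source.text
            elem[:] = [copy.deepcopy(child) for child in source]
            elem.set(new_dest, value)
        elif old_dest:
            elem[:] = []
            elem.text = value

NAME = ET.fromstring(
    '<p xmlns:t="%s">John <b t:src="surname" t:template="#position" '
    'id="crap">Smith</b> <font color="#f70" t:src="position">President'
    '</font></p>' % T_NS)
XREF = ET.fromstring(
    '<div xmlns:t="%s"><a t:src="xref" class="xref" t:dest="href" '
    'href="http://example.com">q.v.</a> ... <span class="xref" '
    't:src="xrefn" t:template="#xref">http://r2d2h2g2.example.org</span>'
    '</div>' % T_NS)
apply_templates(NAME)
apply_templates(XREF)
```

Applying apply_templates a second time changes nothing, which is the
idempotence the round trip relies on.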

If the t:template attribute appears on an element without a t:src or
t:for attribute, then neither should the top element of the template
to which it refers, and that template is applied to the current
document context.

If t:template has a fragment identifier that is an element id, the
value of that template is the entire element identified by that
id; and if its fragment identifier is a variable name, then the
value of that template is the first element that has that variable as
t:src or t:for.

Indirection
-----------

You could argue that this should include the contents of the document
named by the variable "url": 

  <p t:src="url" t:dest="t:template" 
t:template="http://oldvalue.example.com/#disclaimer" />

But this has a couple of problems:
- It requires the name "t:template" to be the value of an attribute
  (t:dest), and that's usually not particularly simple to implement if
  t: is really an XML namespace and you're using an XML parser that
  handles XML namespaces for you.
- as specified earlier, t:dest from the source template gets copied
  over and used; that breaks this template's round-trip-ness.

So instead I'm going to add another attribute, called t:embed: rather
than specifying the name of the variable to get the replacement value
from, the way t:src does, it specifies the name of the variable from
which the value for t:src itself comes.  The effect of
t:embed="walnuts" is very similar to t:src2="walnuts" t:dest2="t:src",
except that it works.  In particular, the value of "walnuts" can be
found in t:src.  (Normally, in this case, you'll want to make sure
walnuts has a fragment identifier in it, or you'll get the whole
thing!)

So we write the above example as:

  <p t:src="http://oldvalue.example.com/#disclaimer" t:embed="url" />

This doesn't handle double indirection in any particularly good way,
so maybe I should use a formula/expression instead?
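
One way to read the single-indirection case, as a Python sketch.  The
namespace URI, the function names, and the caller-supplied fetch
function (standing in for an HTTP GET plus fragment lookup) are all
assumptions made for the example.

```python
import xml.etree.ElementTree as ET

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI
T_SRC = "{%s}src" % T_NS
T_EMBED = "{%s}embed" % T_NS

def render_embed(elem, context, fetch):
    """fetch is a caller-supplied function mapping a URL to text content."""
    var = elem.get(T_EMBED)
    if var in context:
        elem.set(T_SRC, context[var])   # the variable's value lives in t:src
    src = elem.get(T_SRC)
    if src is not None:
        # In this sketch every t:src on such an element is treated as a URL.
        elem.text = fetch(src)

def extract_embed(elem):
    """Recover the indirection variable from a rendered element."""
    var = elem.get(T_EMBED)
    return {} if var is None else {var: elem.get(T_SRC)}

# A dictionary stands in for the network.
PAGES = {"http://newvalue.example.com/#disclaimer": "No warranty."}

p = ET.fromstring(
    '<p xmlns:t="%s" t:src="http://oldvalue.example.com/#disclaimer" '
    't:embed="url" />' % T_NS)
render_embed(p, {"url": "http://newvalue.example.com/#disclaimer"},
             lambda url: PAGES.get(url, ""))
```

After rendering, the value of "url" can be read back out of t:src,
which is what keeps the operation lossless.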

Now, if an element has t:for, it can't also have t:src, so t:for with
t:embed can safely have a slightly different meaning: replace the
value of t:for, rather than the value of t:src.  This allows
reformatting of documents referenced indirectly.  (Uh-oh, t:for does
need to be able to have t:src; see the section about t:span for
details.)

It might also be desirable to have some way to indirect t:template and
t:pattern (see below) URL references through a variable, and t:embed
doesn't quite seem to reach that --- maybe you should just say:

  <p t:src="$url=http://oldvalue.example.com/#disclaimer" />

(You need the value since 'url' might or might not be mentioned
anywhere else in the page.)  Then you ......

Classes as fragment identifiers  (strike this?)
-------------------------------

Modern HTML normally uses the 'class' attribute to describe the
semantics of elements, in some vocabulary specific to the particular
application.  As a convenience, you can use a class name, preceded by
a dot ".", as a fragment identifier.  In the context of t:src, the
value of this fragment is the *content* of the first element of that
class; in the context of t:for, I am not sure what its value should
be; and in the context of t:template, the value of this fragment is
the first element of that class, including both the content and the
enclosing tags.

This is not as useful as I hoped it would be because it doesn't
provide a useful way of accessing data that is not unique within a
document.

Pattern-matching
----------------

Often there are external documents we would like to parse, as if they
had been rendered by some template, then had the template attributes
removed.  Given the original template, perhaps created by a human
being editing a similar page to add the template attributes, we'd like
to recover the variable values.

If the original page exactly matched some possible rendering of the
specified template, this is mostly a solvable problem; it's always
possible to produce some set of variable values, and the only problem
is that there might be more than one, in the unusual case that there
are two identical adjacent t:for elements with nothing between them.

But that's a much simpler problem than the ones we encounter in the
real world, where we're trying to recover data from pages that get
reformatted by other people without warning.  So I propose a looser
method of matching.

I'll use the word "pattern" for the template we're trying to match the
foreign page against.

From the pattern, for each element that substitutes a variable, we
extract a set of features:
- element name
- content
- various prefixes of content: first character, first two characters,
  first word, first subelement
- similarly, various suffixes of content
- names of subelements
- element attributes
- text before
- text after
- for table cells, the content of the cell at the top of the column
  and the beginning of the row, and the index of the column
- for <dd>, the content of the corresponding <dt>
- all of the above for previous and following siblings
- all of the above for each ancestor element, as both "nth ancestor"
  and "some ancestor"
- all of the above for each element that links to this element

If the substituted element is wrapped in a t:for, it will occur more
than once, and features that do not match for all of the instances of
that element are dropped.  If there are other elements that match the
features for some substituted element just as well as the substituted
element itself does, but that are not substituted (or are substituted
with a different variable), we have a potential problem, and it needs
to be possible to find out about it somehow.

It may be necessary to allow multiple pages to constitute the same
template, in order to prevent spurious variations from getting used as
identifying features for variables that occur only once per page.

Now, when we try to match some foreign page against the pattern, for
each variable in the pattern, we look for the element (or, for t:for,
sequence of elements) in the foreign page that matches the largest
number of features from some element in the template that substitutes
that variable, and use the specified part of it.

This should allow a fairly large degree of variation in the matched
page without breaking the pattern matching.
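
The scoring idea can be sketched in Python with a handful of the
features listed above.  This is an illustration only: the namespace
URI, the function names, and the choice of features are assumptions,
t:for is not handled, every feature has equal weight, and ties are
resolved silently instead of being reported.

```python
import xml.etree.ElementTree as ET

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI
T_SRC = "{%s}src" % T_NS

def features(elem, parents):
    """A small subset of the feature list above, as a set of tuples."""
    feats = {("tag", elem.tag)}
    for key, value in elem.attrib.items():
        if not key.startswith("{" + T_NS + "}"):   # ignore t: attributes
            feats.add(("attr", key, value))
    feats.update(("child", child.tag) for child in elem)
    words = "".join(elem.itertext()).split()
    if words:
        feats.add(("first-word", words[0]))
        feats.add(("last-word", words[-1]))
    parent = parents.get(elem)
    if parent is not None:
        feats.add(("parent", parent.tag))
    while parent is not None:
        feats.add(("some-ancestor", parent.tag))
        parent = parents.get(parent)
    return feats

def match(pattern, page):
    """Guess {variable: text} for page using the t:src elements of pattern."""
    pat_root, page_root = ET.fromstring(pattern), ET.fromstring(page)
    pat_parents = {c: p for p in pat_root.iter() for c in p}
    page_parents = {c: p for p in page_root.iter() for c in p}
    candidates = [(e, features(e, page_parents)) for e in page_root.iter()]
    found = {}
    for elem in pat_root.iter():
        var = elem.get(T_SRC)
        if var is not None:
            wanted = features(elem, pat_parents)
            # Ties are silently resolved in favor of the earliest element.
            best = max(candidates, key=lambda c: len(wanted & c[1]))[0]
            found[var] = "".join(best.itertext()).strip()
    return found

PATTERN = (
    '<html xmlns:t="%s"><body>'
    '<h1 class="headline" t:src="title">Sample title</h1>'
    '<p class="byline">By <span class="author" t:src="author">Jane Doe'
    '</span></p></body></html>' % T_NS)

# The foreign page has gained a navigation table and a wrapper div.
PAGE = (
    '<html><body><table><tr><td><a href="/">Home</a></td></tr></table>'
    '<div id="main"><h1 class="headline">Walnut prices soar</h1>'
    '<p class="byline">By <span class="author">John Roe</span> on Monday'
    '</p></div></body></html>')

found = match(PATTERN, PAGE)
```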

To specify that a page should be matched against some pattern, instead
of using whatever t:attributes might or might not be embedded in it,
we use the t:pattern attribute to name the template to use as the
pattern.

It might be worthwhile to allow multiple possible variations for a
particular variable, which don't necessarily have to match; for
example, there might be lists containing two different kinds of things
intermixed indiscriminately.

Inheritance
-----------

Suppose you render a page context containing no variables using a
template that has some variables.  What do you use for the variable
values?

The simplest answer is that you do not substitute those values --- you
use the values from the template.  This allows you to change the
default value of the variable in the template and have the change be
reflected in any pages that haven't been edited since the variable was
added.

Ideally, you'd like the values to remain dependent on the template
values until that particular value is edited, rather than simply until
the entire page is edited.  To provide this function, there is a
t:inherit attribute whose presence specifies that the value of the
variable is inherited from the template rather than being supplied by
the page.  This attribute should be added by default when saving a
page as a new page, and automatically removed whenever a page save
updates the variable's value.  This requires keeping around the
original value of the variable from the template until the edited page
is saved; the simplest place to keep it is in the t:inherit attribute
itself.
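
A sketch of that bookkeeping in Python, assuming a hypothetical
namespace URI, made-up function names refresh and on_save, and
plain-string values stored whole in t:inherit:

```python
import xml.etree.ElementTree as ET

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI
T_INHERIT = "{%s}inherit" % T_NS

def refresh(elem, template_value):
    """On render: an inheriting element takes the template's current value."""
    if elem.get(T_INHERIT) is not None:
        elem.text = template_value
        elem.set(T_INHERIT, template_value)   # remember what was inherited

def on_save(elem):
    """On save: editing the value breaks the inheritance link."""
    inherited = elem.get(T_INHERIT)
    if inherited is not None and (elem.text or "") != inherited:
        del elem.attrib[T_INHERIT]

footer = ET.fromstring(
    '<p xmlns:t="%s" t:src="footer" t:inherit="Copyright 2004">'
    'Copyright 2004</p>' % T_NS)

refresh(footer, "Copyright 2005")     # the template's default has changed
on_save(footer)                       # saved unedited: the link survives
still_inherits = footer.get(T_INHERIT)

footer.text = "All rights reserved"   # someone edits the value
on_save(footer)                       # the link is removed
```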

Collections
-----------

Formulas and Queries
--------------------

Sometimes you need something more complex than a simple full-content
interpolation or pulling out existing named variables.  For example:
- URL composition from a base and a relative URL
- counting the number of items in a list
- iterating over only the first few items in a list: the first ten
  items of search results, the first paragraph of a blog post
- summing a column
- A "No items found." message when a list is empty
- limiting the size of interpolated strings: the first 80 characters
  of an email body
- all the CSS selector stuff: first, last, if href contains 'images', 
  whatever
- interpolating URLs into JavaScript URLs
- replacing double newlines with <p> tags

I think the best solution for these is some kind of expression
language that can be evaluated to get document fragments or sequences
of document contexts, and that uses variable values as its inputs.
Other possibilities include constraint languages, imperative
languages, and appeals to external REST services.

This leaves me with two reasonable options: either allow arbitrary
expressions of the expression language in place of variable names,
with variable names just being a common special case, or specify rules
for calculating variable values out-of-line:

    <t:formula name="cssurl" src="absolute_url(base, 'style.css')" />
    ...
    <link rel="stylesheet" t:src="cssurl" t:dest="href" href="foo" />

This is more verbose, but allows for a layer of abstraction, and also
allows the code to be separated from the presentation.

Probably the expression language should be JavaScript, since it's a
reasonably nice language and already widely known among the folks who
would use this.
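
To make the out-of-line option concrete, here is a sketch of a
formula evaluator.  It uses Python's own expression syntax as a
stand-in for JavaScript, and it only allows calls to a fixed table of
functions on variable names and constants.  The function table, the
names evaluate and apply_formulas, and the namespace URI are
assumptions; absolute_url is mapped to urljoin only because that
matches the example above.

```python
import ast
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

T_NS = "http://example.com/ns/template"   # hypothetical namespace URI

# Hypothetical function table; the essay only mentions absolute_url.
FUNCTIONS = {"absolute_url": urljoin, "count": len}

def evaluate(expr, context):
    """Evaluate calls to FUNCTIONS over variables and constants only."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(
                node.value, (str, int)):
            return node.value
        if isinstance(node, ast.Name):
            return context[node.id]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCTIONS and not node.keywords):
            return FUNCTIONS[node.func.id](*[ev(arg) for arg in node.args])
        raise ValueError("unsupported expression: %r" % expr)
    return ev(ast.parse(expr, mode="eval").body)

def apply_formulas(root, context):
    """Add the value of every t:formula element to the page context."""
    for formula in root.iter("{%s}formula" % T_NS):
        context[formula.get("name")] = evaluate(formula.get("src"), context)
    return context

PAGE = (
    '<html xmlns:t="%s"><head>'
    '<t:formula name="cssurl" src="absolute_url(base, \'style.css\')" />'
    '<link rel="stylesheet" t:src="cssurl" t:dest="href" href="foo" />'
    '</head></html>' % T_NS)

context = apply_formulas(ET.fromstring(PAGE),
                         {"base": "http://example.com/site/index.html"})
```

The computed "cssurl" can then be substituted by the ordinary
t:src/t:dest machinery, so the formula stays separate from the
presentation.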

t:span
------

Suppose I want to write a template that extracts the words in some
foreign HTML document, so that I can do some kind of operation on
them, like display them one at a time with JavaScript, or count them,
or whatever.

To do this, I need to be able to build a template that matches bits of
text that aren't HTML elements in the source documents.  So I define a
tag <t:span>, which has the same purpose as <span> in HTML: it merely
marks a section of text and allows attributes to be attached to it,
without itself implying any semantics.

This is also necessary for some other scenarios. 

If formulas are expressed...

Queries
-------

URL parameters
--------------

Summary
-------

A variable can contain either a document fragment or a sequence of
page contexts; some variables contain both.

Attributes:
    t:src replaces the content of an element with the
    document-fragment value of the specified variable.

    t:dest specifies that t:src should replace an attribute rather
    than the content.

    t:src2, t:dest2, etc., specify other replacements to perform on
    the same element.  Ugly hack.

    t:for repeats an element once for each value of a
    page-context-sequence variable, rendering its contents with the
    template contained in the t:for.

    t:template specifies that a particular element should use a
    template from somewhere else; if that template has a t:src or
    t:for attribute in its root element, then its value is not used,
    but the element with the t:template attribute must have the same
    attribute, either t:src or t:for, and inversely, if the template
    has no such attribute, neither should the element referring to it.

    t:embed is applied to an element that already has t:src or t:for,
    and indicates that that element's t:src or t:for value should come
    from the variable specified by t:embed, rather than the value in
    the template.

    t:pattern names a pattern template to use to parse the variables
    out of the specified source resource, so that you can extract
    semantic data from web pages not generated with this toolkit.

    t:inherit, applied to an element with t:src or t:for, specifies
    that the value found in the element itself should be overridden
    with whatever value is found for that variable in the external
    template; and, to facilitate the breaking of this inheritance link
    if someone edits this value and saves the page, it contains the
    previous value.  (Maybe it should only contain a checksum of it?)

Variables can be specified in many ways:
    A simple text string is just the name of a variable inside this
    file, and it normally has only one value, either document-fragment
    or page-context-sequence.  At present I want to exclude
    punctuation, but not whitespace, from this syntax.

    A URL without a fragment identifier has, as its document-fragment
    value, the entire content of the named resource ('s representation
    as an entity), and as its page-context-sequence value, a sequence
    containing one page context: the context of that page.
    (t:template and t:pattern have URLs as their values, but those
    URLs are not being used as variables.)

    A URL with a fragment identifier can be interpreted in several
    ways, depending on the fragment identifier:

        If the fragid is the id of some element in the entity
        retrieved, then its document-fragment value is that element,
        and its page-context-sequence value is not yet defined.

        If the fragid is the name of some variable in the entity
        retrieved, as indicated by t:src, then its document-fragment
        value is the content of the element with the t:src attribute.

        If the fragid is the name of some variable in the entity
        retrieved, as indicated by t:for, then its
        page-context-sequence value is the sequence of page contexts
        from that t:for.

        I have uncompelling uses for XPaths and class names as
        fragids in variable names.

t:template and t:pattern have uses for URLs as ways of retrieving
templates, normally the entire retrieved entity.  In these URLs,
fragment identifiers specify that only a part of the retrieved entity
should be used, either the element with that id or the first element
(in the appropriate page context) with the specified variable name.

Plan for Implementation
-----------------------

I don't have a permanent plan for how to order the steps, but
obviously we won't have to finish everything before releasing
anything.  I'm thinking I could start this in Perl.  Here's a list of
some of the most crucial features, in a plausible order of
implementation, with rough estimates in abstract "points":

- parsing (plain) t:src variables out of an HTML file (3)
- finding t:src variables pointing to external URLs with fragment IDs 
  naming t:src variables in the other file (2)
- a command-line tool to update the external values in an HTML file by
  - parsing its t:src names out 
  - fetching the external data
  - interpolating the t:src variables into the template (1)
- parsing t:template URLs out of an HTML file (1)
- make command-line tool fetch external templates and update the file (2)
- some way to handle extra fields not mentioned in external template!
  probably becomes a generic error-reporting mechanism. (2)
- t:dest (1)
- some kind of handling of document fragments being shoved into attributes
  (drop tags?  Complain?) (1)
- t:for with a plain variable name (not a URL) (4)
- t:for with an external URL with a fragment ID naming a t:for
  variable in it (2)
--- first releasable point is here (19 points so far)
- t:template for t:for (may have to be implemented earlier) (2)
- some sort of on-the-fly rendering scheme, so the software runs on the web 
  server when you view a page (5)
- caching of fetched URLs (2)
- HTML form-POST-based page update (requires authentication) (3)
- minimal DHTML UI: upload current HTML to HTML form-POST-based page update (2)
- DHTML: integrate editnode bookmarklet? (1)
--- second releasable point is here (another 15 points)
- DHTML: make t:src fields editable with just a click (if authorized)? (2)
- DHTML: save edited version as new page (2)
- t:inherit (4)
  - set it on "save as new page"! (1)
--- third releasable point is here (another 9 points)
- some kind of indirection, maybe with t:embed as described (3)
- minimal t:pattern support, maybe just based on tag hierarchy (5)
- some way to debug pattern-matching (2)
- more t:pattern heuristics (8)
--- fourth releasable point is here (another 18 points)
- basic formula support (binding to JavaScript) (4)
- DHTML: add + and - buttons on t:for fields (if authorized) (3)
- some kind of query language (embedded in formula language or not) (4)
- support for URLs without fragment IDs as variable names (1)
- support for real fragment IDs (not t:src or t:for variables) (1)
- t:span (3)
--- at this point we have another 16 points; total is 77 points
