> On 9 Jan 2015, at 12:02 am, jan i <[email protected]> wrote:
> 
> Without polluting with all the function calls, let me try to explain, how I
> see the current source (peter@ please correct me if I am wrong).
> 
> a filter can in principle inject any HTML5 string into the datamodel. Core
> delivers functions to manipulate the HTML5 model, but does not control what
> happens.
> 
> Meaning if a filter wants to write "<p style=janPrivate,
> idJan=nogo>foo</p>" to the data, it can do that. The problem with that is
> that all the other filters need to understand this, when reading data and
> generating their format.

Just to clarify the representation - it's a DOM-like model, in that we have a
tree data structure with nodes (elements and text nodes), where elements can
have attributes. It's very similar to the W3C DOM, but some of the function and
field names are different, and it doesn't use inheritance (C being the
implementation language). There is no string concatenation going on during
conversion - the input is parsed into the tree, and the tree is serialised to
XML or HTML in the standard fashion.
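
To give a rough picture of the shape of the data (sketched here in Haskell for
brevity - the real implementation is plain C structs and functions, and these
names are made up):

    -- An element has a tag name, attributes and children; text nodes are leaves.
    data Node
        = Element String [(String, String)] [Node]
        | Text String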

> 
> My idea is that core should provide function like (just an example)
>   addParagraph(*style, *id, *text)
> Doing that means a filter cannot write arbitrary HTML5 but only what is
> "allowed". If a filter need a new capability, core would be extended in a
> controlled fashion and all filters updated.

One approach - admittedly radical (but don't let that stop us) - is to enforce
this at the level of the type system, based on the HTML DTD, and possibly also
on the XML schema definitions for the individual file formats. Unfortunately,
C's type system isn't really powerful enough to express the sort of constraints
we'd want to enforce; Haskell is the only language I know of whose type system
is.
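
As a hypothetical illustration (invented types, not taken from the actual HTML
DTD), constraints like "a list may only contain list items" can be made
impossible to violate:

    newtype Inline = Inline String

    data Block
        = Para [Inline]
        | List [ListItem]    -- only list items can appear directly in a list

    newtype ListItem = ListItem [Block]

    newtype Document = Document [Block]

A filter that tried to put a paragraph directly inside a list simply wouldn't
compile.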

The parsing toolkit I'm working on (based on PEG - see
http://bford.info/packrat/) takes a grammar as input and produces a syntax tree
(currently in a custom data structure, though it could easily emit it as XML or
similar). I'm interested in taking this idea further, making the grammar and
the type system one and the same, and using this to define a high-level
functional language in which transformations could be expressed. Union types
are really important here - Haskell supports them well, but few other languages
do - and the concept has been alive and well in formal grammars since the
beginning: a production with multiple alternative ways of matching is
essentially a union type.
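
For example, an ordered-choice production could map directly onto a union type
(again a hypothetical sketch, not something the current toolkit generates):

    -- Grammar:  inline <- emphasis / strong / code / plain
    data InlineNode
        = Emphasis [InlineNode]
        | Strong   [InlineNode]
        | Code     String
        | Plain    String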

I've worked a lot with Stratego/XT (http://strategoxt.org) in the past and have
been inspired by its unique approach to expressing language transformations. I
think something like this would be very well suited to what we want to do. My
main problem with Stratego, however, is that it's untyped; you can't enforce
the restriction that a particular transformation results in a particular
type/structure, nor can you specify the types of structure it accepts. I think
a language that merges the concepts of Stratego's transformation strategies,
Haskell's type system, and PEG-based formal grammars would be a very powerful
and elegant way to achieve our goals.
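
To illustrate the difference with a made-up example (these types don't exist
anywhere in the codebase): in a typed setting, each rule declares exactly what
it consumes and produces, and the compiler checks it.

    data DocxRun  = DocxRun  { runBold :: Bool, runText :: String }
    data HtmlSpan = HtmlSpan { spanClass :: String, spanText :: String }

    -- Accepts only a DocxRun and is guaranteed to produce an HtmlSpan;
    -- an untyped strategy gives no such guarantee.
    runToSpan :: DocxRun -> HtmlSpan
    runToSpan (DocxRun bold text) =
        HtmlSpan (if bold then "bold" else "") text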

My primary motivation for using formal grammars is to give us the ability to
handle non-XML languages such as Markdown, RTF, LaTeX, etc. With a suitable
parser implementation, we can deal with these just as easily as with any
XML-based format - and in fact we could move to a higher level of abstraction
where XML is just a special case of the more general type system. XML Schema
and Relax NG (used for the OOXML and ODF specs respectively, if I remember
correctly) could also be used as inputs to the type system, and used for static
typing.

A programming language of this nature would allow us to formally specify the 
exact nature of the intermediate form (be it a dialect of HTML or otherwise), 
and get static type checking of the transformation code to a degree that can't 
be achieved with C/C++ or other similar languages. More static type checking 
also has the potential to reduce the number of required test cases, as we can
eliminate whole classes of errors through the type system.

>>  This relates to how inter-conversion is to be tested.  Is there some
>>  abstraction against which document features are assessed and mapped
>>  through or are we working concrete level to/from concrete level and
>>  that is essentially it?
>> 
> I don't think we should test inter-conversion as such. It is much more
> efficient to format xyz <-> HTML5. And if our usage of HTML5 is defined
> (and restricted) it should work.

Agreed. Think of it like the frontend and backend parts of a compiler. If you 
want to support N languages on M CPU architectures, then you would generally 
have a CPU-independent intermediate representation (essentially a high-level 
assembly language). You write a frontend for each of the N languages which 
targets this intermediate, abstract machine (including language-specific 
optimisations). You also write a backend for each of the M target CPU 
architectures (including architecture-specific optimisations). You then need 
N+M tests, instead of N*M.

In our case, HTML is the "intermediate architecture", or more appropriately,
"intermediate format". Each filter knows about its own format (e.g. .docx) and
HTML, and deals solely with the conversion between those two.

If you want to convert from say .docx to .odt, then you first go through HTML 
as an intermediate step. So the file gets converted from .docx to HTML, and 
then from HTML to .odt.
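
A rough sketch of how that composition might look (hypothetical names and
types, continuing the Haskell notation from above - not the actual filter API):

    newtype Html = Html String    -- stand-in for the real intermediate tree

    data Filter = Filter
        { readToHtml    :: FilePath -> IO Html
        , writeFromHtml :: Html -> FilePath -> IO () }

    -- Converting .docx to .odt goes via HTML: read with the docx filter,
    -- write with the odt filter.
    convert :: Filter -> Filter -> FilePath -> FilePath -> IO ()
    convert src dst inputPath outputPath = do
        intermediate <- readToHtml src inputPath
        writeFromHtml dst intermediate outputPath

Adding a new format then means writing one reader and one writer against the
HTML model, rather than a converter to and from every other format.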

—
Dr Peter M. Kelly
[email protected]

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
