> On 9 Jan 2015, at 12:02 am, jan i <[email protected]> wrote:
>
> Without polluting with all the function calls, let me try to explain, how I
> see the current source (peter@ please correct me if I am wrong).
>
> a filter can in principle inject any HTML5 string into the datamodel. Core
> delivers functions to manipulate the HTML5 model, but does not control what
> happens.
>
> Meaning if a filter wants to write "<p style=janPrivate, idJan=nogo>foo</p>"
> to the data, it can do that. The problem with that is that all the other
> filters need to understand this, when reading data and generating their
> format.
Just to clarify on the representation - it's a DOM-like model, in that we have
a tree data structure with nodes (elements and text nodes), where elements can
have attributes. It's very similar to the W3C DOM, but some of the function and
field names are different, and it doesn't use inheritance (due to C being the
implementation language). There is no string concatenation going on during
conversion - the DOM tree is parsed and serialised to XML or HTML in the
standard fashion.

> My idea is that core should provide function like (just an example)
> addParagraph(*style, *id, *text)
> Doing that means a filter cannot write arbitrary HTML5 but only what is
> "allowed". If a filter need a new capability, core would be extended in a
> controlled fashion and all filters updated.

One approach - admittedly radical (but don't let that stop us) - is to enforce
this at the level of the type system, based on the HTML DTD, and possibly also
on the XML schema definitions for the individual file formats. Unfortunately,
C's type system isn't really powerful enough to express the sort of constraints
we'd want to enforce; Haskell is the only language I know of that is.

The parsing toolkit I'm working on (based on PEG - see
http://bford.info/packrat/) takes a grammar as input and produces a syntax tree
(currently in a custom data structure, though it could easily produce the
syntax tree as XML or similar). I'm interested in taking this idea further,
making the grammar and the type system one and the same, and using this to
define a high-level functional language in which transformations could be
expressed. Union types are really important here - something Haskell supports
well but few other languages do - and the concept has been alive and well in
formal grammars since the beginning: multiple different possible ways of
matching a given production.

I've worked a lot with Stratego/XT (http://strategoxt.org) in the past and have
been inspired by its unique approach to expressing language transformations. I
think something like this would be very well suited to what we want to do. My
main problem with Stratego, however, is that it's untyped; you can't enforce
the restriction that a particular transformation results in a particular
type/structure, nor can you specify the types of structure it accepts. I think
a language that merges Stratego's transformation strategies, Haskell's type
system, and PEG-based formal grammars would be a very powerful and elegant way
to achieve our goals.

My primary motivation for using formal grammars is to give us the ability to
handle non-XML formats such as Markdown, RTF and LaTeX. With a suitable parser
implementation, we can deal with these just as easily as with any XML-based
structure - and in fact we could even move to a higher level of abstraction
where XML is just a special case of the more general type system. XML Schema
and Relax NG (used for the OOXML and ODF specs respectively, if I remember
correctly) could also be used as inputs to the type system, and for static
typing.

A programming language of this nature would allow us to formally specify the
exact nature of the intermediate form (be it a dialect of HTML or otherwise),
and get static type checking of the transformation code to a degree that can't
be achieved with C/C++ or other similar languages.
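To make that concrete, here's a rough Haskell sketch of what I mean. All of the
names below (Doc, Block, Inline, addParagraph) are made up for illustration -
none of this exists in the current code. The point is that the type itself
defines what a filter is allowed to produce, so the restriction jan describes
falls out of the type system rather than being enforced by convention:

-- Purely illustrative: the intermediate form as an algebraic data type.
-- A filter can only construct values this type admits, so something like
-- style=janPrivate simply cannot be expressed unless the type is extended.

data Inline
  = Text String
  | Emph [Inline]
  | Strong [Inline]
  deriving (Show)

data Block
  = Paragraph { paraStyle :: Maybe String
              , paraId    :: Maybe String
              , paraBody  :: [Inline] }
  | Heading Int [Inline]
  deriving (Show)

newtype Doc = Doc [Block]
  deriving (Show)

-- In the spirit of jan's addParagraph(*style, *id, *text):
addParagraph :: Maybe String -> Maybe String -> String -> Doc -> Doc
addParagraph style pid text (Doc blocks) =
  Doc (blocks ++ [Paragraph style pid [Text text]])

The obvious cost is that every new capability means extending the type and
updating the filters - but that is exactly the controlled evolution jan is
describing, with the compiler checking it for us.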
More static type checking also has the potential to reduce the number of
required testcases, as we can eliminate whole classes of errors through the
type system.

>> This relates to how inter-conversion is to be tested. Is there some
>> abstraction against which document features are assessed and mapped
>> through or are we working concrete level to/from concrete level and
>> that is essentially it?
>>
> I dont think we should test inter-conversion as such. It is much more
> efficient to format xyz <-> HTML5. And if our usage of HTML5 is defined
> (and restricted) it should work.

Agreed. Think of it like the frontend and backend parts of a compiler. If you
want to support N languages on M CPU architectures, then you would generally
have a CPU-independent intermediate representation (essentially a high-level
assembly language). You write a frontend for each of the N languages which
targets this intermediate, abstract machine (including language-specific
optimisations). You also write a backend for each of the M target CPU
architectures (including architecture-specific optimisations). You then need
N+M tests, instead of N*M.

In our case, HTML is the "intermediate architecture", or more appropriately,
"intermediate format". Each filter knows about its own format (e.g. .docx) and
HTML, and deals solely with the conversion between the two. If you want to
convert from, say, .docx to .odt, you go through HTML as an intermediate step:
the file gets converted from .docx to HTML, and then from HTML to .odt.
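As a rough sketch of that shape (again hypothetical Haskell with made-up names
- Filter, readDoc, writeDoc - nothing here is real code): each filter is just a
reader and a writer for the intermediate form, and converting between any two
formats is the composition of two independently testable halves, which is where
N+M rather than N*M comes from:

import qualified Data.ByteString as BS

-- Stand-in for the HTML-based intermediate model discussed above.
data Doc = Doc

-- A filter only knows how to get its own format into and out of Doc.
data Filter = Filter
  { readDoc  :: BS.ByteString -> Either String Doc   -- e.g. parse .docx
  , writeDoc :: Doc -> BS.ByteString                 -- e.g. serialise .odt
  }

-- Cross-format conversion is composition through the intermediate form;
-- no filter ever needs to know about any other filter's format.
convert :: Filter -> Filter -> BS.ByteString -> Either String BS.ByteString
convert from to input = writeDoc to <$> readDoc from input

Adding a new format then means writing and testing one new filter, not one
converter per existing format.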
—
Dr Peter M. Kelly
[email protected]

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)