These are random notes about XML from another time and space.
Original mail from 2008-07-12, modified.
The XML specification says, at
http://www.w3.org/TR/REC-xml/#sec-terminology :
fatal error
[Definition: An error which a conforming XML processor
MUST detect and report to the application. After
encountering a fatal error, the processor MAY continue
processing the data to search for further errors and
MAY report such errors to the application. In order
to support correction of errors, the processor MAY make
unprocessed data from the document (with intermingled
character data and markup) available to the application.
Once a fatal error is detected, however, the processor
MUST NOT continue normal processing (i.e., it MUST NOT
continue to pass character data and information about
the document's logical structure to the application in
the normal way).]
Could we interpret this set of rules in this way?
Context: a non-well-formed document is sent to an application
containing an XML processor.
1. The XML processor detects that the document is not well-formed and
reports it to the application.
2. The XML processor continues processing the data and reports data
and errors to the application.
3. The XML processor delivers a character stream, with the broken
information identified, to the application.
4. The application applies an XML recovery mechanism to the stream
sent by the XML processor and does what it wants with it, such as
displaying the document if necessary.
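A rough sketch of these four steps, assuming Python's standard library
as the XML processor and a made-up recover() helper standing in for
the application's own recovery mechanism:

```python
import re
import xml.etree.ElementTree as ET

broken = "<feed><title>Broken & feed</title></feed>"  # stray '&' is a fatal error

def recover(raw):
    # Hypothetical recovery step owned by the application, not the
    # processor: here it only escapes bare ampersands.
    return re.sub(r"&(?!\w+;|#\d+;)", "&amp;", raw)

try:
    root = ET.fromstring(broken)           # 1. processor detects the fatal error
except ET.ParseError as err:               #    ... and reports it to the application
    print("fatal error:", err)             # 2. normal processing stops here
    root = ET.fromstring(recover(broken))  # 3./4. application recovers the raw stream

print(root.findtext("title"))              # -> Broken & feed
```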
Some preliminary observations:
* XML served on the Web (in the HTTP environment) is a very small
share of content.
* XML on the desktop, on mainframes, and in back-ends is common.
* XML vocabularies are powerful in a controlled environment
(e.g. DocBook, data transfer in banking, etc.).
* XML used on the Web is often tortured and broken.
* Many Web developers do not understand XML beyond the notion of
well-formedness.
The goal is to understand XML conformance and processing in order to
find strategies for
1. fixing broken XML on the Web,
2. improving the ecosystem.
The Web is a highly distributed environment with loose joints.
*Socially*, this has a lot of consequences. A good example of XML used
on the Web is Atom. The language was designed from scratch, with strong
XML advocates as chairs (Tim Bray and Sam Ruby). It started clean,
without broken content. It is used by a very large community of
people and tools (consumers AND producers). The language was
developed in a test-driven way. Most of the implementers who matter in
the area were inside the group, implementing and testing while it was
being developed.
# PRODUCING BROKEN XML
The fact is that many Atom feeds are broken, for many reasons:
* edited by hand;
* created by templating tools which are not XML producers;
* mixing content from different sources (HTML, databases, XML) with
different encodings.
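A minimal illustration of the last two points, assuming a naive string
template rather than any particular templating engine:

```python
import xml.etree.ElementTree as ET

template = "<entry><title>{title}</title></entry>"
title = "Cats & dogs <3"              # plain text pasted from another source

entry = template.format(title=title)  # blind substitution, no XML escaping
try:
    ET.fromstring(entry)
except ET.ParseError as err:
    print("broken entry:", err)       # the feed is non-well-formed before it ships
```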
It means that when designing an Atom feed consumer, implementers are
forced to recover the broken content to make it usable by the crowd
(social impact). This is the second part of Postel's law: "Be liberal
in what you accept".
Data integrity is lost. But in the Atom case, the cost/benefit balance
between integrity loss and usability tips toward usability.
Does this show that *authoring rules* are usually poorly defined?
We define what a "conformant document" must be,
then we decide that a "conformant producer" is a tool which produces
"conformant documents".
But in the process we forget about authoring usability.
Example 1:
With an *XML* authoring tool, I create a document where I type markup
by hand.
The tool has an auto-save mode.
I type "<foo><bar" then auto-save the document is already non well
formed on the drive.
It should not be an issue as long as the final document is well-formed.
Though how do we define "final save"?
There is an issue. And we have very often to modify document
or to have temporary non well formed document.
(not even talking about validity.)
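One possible answer, sketched below with Python's standard parser: the
tool treats a save as "final" only when the buffer is well-formed and
keeps everything else as a draft. The file naming and the draft
convention are assumptions for the example, not a description of any
real tool.

```python
import xml.etree.ElementTree as ET

def save(buffer, path):
    try:
        ET.fromstring(buffer)              # well-formedness check
    except ET.ParseError:
        with open(path + ".draft", "w") as f:
            f.write(buffer)                # auto-save: intermediate, possibly broken
        return "draft"
    with open(path, "w") as f:
        f.write(buffer)                    # final save: well-formed document
    return "final"

print(save("<foo><bar", "doc.xml"))          # -> draft
print(save("<foo><bar/></foo>", "doc.xml"))  # -> final
```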
Example 1 used an XML authoring tool, which is already a big
step for writing a document. Many XML documents are produced from
templating languages, sometimes in the code itself, sometimes in a file
with variable substitution. Some of these languages were not designed
to be well-formed themselves (they contain non-XML constructs which
will be substituted).
Both a wrong template and the variable substitution itself are
possible sources of broken XML.
What are the requirements for creating better tools able to output
good XML content?
Something easy to integrate in a workflow: authoring libraries, etc.
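One direction, sketched here with Python's standard ElementTree as a
stand-in for such an authoring library: build the tree through an
XML-aware API so that escaping and nesting are handled by the library
rather than by the template author.

```python
import xml.etree.ElementTree as ET

entry = ET.Element("entry")
title = ET.SubElement(entry, "title")
title.text = "Cats & dogs <3"          # escaped automatically on serialization

print(ET.tostring(entry, encoding="unicode"))
# -> <entry><title>Cats &amp; dogs &lt;3</title></entry>
```

The same tree can then be written to disk or handed to the rest of the
workflow, which is the kind of integration mentioned above.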
# CONSUMING BROKEN XML
Then there is a lot of broken XML on the Web.
How do we improve the ecosystem? How do we repair it?
Being too strict usually has two *social* effects:
* people avoid using the language at all and go to another one: JSON,
HTML, etc.;
* people find non-standard recipes to recover the content:
non-interoperable recovering parsers.
If the recovery mechanism were well defined, it would help:
1. to create more well-formed (and sometimes valid) XML content;
2. to develop applications with strict parsers (some applications would
be more willing to go XML because less content would be broken).
The overall effect would be to make XML easier for people to use (good
karma) and to create more XML documents on the Web.
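For comparison, here is the pattern many consumers end up with today:
a strict parse first, then a fall-back to a recovering parser. lxml's
recover mode is used only as one example of such a recipe (and assumes
lxml is installed); its repairs are implementation-defined, which is
exactly the interoperability problem described above.

```python
import xml.etree.ElementTree as ET
from lxml import etree as lxml_etree

def parse_feed(raw):
    try:
        return ET.fromstring(raw), False           # strict parse, no recovery needed
    except ET.ParseError:
        recovering = lxml_etree.XMLParser(recover=True)
        return lxml_etree.fromstring(raw, parser=recovering), True

element, was_recovered = parse_feed("<feed><title>Broken & feed</title></feed>")
print(was_recovered, element.findtext("title"))
```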
# INTEGRITY OF XML DOCUMENTS
A recovered document MIGHT have lost its intended data integrity.
Why not have a mechanism to flag content which has been recovered,
such as:
* an XML attribute on the root element, e.g. xml:check="recovered" or
something similar;
* or an XML processing instruction (PI).
This warns people and processors that the information may contain poor
data. It helps to design grassroots quality-control mechanisms. The
information is visible *in* the document, not outside of it.
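A sketch of what the attribute variant could look like; xml:check is
the proposal above, not an existing standard, and a processing
instruction would work the same way.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

def flag_recovered(root):
    # Mark the root element so downstream consumers know this data
    # went through a recovery step and may have lost integrity.
    root.set("{%s}check" % XML_NS, "recovered")
    return root

feed = ET.fromstring("<feed><title>Recovered feed</title></feed>")
print(ET.tostring(flag_recovered(feed), encoding="unicode"))
# -> <feed xml:check="recovered"><title>Recovered feed</title></feed>
```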
--
Karl Dubost
Montréal, QC, Canada
http://www.la-grange.net/karl/