These are random notes about XML from another time and space.
Original mail from 2008-07-12, modified.
The XML specification says, at
http://www.w3.org/TR/REC-xml/#sec-terminology :
fatal error
[Definition: An error which a conforming XML processor
MUST detect and report to the application. After
encountering a fatal error, the processor MAY continue
processing the data to search for further errors and
MAY report such errors to the application. In order
to support correction of errors, the processor MAY make
unprocessed data from the document (with intermingled
character data and markup) available to the application.
Once a fatal error is detected, however, the processor
MUST NOT continue normal processing (i.e., it MUST NOT
continue to pass character data and information about
the document's logical structure to the application in
the normal way).]
Could we interpret this set of rules in this way?
Context: a non-well-formed document is sent to an application
containing an XML processor.
1. The XML processor detects that the document is not well-formed and
reports it to the application.
2. The XML processor continues processing the data and reports data
and errors to the application.
3. The XML processor delivers a character stream, with the broken
information identified, to the application.
4. The application applies an XML recovery mechanism to the stream
sent by the XML processor and does what it wants with it, such as
displaying the document if necessary.
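A rough sketch of these four steps, assuming Python's standard library
as the XML processor and a made-up recover() helper standing in for
the application's own recovery mechanism:

```python
import re
import xml.etree.ElementTree as ET

broken = "<feed><title>Broken & feed</title></feed>"  # stray '&' is a fatal error

def recover(raw):
    # Hypothetical recovery step owned by the application, not the
    # processor: here it only escapes bare ampersands.
    return re.sub(r"&(?!\w+;|#\d+;)", "&amp;", raw)

try:
    root = ET.fromstring(broken)           # 1. processor detects the fatal error
except ET.ParseError as err:               #    ... and reports it to the application
    print("fatal error:", err)             # 2. normal processing stops here
    root = ET.fromstring(recover(broken))  # 3./4. application recovers the raw stream

print(root.findtext("title"))              # -> Broken & feed
```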
Some preliminary observations:
* XML served on the Web (in the HTTP environment) is a very small
share of content.
* XML on the desktop, on mainframes, and in back-ends is common.
* XML vocabularies are powerful in a controlled environment
(e.g. DocBook, data transfer in banking, etc.).
* XML used on the Web is often tortured and broken.
* Many Web developers do not understand XML beyond the notion of
well-formedness.
The goal is to understand XML conformance and processing in order to
find strategies for
1. fixing broken XML on the Web,
2. improving the ecosystem.
The Web is a highly distributed environment with loose joints.
*Socially*, this has a lot of consequences. A good example of XML used
on the Web is Atom. The language was designed from scratch, with strong
XML advocates as chairs (Tim Bray and Sam Ruby). It started clean,
without broken content. It is used by a very large community of
people and tools (consumers AND producers). The language was
developed in a test-driven way. Most of the implementers who matter in
the area were inside the group, implementing and testing while it was
being developed.
# PRODUCING BROKEN XML
The fact is that many Atom feeds are broken, for many reasons:
* edited by hand;
* created by templating tools which are not XML producers;
* mixing content from different sources (HTML, databases, XML) with
different encodings.
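A minimal illustration of the last two points, assuming a naive string
template rather than any particular templating engine:

```python
import xml.etree.ElementTree as ET

template = "<entry><title>{title}</title></entry>"
title = "Cats & dogs <3"              # plain text pasted from another source

entry = template.format(title=title)  # blind substitution, no XML escaping
try:
    ET.fromstring(entry)
except ET.ParseError as err:
    print("broken entry:", err)       # the feed is non-well-formed before it ships
```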
It means that when designing an Atom feed consumer, implementers are
forced to recover the broken content to make it usable by the crowd
(social impact). This is the second part of Postel's law: "Be liberal
in what you accept".
Data integrity is lost. But in the Atom case, the cost/benefit balance
between integrity loss and usability tips toward usability.
Does this show that *authoring rules* are usually poorly defined?
We define what a "conformant document" must be,
then we decide that a "conformant producer" is a tool which produces
"conformant documents".
But in the process we forget about authoring usability.
Example 1:
With an *XML* authoring tool, I create a document where I type markup
by hand.
The tool has an auto-save mode.
I type "<foo><bar" then auto-save the document is already non well
formed on the drive.
It should not be an issue as long as the final document is well-formed.
Though how do we define "final save"?
There is an issue. And we have very often to modify document
or to have temporary non well formed document.
(not even talking about validity.)
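One possible answer, sketched below with Python's standard parser: the
tool treats a save as "final" only when the buffer is well-formed and
keeps everything else as a draft. The file naming and the draft
convention are assumptions for the example, not a description of any
real tool.

```python
import xml.etree.ElementTree as ET

def save(buffer, path):
    try:
        ET.fromstring(buffer)              # well-formedness check
    except ET.ParseError:
        with open(path + ".draft", "w") as f:
            f.write(buffer)                # auto-save: intermediate, possibly broken
        return "draft"
    with open(path, "w") as f:
        f.write(buffer)                    # final save: well-formed document
    return "final"

print(save("<foo><bar", "doc.xml"))          # -> draft
print(save("<foo><bar/></foo>", "doc.xml"))  # -> final
```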
Example 1 used an XML authoring tool, which is already a big
step for writing a document. Many XML documents are produced from
templating languages, sometimes in the code itself, sometimes in a file
with variable substitution. Some of these languages were not designed
to be well-formed themselves (they contain non-XML constructs which
will be substituted).
Both a wrong template and the variable substitution itself are
possible sources of broken XML.
What are the requirements for creating better tools able to output
good XML content?
Something easy to integrate in a workflow: authoring libraries, etc.
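One direction, sketched here with Python's standard ElementTree as a
stand-in for such an authoring library: build the tree through an
XML-aware API so that escaping and nesting are handled by the library
rather than by the template author.

```python
import xml.etree.ElementTree as ET

entry = ET.Element("entry")
title = ET.SubElement(entry, "title")
title.text = "Cats & dogs <3"          # escaped automatically on serialization

print(ET.tostring(entry, encoding="unicode"))
# -> <entry><title>Cats &amp; dogs &lt;3</title></entry>
```

The same tree can then be written to disk or handed to the rest of the
workflow, which is the kind of integration mentioned above.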
# CONSUMING BROKEN XML
Then there is a lot of broken XML on the Web.
How do we improve the ecosystem? How do we repair it?
Being too strict usually has two *social* effects:
* people avoid using the language at all and go to another one: JSON,
HTML, etc.;
* people find non-standard recipes to recover the content:
non-interoperable recovering parsers.
If the recovery mechanism were well defined, it would help:
1. to create more well-formed (and sometimes valid) XML content;
2. to develop applications with strict parsers (some applications would
be more willing to go XML because less content would be broken).
The overall effect would be to make XML easier for people to use (good
karma) and to create more XML documents on the Web.
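For comparison, here is the pattern many consumers end up with today:
a strict parse first, then a fall-back to a recovering parser. lxml's
recover mode is used only as one example of such a recipe (and assumes
lxml is installed); its repairs are implementation-defined, which is
exactly the interoperability problem described above.

```python
import xml.etree.ElementTree as ET
from lxml import etree as lxml_etree

def parse_feed(raw):
    try:
        return ET.fromstring(raw), False           # strict parse, no recovery needed
    except ET.ParseError:
        recovering = lxml_etree.XMLParser(recover=True)
        return lxml_etree.fromstring(raw, parser=recovering), True

element, was_recovered = parse_feed("<feed><title>Broken & feed</title></feed>")
print(was_recovered, element.findtext("title"))
```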
# INTEGRITY OF XML DOCUMENTS
A recovered document MIGHT have lost its intended data integrity.
Why not have a mechanism to flag content which has been recovered,
such as:
* an XML attribute on the root element, e.g. xml:check="recovered" or
something similar;
* or an XML processing instruction (PI).
This warns people and processors that the information may contain poor
data. It helps to design grassroots quality-control mechanisms. The
information is visible *in* the document, not outside of it.
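A sketch of what the attribute variant could look like; xml:check is
the proposal above, not an existing standard, and a processing
instruction would work the same way.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

def flag_recovered(root):
    # Mark the root element so downstream consumers know this data
    # went through a recovery step and may have lost integrity.
    root.set("{%s}check" % XML_NS, "recovered")
    return root

feed = ET.fromstring("<feed><title>Recovered feed</title></feed>")
print(ET.tostring(flag_recovered(feed), encoding="unicode"))
# -> <feed xml:check="recovered"><title>Recovered feed</title></feed>
```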
--
Karl Dubost
Montréal, QC, Canada
http://www.la-grange.net/karl/