Re: Unicode versus XML language tagging

PILCH Hartmut Tue, 15 Jan 2002 06:28:28 -0800

> > 5 years ago. Plaintext remains a far more powerful concept and XML is
> > mostly a markup mechanism designed to overcome deficiencies in ASCII
> > that appears rather clumsy in a pure Unicode plaintext world.
>
> This characterization of XML is about as silly as it is possible to get.
> You might just as well claim that XML is designed to overcome world
> hunger. The one place where XML tries to overcome a specifically-ASCII
> limitation is where it *mandates Unicode support*.


I think you are misinterpreting the quotation.  What is meant (judging
from parts not quoted here) is:  XML character entities, language
tags and the like (as opposed to document type specific elements such as
"field 47 of a patent application") are ... rather clumsy ...

Indeed there are many levels between character coding and document
structures, and imho even the way SGML/XML handles document structures is
debatable.  It thus seems a good idea to standardise some of the
in-between things independently of SGML.  STX is one approach.  Another
would be to have a generic way of assigning characteristics to a certain
chunk of text, quite independent of whether these are language tags or
something else, making the semantics user-defined.  The basic form of this
imho universal construction is


   esc begin sep arg0 sep arg1 [ sep arg2 [ sep ... ] ] end

where only 'esc' needs to be a Unicode-defined character.
'begin' and 'end' could be any user-chosen bracket pair and
'sep' could be any character that is by its position (directly
after 'begin') defined to be the argument separator character.
arg0 would usually refer to something defined in the user's
hypertext system, e.g. a language tag or an emphasis tag.  There
could be any number of arguments, from zero upwards, but in most
cases one would have one argument: a piece of normal Unicode text
(which may not contain the 'sep' character).
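
To make the construction concrete, here is a minimal Python sketch of a
parser for it.  It is only an illustration under assumptions of my own:
the delimiters are fixed in advance ('%' as esc, '(' and ')' as the
bracket pair, '|' as sep) instead of being chosen per expression, and
the function names and the ("call", args) node shape are hypothetical,
not part of any existing system.

```python
# Assumed, fixed delimiters (hypothetical choice for this sketch).
ESC, BEGIN, END, SEP = "%", "(", ")", "|"


def parse(text):
    """Parse text into a list of nodes.

    A node is either a plain str or a tuple ("call", args), where each
    argument is itself a list of nodes, so expressions may nest.
    """
    nodes, _ = _parse_seq(text, 0, top=True)
    return nodes


def _parse_seq(text, pos, top):
    """Read nodes until the end of text, or until SEP/END inside a call."""
    nodes, buf = [], []

    def flush():
        if buf:
            nodes.append("".join(buf))
            buf.clear()

    while pos < len(text):
        if text.startswith(ESC + BEGIN, pos):
            flush()
            args, pos = _parse_call(text, pos + len(ESC + BEGIN))
            nodes.append(("call", args))
        elif not top and text[pos] in (SEP, END):
            break  # the enclosing call handles this delimiter
        else:
            buf.append(text[pos])
            pos += 1
    flush()
    return nodes, pos


def _parse_call(text, pos):
    """Read the arguments of one expression; pos is just after 'esc begin'."""
    args = []
    while True:
        arg, pos = _parse_seq(text, pos, top=False)
        args.append(arg)
        if pos >= len(text):
            raise ValueError("unterminated expression")
        if text[pos] == END:
            return args, pos + 1
        pos += 1  # skip SEP and read the next argument


print(parse("this is %(bold|not) true."))
# -> ['this is ', ('call', [['bold'], ['not']]), ' true.']
```

The parser attaches no meaning to arg0; what 'bold' or any other name
does is left entirely to whatever consumes the node list, which is the
point of keeping the semantics user-defined.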

Thus, assuming I define

        esc: %
        begin, end: ( )
        sep: |

I could have the following expansions

        this is %(bold|not) true.

                ==> this is <bold>not</bold> true.

        SGML stands for %(SGML)

                ==> SGML stands for Standard Generalized Markup Language

        My %(ref|mlht|system) also uses this syntax

           ==> My <a href="http://mlht.ffii.org">system</a> also ...

or, mixed language text for perfect multilingual typesetting in the
following manner:

        %(lang|ja|Nihonjin no tame no %(lang|zh|Zhongwen) kyoukasyo)
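
Expansions of this kind, including the nested case, can be sketched in
a few lines of Python by rewriting the innermost expression first and
repeating until nothing is left to expand.  Everything specific here is
an assumption for illustration: the handler names, the two lookup
tables, and the choice of a <span xml:lang="..."> element for 'lang'
are mine, and the shortcut only works while expanded text contains no
parentheses or '|' characters.

```python
import re

# An innermost expression: "%(" ... ")" with no parentheses in between.
INNERMOST = re.compile(r"%\(([^()]*)\)")

# Hypothetical user-defined tables standing in for a real hypertext system.
ABBREVIATIONS = {"SGML": "Standard Generalized Markup Language"}
LINKS = {"mlht": "http://mlht.ffii.org"}


def expand_one(match):
    """Expand a single innermost expression into XML-like text."""
    args = match.group(1).split("|")
    name, rest = args[0], args[1:]
    if name == "bold" and len(rest) == 1:
        return f"<bold>{rest[0]}</bold>"
    if name == "ref" and len(rest) == 2 and rest[0] in LINKS:
        return f'<a href="{LINKS[rest[0]]}">{rest[1]}</a>'
    if name == "lang" and len(rest) == 2:
        # Assumed mapping: a generic element carrying xml:lang.
        return f'<span xml:lang="{rest[0]}">{rest[1]}</span>'
    if not rest and name in ABBREVIATIONS:
        return ABBREVIATIONS[name]
    raise ValueError(f"no expansion defined for {match.group(0)!r}")


def expand(text):
    """Expand innermost expressions repeatedly until none remain."""
    while True:
        expanded = INNERMOST.sub(expand_one, text)
        if expanded == text:
            return text
        text = expanded


print(expand("%(lang|ja|Nihonjin no tame no %(lang|zh|Zhongwen) kyoukasyo)"))
```

On the first pass only the inner %(lang|zh|...) matches, because the
outer expression still contains parentheses; the second pass then wraps
the result in the 'ja' element.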

Imho the content side ('lang', 'ja' etc.) does not need to be defined in
Unicode.  It is enough to have only the 'esc' symbol, i.e. the 'universal
functional expression prefix', as a Unicode character and leave the rest
to users to fill with life, perhaps proposing a few handy conventions such
as the one above.  Mapping such conventions to XML is of course easy, and
I would also propose directly integrating them into STX (structured text).
Indeed some books have been typeset in STX (e.g. the new Zope Book), which
is very close to plain text.  Extending STX with the above-proposed
'universal functional expression' syntax would make it so powerful that
SGML/XML-based markup would very rarely be needed.

-- 
Hartmut Pilch                                          http://phm.ffii.org/
Protecting Innovation against Patent Inflation       http://swpat.ffii.org/
100,000 signatures against software patents      http://www.noepatents.org/




--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
