[REPOST, LONG] XML and tags (LONG) (derives from Re: Plane 14 Tag Deprecation Issue)

Marco Cimarosti Fri, 21 Feb 2003 08:30:19 -0800

I sent this message yesterday but I didn't see it on the Unicode list.
Possibly this was because the ZIP contained two executable programs; I have
now removed them, but the ZIP still contains the source code.

BTW, I took the opportunity to correct a few grammar errors...

_ Marco

---------------------

(Warning: I have probably succeeded in the impossible task of being more
verbose than Mr. Overington. Please start reading only if you have some
free time... :-)

William Overington wrote:
>
> [... an interesting bibliography about XML ...]
>
> The more I read about XML the less reason there seems to be to use XML
> instead of tags!
>
> [... many interesting arguments ...]
>
> In particular, for the DVB-MHP (Digital Video Broadcasting -
> Multimedia Home Platform) there is a need to keep the programs
> as small as possible and to keep text files as small as
> possible.
>
> [... more interesting arguments ...]
>
> How would that be done using XML? Would it be done better 
> using XML than using tags? Why, or why not?
>
> [... even more interesting arguments and polite greetings ...]

I confess that I have not been patient enough to read *all* of Mr.
Overington's post. So, I apologize in advance if I have missed part or all
of William's point.

My job is to implement software based on written specifications which
represent my bosses' understanding of the requirements of our customers.
Unfortunately, the specifications that I receive are often verbose and fuzzy
like Mr. Overington's posts... :-) So I had to develop a survival strategy,
which is to quickly pass through the specification documents in search of
wording which might represent the core of what the customer actually wants.
Sometimes this works, sometimes not...

I will pretend that William is "Overington Inc.", one of the key
customers of the company I work for, and that they are asking me to
implement a protocol to send text over the famous "Overington Multimedia
Broadcasting (OMB)", with the following requirements:

        1. The text MUST be transmitted in UTF-8 (because the CEO of
Overington Inc. thinks that UTF-8 is cute).

        2. The transmission protocol MUST implement some form of language
tagging (the details of the protocol are up to me). Particularly, the system
needs to distinguish English text from Italian text, because the two
languages will be displayed in different colors (green and red,
respectively).

        3. The OveringtonHomeBox(tm) can only accept UTF-8 plain text
interspersed with escape sequences to change color. The escape sequences
have the form "{{color=1}}", where "1" is the id of a color (blue, in this
case).

        4. The text files being transmitted MUST be darn small (bandwidth is
limited!).

        5. The processing program MUST be darn small (on-board memory is
limited!).

        6. A working prototype must be ready by tomorrow.

What I am asked to do is to define the protocol in point 2, and to implement
a software filter to produce the plain-text stream in point 3. As
development time is very short, I cannot lose much time thinking about
it, so I have to choose one of the two solutions that are on top of my mind:

        P. Plane-14 language tags.

        X. XML.

I instinctively decide for solution P (because I assume that it would be
simpler and yield smaller files) and start defining my language tagging
protocol:

        P.1. According to the intended usage of plane-14 tags, each language
tag will be introduced by a u0E0001 (LANGUAGE TAG) and will terminate with a
u0E007F (CANCEL TAG).

        P.2. Between the begin and end characters of each tag, I will use a
single tag character to identify the language, in order to save space
(point 4):
         - u0E0065 (TAG LATIN SMALL LETTER E) switches to English;
         - u0E0069 (TAG LATIN SMALL LETTER I) switches to Italian;
         - u0E005E (TAG CIRCUMFLEX ACCENT) switches back to the previous
language.

Equipped with this simple protocol, I produce a sample text file: see
<wo.txt> in the attached ZIP file, containing the following text:

    "Let's learn the week days in Italian: 'Monday' is 'lunedì', 'Tuesday'
is" (...omitted...)

The English sentence is surrounded by tags u0E0001+u0E0065+u0E007F ...
u0E0001+u0E005E+u0E007F, while each embedded Italian word is surrounded by
tags u0E0001+u0E0069+u0E007F ... u0E0001+u0E005E+u0E007F.
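
To make the byte layout of these tag sequences concrete, here is a minimal
sketch of how one three-character tag could be written out in UTF-8. It is
not the code from the ZIP: the names utf8_encode and make_language_tag are
hypothetical, and the only things taken from the protocol above are the code
points u0E0001, u0E0000 plus the ASCII letter, and u0E007F.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point (up to U+10FFFF) as UTF-8; returns the byte count.
   The caller must provide room for at least 4 bytes. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL && cp <= 0x10FFFFUL);
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: write the tag for a one-letter ASCII id
   ('e', 'i' or '^') as LANGUAGE TAG + tag letter + CANCEL TAG.
   The caller must provide room for 12 bytes. */
size_t make_language_tag(char id, unsigned char *out)
{
    size_t n = 0;
    n += utf8_encode(0xE0001UL, out + n);                      /* begin  */
    n += utf8_encode(0xE0000UL + (unsigned char)id, out + n);  /* letter */
    n += utf8_encode(0xE007FUL, out + n);                      /* end    */
    return n;
}
```

Every character in the tag lies above U+FFFF, so each one takes four bytes
and a complete tag takes twelve.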

Now I need to write a program that converts this file into a file containing
color switching commands, such as:

    "{{color=2}}Let's learn the week days in Italian: 'Monday' is
{{color=4}}'lunedì'{{color=2}}, 'Tuesday' is"...

I begin writing a few utility functions to read and write UTF-8, to write
the color escape sequences, and to handle a simple stack data structure,
needed to implement tag u0E0001+u0E005E+u0E007F. See <wo_util.c>, in the
attached ZIP file.
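
For readers without the ZIP at hand, the stack and the escape-sequence
writer could look roughly like the sketch below. This is an illustration
only, not the contents of <wo_util.c>: the type ColorStack, the function
names and the nesting limit of 16 are all assumptions. The "{{color=N}}"
format is the one from requirement 3.

```c
#include <assert.h>
#include <stdio.h>

#define STACK_MAX 16   /* assumed nesting limit, not from the original */

/* A tiny fixed-size stack of color ids; the top is the current color. */
typedef struct {
    int colors[STACK_MAX];
    int depth;
} ColorStack;

void stack_init(ColorStack *s)
{
    assert(s != NULL);
    s->depth = 0;
}

/* Push a color; returns 0 (and does nothing) if the stack is full. */
int stack_push(ColorStack *s, int color)
{
    assert(s != NULL);
    if (s->depth >= STACK_MAX)
        return 0;
    s->colors[s->depth++] = color;
    return 1;
}

/* Drop the current color; popping an empty stack is a no-op. */
void stack_pop(ColorStack *s)
{
    assert(s != NULL);
    if (s->depth > 0)
        s->depth--;
}

/* Current color, or `fallback` when no language is active. */
int stack_top(const ColorStack *s, int fallback)
{
    assert(s != NULL);
    return s->depth > 0 ? s->colors[s->depth - 1] : fallback;
}

/* Format the escape sequence of requirement 3 into `out`;
   returns the length snprintf reports. */
int format_color(char *out, size_t size, int color)
{
    assert(out != NULL);
    return snprintf(out, size, "{{color=%d}}", color);
}
```

The stack is what makes "switch back to the previous language" work: the
closing tag pops the current color and the one underneath becomes current
again.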

Then, I implement my converter as a little program that reads the incoming
language-tagged file from standard input and writes on standard output the
plain text file containing the color escape sequences. See the source code
for the program in <wo_txt.c> in the attached ZIP file.

The resulting program, <wo_txt> (not included), can be run with the
following command line:

        wo_txt < wo.txt > out.txt
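
The core of such a converter can be sketched in a few dozen lines. The
version below works on memory buffers rather than standard input and output,
and it is a reconstruction under assumptions, not the code in <wo_txt.c>:
the name tags_to_colors is hypothetical, color ids 2 (English) and 4
(Italian) are taken from the sample output above, well-formed UTF-8 is
assumed, and nothing is emitted when the outermost language is closed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TAG_BEGIN  0xE0001UL
#define TAG_CANCEL 0xE007FUL
#define MAX_DEPTH  16          /* assumed nesting limit */

/* Decode one UTF-8 sequence (input assumed well formed); returns its length. */
static size_t utf8_decode(const unsigned char *p, size_t avail, unsigned long *cp)
{
    static const unsigned char lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    size_t n = (p[0] < 0x80) ? 1 : (p[0] < 0xE0) ? 2 : (p[0] < 0xF0) ? 3 : 4;
    size_t k;

    if (n > avail) {            /* truncated input: pass the byte through */
        *cp = p[0];
        return 1;
    }
    *cp = p[0] & lead_mask[n];
    for (k = 1; k < n; k++)
        *cp = (*cp << 6) | (p[k] & 0x3F);
    return n;
}

/* Replace Plane-14 language tags with "{{color=N}}" escapes.
   Returns the number of bytes written (excluding the final NUL). */
size_t tags_to_colors(const unsigned char *in, size_t len, char *out, size_t cap)
{
    int stack[MAX_DEPTH];
    int depth = 0, in_tag = 0;
    unsigned long id = 0;       /* first tag letter, as ASCII */
    size_t id_len = 0, i = 0, o = 0;

    assert(in != NULL && out != NULL && cap > 0);
    while (i < len) {
        unsigned long cp;
        size_t n = utf8_decode(in + i, len - i, &cp);

        if (cp == TAG_BEGIN) {
            in_tag = 1;
            id_len = 0;
        } else if (in_tag && cp == TAG_CANCEL) {
            int color = -1;
            in_tag = 0;
            if (id_len == 1 && (id == 'e' || id == 'i')) {
                if (depth < MAX_DEPTH) {
                    stack[depth++] = (id == 'e') ? 2 : 4;
                    color = stack[depth - 1];
                }
            } else if (id_len == 1 && id == '^') {
                if (depth > 0)
                    depth--;
                if (depth > 0)
                    color = stack[depth - 1];
            }                    /* any other tag is silently ignored */
            if (color >= 0 && o + 11 < cap)
                o += (size_t)snprintf(out + o, cap - o, "{{color=%d}}", color);
        } else if (in_tag && cp >= 0xE0020UL && cp <= 0xE007EUL) {
            if (id_len++ == 0)
                id = cp - 0xE0000UL;
        } else if (o + n < cap) {
            memcpy(out + o, in + i, n);   /* ordinary text: copy verbatim */
            o += n;
        }
        i += n;
    }
    out[o] = '\0';
    return o;
}
```

Output that does not fit in the buffer is simply dropped, which is good
enough for a sketch but not for a real filter.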

As I have a little more time before tomorrow, I also try to implement the
XML solution, just for the sake of comparison. With XML, the protocol will
be slightly different:

        X.1. I need to add some minimal syntactic paraphernalia to make the
file XML-compliant. As a minimum, I need a "<?xml...?>" declaration at the
beginning, and a root tag enclosing the whole text, which will be
"<wo>...</wo>".

        X.2. In order to save space, I will keep the same one-letter language
ids that I used before:
            - "<e> ... </e>" will enclose English text;
            - "<i> ... </i>" will enclose Italian text.
            As the tags are already closed by "</e>" and "</i>", I don't need
an equivalent of u0E0001+u0E005E+u0E007F.

I convert the sample text file to XML (see <wo.xml> in the attached ZIP
file), and here comes the first surprise: while the Plane-14 tagged file
<wo.txt> was 445 bytes long, the XML file is only 322 bytes long!

This seems strange at first: because of the "/", each pair of my XML
language tags is one character longer than the corresponding pair of
Plane-14 tags. Moreover, the syntactical overhead in X.1 above cannot be
less than 30 characters. Of course, the reason for the 123-byte saving is
that, in UTF-8, the characters composing XML tags only take one byte each,
while Plane-14 tag characters take four bytes each.
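
The per-pair arithmetic can be checked with a small sketch. The function
names are hypothetical; the only inputs are the facts stated above: a
Plane-14 tag is a begin character, the id characters and an end character at
four UTF-8 bytes each, while an XML pair is "<name>" plus "</name>" at one
byte per character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes taken in UTF-8 by one Plane-14 tag with `id_chars` id characters:
   begin + id + end, four bytes per character. */
size_t plane14_tag_bytes(size_t id_chars)
{
    return (1 + id_chars + 1) * 4;
}

/* Bytes taken by an opening and a closing Plane-14 tag with one-letter ids
   (for example 'e' to open and '^' to close). */
size_t plane14_pair_bytes(void)
{
    return 2 * plane14_tag_bytes(1);
}

/* Bytes taken by "<name>" plus "</name>" for an ASCII element name. */
size_t xml_pair_bytes(const char *name)
{
    size_t n;
    assert(name != NULL);
    n = strlen(name);
    return (n + 2) + (n + 3);
}
```

With one-letter ids a Plane-14 pair costs 24 bytes and an XML pair 7 bytes,
a difference of 17 bytes per pair, so an overhead of around 30 bytes for the
declaration and the root element is already recovered by the second pair of
tags.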

This little gain on point 4 of the requirements prompts me to continue with
the XML experiment. Of course, I guess implementing the converter program
must be much more complicated for XML than it was with plain text...

Surprisingly, it is not: the utility functions in <wo_util.c> are
still useful, and only a handful of modifications to <wo_txt.c> are necessary
to implement <wo_xml.c>.

The code that I wrote to interpret a sequence like u0E0001+u0E0065+u0E007F
... u0E0001+u0E005E+u0E007F works equally well for a sequence like "<e> ...
</e>". Moreover, the same code that I wrote to ignore unknown Plane-14
tags (e.g., u0E0001+u0E0067+u0E007F) works equally well with unknown XML
tags (e.g., "<wo>" or "<?xml version="1.0"?>").
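
One way to see why the same logic serves both formats: once the text between
the delimiters has been extracted (the tag letters mapped back to ASCII for
Plane-14, or the characters between "<" and ">" for XML), a single function
can decide what to do with it. The sketch below is an assumption about how
this could be organized, not a description of <wo_xml.c>; the enum and the
function name are made up for the illustration.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    ACT_IGNORE,         /* unknown tag: skip it, emit nothing */
    ACT_PUSH_ENGLISH,   /* start English text                 */
    ACT_PUSH_ITALIAN,   /* start Italian text                 */
    ACT_POP             /* return to the previous language    */
} Action;

/* Map a tag body to an action. `body` is the text between the delimiters:
   "e", "i", "^" for Plane-14 tags; "e", "/e", "i", "/i" for XML tags. */
Action classify_tag(const char *body)
{
    assert(body != NULL);
    if (strcmp(body, "e") == 0)
        return ACT_PUSH_ENGLISH;
    if (strcmp(body, "i") == 0)
        return ACT_PUSH_ITALIAN;
    if (strcmp(body, "^") == 0 ||
        strcmp(body, "/e") == 0 || strcmp(body, "/i") == 0)
        return ACT_POP;
    return ACT_IGNORE;  /* "wo", "/wo", "?xml ...?", "!DOCTYPE ..." etc. */
}
```

Anything that is not a known language id falls through to ACT_IGNORE, which
is how declarations and the root element pass through unnoticed.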

The only complication that I had to introduce in <wo_xml.c> is a function to
decode character entities such as "&gt;" or "&#x4e00;". But, after all,
that's just a few lines of code. Not implementing this would have meant
that the characters "<" and "&" could not be transmitted.
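
An entity decoder of this kind really is short. The following sketch is not
the function from <wo_xml.c>; the name decode_entity and its interface are
assumptions. It handles the five entities predefined by XML plus decimal and
hexadecimal character references, and leaves it to the caller to re-encode
the resulting code point as UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode the entity starting at s[0] == '&' in a NUL-terminated string.
   On success stores the code point in *cp and returns the number of
   characters consumed (including '&' and ';'); returns 0 otherwise. */
size_t decode_entity(const char *s, unsigned long *cp)
{
    static const struct { const char *text; unsigned long cp; } named[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    size_t k;

    assert(s != NULL && cp != NULL);
    for (k = 0; k < sizeof named / sizeof named[0]; k++) {
        size_t n = strlen(named[k].text);
        if (strncmp(s, named[k].text, n) == 0) {
            *cp = named[k].cp;
            return n;
        }
    }
    if (s[0] == '&' && s[1] == '#') {
        int hex = (s[2] == 'x');
        size_t start = hex ? 3 : 2, i;
        unsigned long v = 0;

        for (i = start; ; i++) {
            int c = (unsigned char)s[i], d;
            if (c >= '0' && c <= '9')
                d = c - '0';
            else if (hex && c >= 'a' && c <= 'f')
                d = c - 'a' + 10;
            else if (hex && c >= 'A' && c <= 'F')
                d = c - 'A' + 10;
            else
                break;
            v = v * (hex ? 16UL : 10UL) + (unsigned long)d;
            if (v > 0x10FFFFUL)
                return 0;        /* beyond the Unicode range */
        }
        if (i > start && s[i] == ';') {
            *cp = v;
            return i + 1;
        }
    }
    return 0;                    /* not an entity this sketch knows */
}
```

A caller that gets 0 back can simply copy the "&" through unchanged.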

The resulting program, <wo_xml> (not included), works in the same fashion as
the other program, and gives the same output:

        wo_xml < wo.xml > out.txt

Now I try to use those 123 spare bytes to add some more flesh. The
attached ZIP file contains a second version of the sample XML text:
<wo_cute.xml>.

You may have noticed that, although <wo_cute.xml> is much bigger than
<wo.xml> (but still slightly smaller than <wo.txt>!), passing it through
<wo_xml.exe> results in exactly the same output file.

This is because the extra declarations that I added ("encoding=...",
"<?xml-stylesheet...?>", "<!DOCTYPE...>") are simply skipped by the
bare-bones XML processor embedded in our Overington Multimedia Broadcasting
system.

But this information, if present, allows you to publish the *same* material
on both your proprietary system and on other media, such as the Web.

If you put <wo_cute.xml>, <wo.css> and <wo.dtd> in the same directory, and
open <wo_cute.xml> with a decently recent browser, you will see that the
English text will automagically be displayed in green and Italian text in
red, exactly as they are supposed to appear on the Overingtonian system.

But size and Web-compatibility are not the only advantages of the XML
solution:

        a. An XML file is human readable and may be edited with any text
editor; although the Plane-14 file claims to be "plain text", each language
tag appears as three black boxes in any UTF-8 editor (and as twelve random
"accented" characters in a non-UTF-8 editor).

        b. There are plenty of out-of-the-box utilities that everybody can use
to edit, view and validate XML files. Many of these utilities are free of
charge, or come bundled with operating systems.

        c. Every bookshop round the world sells books about XML, and in
every town in the world you can easily hire XML programmers. So you don't
need a big effort to train your content engineers: just hire "XML people"
and give them your DTD file...

        d. XML is built on top of Unicode, but *not* bound to it. Imagine
that the CEO of Overington Inc. comes and says that he also wants support for
ISO 8859-1, JIS X 0208 and a third encoding of my choice... With the XML
solution, we just ask them to change the "encoding" declaration in the file,
and to allow us a couple of days to change the software. If we used
*Unicode* Plane-14 tags, what would we tell them?

        e. XML is extensible. Once we have fixed their silly requirement for
green English and red Italian, they will perhaps ask for italic, bold,
pictures, sounds, etc. etc. With XML, defining the protocol for such things
is straightforward, so we only have to enhance our <wo_xml.exe> program (and
our <wo.css> for the Web edition). With Plane-14, what are you going to do?
Do you hope that the Unicode Consortium will accept adding u0E0002, u0E0003,
u0E0004, and so on, just to match your needs? Or do you want to go on
playing with your PUA experiments? And do you believe that all browser
producers will promptly follow you?

Ciao.

_ Marco

wo.zip
Description: Binary data
