Michael Everson <everson at evertype dot com> wrote:

>> 3.  Is there any method of tagging, anywhere, that is lighter-weight
>> than Plane 14?  (Corollary: Is "lightweight" important?)
>
> HTML and XML markup?

and <Peter_Constable at sil dot org> replied:

> Doug was already comparing the plane 14 characters to HTML and XML,
> and clearly considers the latter to be relatively heavy -- and
> certainly they are heavier.

Certainly I don't want to claim, as some have, that HTML and XML and
SGML are *very* heavy.  But there is definitely a difference.

HTML language tags (used here to include the slightly more complex XML
syntax as well) are, in simplified form, <lang="xx">, whereas Plane 14
tags are of the form ?xx, where ? represents U+E0001 and xx, the
language identifier, is spelled with the corresponding Plane 14 tag
characters.  (HTML allows the alternative form <lang=xx> without
quotation marks, but XML does not.)  In either case, there is clearly
more parsing to be done for HTML:

* the spelling of the tag "lang" must be checked;
* alternatively, it might be another type of tag altogether (not a
language tag);
* the equal sign = must be checked;
* there must be exactly 0 (HTML optional) or 2 quotation marks
surrounding the identifier;
* the greater-than sign > must be checked.
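
The checks listed above can be sketched roughly as follows.  This is
only an illustration, assuming the simplified <lang="xx"> form used in
this message rather than full HTML or XML syntax; the function name
parse_ascii_lang_tag is made up for the example.

```python
def parse_ascii_lang_tag(text, pos=0):
    """Return (identifier, index_after_tag), or None if no tag starts at pos.

    Hypothetical helper: handles only the simplified <lang="xx"> and
    <lang=xx> forms discussed here, not real HTML or XML.
    """
    if not text.startswith("<", pos):
        return None
    i = pos + 1
    # Check the spelling of the tag name; it might be another tag altogether.
    if not text.startswith("lang", i):
        return None
    i += 4
    # Check the equal sign.
    if not text.startswith("=", i):
        return None
    i += 1
    # Exactly 0 or 2 quotation marks may surround the identifier.
    quoted = text.startswith('"', i)
    if quoted:
        i += 1
    start = i
    while i < len(text) and text[i].isascii() and (text[i].isalnum() or text[i] == "-"):
        i += 1
    identifier = text[start:i]
    if not identifier:
        return None
    if quoted:
        if not text.startswith('"', i):
            return None
        i += 1
    # Check the greater-than sign.
    if not text.startswith(">", i):
        return None
    return identifier, i + 1
```

Every early return corresponds to one of the syntax checks in the list.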

Plane 14 tags begin with a single, dedicated code point that means
"language tag," so no syntax checking is needed at that point.  The
language identifier itself is encoded by dedicated code points, so
checking for "the end of the tag" is simpler (last character in the tag
range, or end of stream).
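
For comparison, here is a minimal sketch of reading a Plane 14 language
tag.  It assumes the text is already decoded to code points (a Python
str), and the name parse_plane14_lang_tag is invented for the example.
The tag characters U+E0020..U+E007E mirror printable ASCII at an offset
of 0xE0000.

```python
LANGUAGE_TAG = "\U000E0001"              # U+E0001 LANGUAGE TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E   # tag characters mirroring ASCII 0x20..0x7E

def parse_plane14_lang_tag(text, pos=0):
    """Return (identifier, index_after_tag), or None if no tag starts at pos.

    Hypothetical helper for illustration only.
    """
    # A single dedicated code point announces the tag; nothing else to check.
    if not text.startswith(LANGUAGE_TAG, pos):
        return None
    i = pos + 1
    chars = []
    # The tag ends at the first character outside the tag range,
    # or at the end of the stream.
    while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
        chars.append(chr(ord(text[i]) - 0xE0000))
        i += 1
    return "".join(chars), i
```

There is no name to spell-check, no delimiter to match, and no quoting
rule; an empty identifier is what a caller would see just before the
U+E007F of a cancel sequence.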

Parsing the cancel tag is likewise simpler:  </lang> vs. U+E0001
U+E007F.  For that matter, a Plane 14 cancel tag is not always
necessary, which is not true in HTML.

Any syntax checking of the identifier itself (e.g. "en" is valid but
"em" is not) must be performed regardless of the mechanism, so neither
approach holds an advantage there.

Peter continued:

>> 2.  What extra processing is necessary to ignore Plane 14 tags that
>> wouldn't be necessary to ignore any other Unicode character(s)?
>
> None. And if some form of light-weight markup were used, then there
> would inevitably be a need for some kind of character escape
> mechanism, so ignoring language tagging would still entail
> interpreting of the
> escapes. E.g.
>
> #LT=en#This is English text, #LT=fr# et ce texte ci est en français.
> #LT=en#To use the pound character in text, as in "He's in room ##4,"
> you have to encode it twice.

Exactly.  With the dedicated code points in Plane 14, you don't need
either the closing tag or the double-# escaping scheme.
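
To make the difference concrete, here is a rough sketch of what
"ignoring the tags" costs in each scheme.  Both function names are
invented, and the #LT=xx# handling assumes only the rules visible in
Peter's example (a tag runs from one # to the next, and ## stands for a
literal #).

```python
def strip_plane14_tags(text):
    """Drop every code point in the Tags block, U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

def strip_hash_markup(text):
    """Drop #LT=xx# tags and undo the ## escape (hypothetical scheme)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "#":
            out.append(text[i])
            i += 1
        elif text.startswith("##", i):
            # Escaped pound sign: emit one literal '#'.
            out.append("#")
            i += 2
        else:
            # Start of a tag: skip through its closing '#'.
            end = text.find("#", i + 1)
            if end == -1:
                # Unterminated tag: keep the remainder unchanged.
                out.append(text[i:])
                break
            i = end + 1
    return "".join(out)
```

The Plane 14 version is a plain filter over code points, while the
ASCII version has to interpret escapes even when it only wants to throw
the tags away.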

I am not arguing that it takes Herculean effort to program a parser for
ASCII-based language tags, only that Plane 14 tags are even simpler, and
that some text applications call for the simpler mechanism.

-Doug Ewell
 Fullerton, California

