[EMAIL PROTECTED] writes:
> > > You might as well say that C code is not plain text because it too is
> > > subject to special canons of interpretation.
> > 
> > C, C++ and Java source files are not plain text either (they have their own
> 
> C, C++ and Java source files are plain text.
> 
> > "text/*" MIME type, which is NOT "text/plain" notably because 
> > of the rules
> 
> I've seen text/cpp and text/java, but really there are no such types. I've
> also seen text/x-source-code, which is at least legal, if of little value
> to interoperability.
> 
> The correct MIME type for C and C++ source files is text/plain. 

This is where I disagree: a plain text file makes no distinction between the
meta-linguistic meaning that characters have for the programming language that
uses and needs them, and the same characters used to form string constants or
identifier names.

Unicode cannot, and must not, specify how the meta-characters used in a
programming language combine with other strings that the language syntax
itself treats as _separate tokens_. This means that combining sequences MUST
NOT be formed across language token boundaries. Those boundaries are outside
the scope of Unicode but part of the specification of the language, and they
must be respected first, before any combining sequences are formed within the
_same_ token.
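
As a sketch of what this means in practice (an illustration of my own, not
taken from any real code, and assuming a UTF-8 encoded source file): a raw
combining mark can legally sit right next to characters that the language
treats as pure syntax, and no combining sequence may be formed with those
delimiters:

        /*́ this comment's text begins with a raw U+0301 COMBINING ACUTE
           ACCENT; as plain text the mark would attach to the '*' of the
           comment opener, but for the C language the opener is syntax and
           the mark is merely comment text */
        int dummy = 0;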

So even if "text/c", "text/cpp", "text/pascal" or "text/basic" are not
officially registered (but "text/java" and "text/javascript" are
registered...) it is important to handle text sources that aren't plain
texts as another "text/*" type, for example "text/x-other" or
"text/x-source" or "text/x-c" or "text/x-cpp".

> I'd be prepared 
> to give good odds that that is the case with Java source files as well.

As I said, "text/java" is the appropriate MIME type for Java source files...

> > associated with end-of-lines, notably in the presence of comments).
> 
> As source files (that is, at the stage in processing at which a human user
> can see the source and edit it), the only handling required for end-of-lines
> is conversion of new line function characters, the same as for any other use
> of plain text.
> 
> The treatment of end-of-lines as significant when processed (for example
> following one-line // comments) is a matter of what an application chooses
> to do with a particular character. This is no different than an indexer
> deciding that a plain text file contains a particular word, or for that
> matter my putting coffee filters into my basket if I see "coffee filters"
> written on my shopping list.

Just imagine what your assumption would produce with this source:
        const wchar_t c = L'?';
where ? stands for a combining character. Using the text/plain content type
for this C source would imply that the combining character combines with the
preceding single quote. That creates an opportunity for canonical composition,
and thus an "equivalent" source file which would be:
        const wchar_t c = L§';
where § stands for the composed character. Now the source file contains a
syntax error and no longer compiles, even though the original source compiled
and gave the constant c the value of the code point encoding the ?
diacritic...

Of course the programmer can avoid this nightmare by using a universal
character name (a numeric escape), as in:
        const wchar_t c = L'\u0309';
or maybe (though this is less portable, as it assumes that the runtime
encoding form used by wchar_t is UCS-4, UTF-16 or UCS-2, when the source file
may be coded in a non-Unicode charset):
        const wchar_t c = (wchar_t)0x000309ul;
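
Put together, here is a minimal compilable sketch of that escape-based
workaround (the variable names are mine, and the second form assumes that
wchar_t uses a Unicode encoding form):

        #include <wchar.h>

        int main(void) {
            /* No raw combining mark appears in the source text, so there is
               nothing a normalizer could recombine with the quotes. */
            const wchar_t c = L'\u0309';        /* universal character name */
            const wchar_t d = (wchar_t)0x0309;  /* assumes Unicode-based wchar_t */
            return (c == d) ? 0 : 1;            /* 0 when both denote U+0309 */
        }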

> > > But both XML/HTML/SGML and the various programming languages are plain
> > > text.
> > 
> > See "text/xml", "text/html" and "text/sgml" MIME types. They also aren't
> > "text/plain" so they have their own interpretation of Unicode characters
> > which is not the one found in the Unicode standard.
> 
> They have their own interpretation of the Unicode characters which is
> *in addition to*

This is not *in addition to* but *instead of*, and thus it breaks the rule of
Unicode conformance at that level: the code point no longer carries the
meaning REQUIRED of conforming applications, namely that of a code point
encoding an abstract character with a well-defined representative glyph and
REQUIRED composability with the surrounding characters.

> the one found in the Unicode standard. As do all but the simplest
> applications that use Unicode (as interesting as many of them are,
> characters are of little use on their own).

Note that a simple text editor such as Notepad can safely be used to edit
source files, simply because it does not attempt to perform any normalization
of the files it loads or saves, even while editing them (there is not even an
Edit menu option to normalize a selected area of the text in the edit
buffer).

Most editors for programming languages treat individual characters as truly
individual and completely unrelated to each other. This means that they will
not attempt any normalization, so characters will not be reordered or
recomposed. This is an important and necessary requirement for programming
source files, but it is not required for plain text files.
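
For instance (a sketch of my own, assuming the file is held in memory as
UTF-8): the decomposed and precomposed spellings of "é" are canonically
equivalent as Unicode plain text, yet they are different byte sequences, so a
compiler comparing identifiers or literals byte by byte sees two different
things; an editor that silently rewrote one spelling as the other on save
would therefore change the program:

        #include <stdio.h>
        #include <string.h>

        int main(void) {
            const char decomposed[]  = "e\xCC\x81"; /* 'e' + U+0301 in UTF-8 */
            const char precomposed[] = "\xC3\xA9";  /* U+00E9 in UTF-8 */
            /* Canonically equivalent as plain text, but not the same bytes
               and not the same token to a compiler. */
            printf("%d\n", strcmp(decomposed, precomposed) != 0); /* prints 1 */
            return 0;
        }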
