> > I've seen text/cpp and text/java, but really there are no such 
> > types. I've also 
> > seen text/x-source-code which is at least legal, if of little value to 
> > interoperability.
> > 
> > The correct MIME type for C and C++ source files is text/plain. 
> 
> This is where I disagree: 

Bring forth the proofs.

> a plain text file makes no difference of
> interpretation between their meta-linguistic meaning for the programming
> language that uses and need it, and the same characters used to create
> string constants or identifier names.

Yep. No distinction whatsoever.

> Unicode cannot, and must not, specify how the meta-characters used in a
> programming language must combine with other actual strings that are treated
> by the language syntax itself as _separate tokens_. This means that the
> concept of combining sequences MUST NOT be used across all language token
> boundaries. These boundaries are out of the spec of Unicode, but part of the
> spec for the language, and they must be respected at the first level even
> before trying to create other combining sequences within the _same_ token.

C and C++ both describe how a compiler deals with the text it receives; beyond 
saying that source files must be converted to the "native" character set, they 
have nothing further to say about the way characters and control characters 
may or may not affect each other.

Unicode describes how characters may or may not affect each other, as well as 
specifying some encoding forms.

The interface between these scopes is problematic, but the problems aren't 
solved by declaring that a compiler which chooses to use Unicode text somehow 
doesn't have to play by Unicode's rules.

> So even if "text/c", "text/cpp", "text/pascal" or "text/basic" are not
> officially registered (but "text/java" and "text/javascript" are
> registered...)

You mean "So even if "text/c", "text/cpp", "text/pascal", "text/basic", 
"text/java" or "text/javascript" are not officially registered", surely?

> it is important to handle text sources that aren't plain
> texts as another "text/*" type, for example "text/x-other" or
> "text/x-source" or "text/x-c" or "text/x-cpp".

They'd still have to be treated as text/* types, including the fallback 
behaviour of treating them as text/plain if the charset is known and as 
application/octet-stream otherwise.

> > I'd be prepared 
> > to give good odds that that is the case with Java source files as well.
> 
> As I said "text/java" is the appropriate MIME type for Java source files...

I see no text/java. This could be my eyesight, but find-and-replace can't find 
it either.

java.sun.com uses text/plain to transmit at least some Java source files (I 
did a small survey; I have no intention of HEADing every one of the 4630 URIs 
I found there ending in .java).

> Just imagine what would be created with your assumption with this source:
>       const wchar_t c = L'?';
> where ? is a combining character. Using the text/plain content type for this
> C source would imply that it combines with the previous single-quote. This
> would create an opportunity for canonical composition, and thus would create
> an "equivalent" source file which would be:
>       const wchar_t c = L§';
> where this § character is a composed character. Now the source file contains
> a
> syntax error and does not compile, even though the previous source compiled
> and was giving to the c constant the value of the codepoint coding the
> ? diacritic...

Identifying a problem does not mean you have automatically found the solution, 
still less that the solution you hit upon is already prescribed by the 
relevant standards.

I don't see your example being much use as source, though; human readers are 
hardly likely to find an apostrophe with a cedilla below it and a circumflex 
above it particularly readable code (whereas compilers would currently not 
have much difficulty, since there are no decompositions beginning with either 
U+0022 or U+0027).

> Of course the programmer could avoid this nightmare by using numeric
> character
> references as in:
>       const wchar_t c = L'\U000309';

\u must be followed by four hexadecimal digits, \U by eight.

The biggest advantage of L'\u0309' over direct use of the combining character 
is that you can read the thing (source is intended for human readers as well 
as compilers; the infamous and aptly, if crudely, named brainf**k is an 
example of what programming languages would look like if this were not the 
case).

Similarly, this also enables us to state explicitly the order of combining 
diacritics, which a programmer may conceivably want to do, but which neither 
C99 nor simple matters of legibility enable one to do with direct use of the 
characters.

--
Jon Hanna                   | Toys and books
<http://www.hackcraft.net/> | for hospitals:
                            | <http://santa.boards.ie>
