RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
From: [EMAIL PROTECTED] on behalf of Kenneth Whistler

> Athabascan languages in Canada are also written with practical orthographies such as these

At least two of which (Dogrib and one or both varieties of Slavey) use a cased glottal stop, not U+0027.

> Nobody is agitating for an uppercase apostrophe.

Not in Canada, that I know of. (I've seen indication of languages in Russia that have a case distinction for ' and possibly also another character.)

> For these, and thousands of other documents published on Athabascan languages over the last century, there was just a glottal stop -- not an uppercase and a lowercase glottal stop.

That's true of phonetic transcriptions. But for orthographies, there are some that have case.

> It is because that is what the IPA settled on for their prescriptive preference for the shape of a glottal stop. (Note: for a *glottal stop*, not for a *capital glottal stop*. The IPA does not have casing distinctions.)

Which only tells us that there should be no predisposition to consider the glottal stop upper rather than lower case, or vice versa. It does not tell us that the character cannot be involved in a case-pair relationship.

> The prestige of the IPA specification is such that many fonts have used that form as well. And, indeed, it influenced the choice for the Unicode representative glyph, which in turn has influenced what OS vendors have put in their fonts. So, while there are multiple different glyphs in print for a glottal stop (see Pullum & Ladusaw for different examples), most of which don't *look like* capital letters, the IPA glyph has become the preferred one, simply because IPA prefers it.

All of which is very germane to my argument.

> And that is unfortunate, because that one glyph is the one that people think *looks like* a capital letter, and which thus causes the confusion when an orthographic innovation decides it needs to introduce casing for it.

Not only think it *looks like* one, but behave as though it is one.

> Now I presume from Michael's assertion that there is some Athabascan community *somewhere* that has started to make an initial case distinction for glottal stop,

This thread began when I provided a scanned image.

> and that in the fonts they use, their uppercase glottal stop *looks like* the IPA glottal stop, and that for the body text they innovated a miniature of same. Hence the conclusion that we must treat the existing form as the *capital* and need to encode a new lowercase form.

That alone is not the basis of the argument. You have provided the basis for additional, strong argumentation yourself:

> > 0294 cannot be displayed using the lowercase glyph as its design as a cap-height letter is well established in many fonts. If a new upper-case glottal character is created, a distinct lowercase glottal would be needed, but then there would be two characters (0294 and the new UC glottal) that have exactly the same appearance and would get confused, with mixed-up and inconsistent data and processes for years to come.
>
> That, however, is utterly backward. It is clear in these cases, following 100 years of monocase usage of glottal stop, that the innovation (as in many adaptations of IPA) is to create an uppercase letter to go with the lowercase one.

This argument is completely empty, as it depends on the premise that the existing character can be considered the lowercase one. You have asserted it to be so, but you have not given reasoning why it must be considered so. That seems to be especially required given that you observe in the same breath that during the 100 years of its usage this character has been monocase.

Let's roll back the discussion for a moment. Suppose, before this thread had started up, someone had come along and said, "0294 is obviously caseless and has always been so; it should have a general category of Lo rather than Ll, just like the dental click (01C0) and other caseless phonetic symbols." Would you be able to make a compelling argument that it must be Ll and not Lo? I don't see how anyone possibly could. But to maintain the premise that this only-ever-monocase character is the lowercase one, you've got to have solid reasons to say it could not be Lo and must be Ll.

IMO, it takes an emperor willing to wear clothes spun with thread that only the wisest could see to say that, though the cap-height character the Dogrib and Slavey are using as a capital has *exactly* the same appearance and metrics as 0294, it is actually the thing that is half the height and has a different shape that is the same as 0294, and that this exact replica is really a new innovation.

Ken, you have not given reasons why 0294 cannot be considered uppercase -- no evidence that it has in the past been used as lowercase in a case pair, or that usage as an uppercase in a case pair would result in problems in implementation, usability, or management of data. You have merely asserted that the original character was a
RE: New symbols (was Qumran Greek)
> > and why aren't they linked together for us fringies?
>
> They are...

For some reason, my first thought was of Ford Prefect asking the fellow regarding the not-well-publicized plans to build a by-pass, "Have you ever thought of going into advertising?" :-)

> No, it is made from the River Liffey

When I was there in 1979, the river was introduced to me as "the whiffy Liffey", and I was told that the ships in the river scooped up the water and took it upstream, where they just put it into the bottles.

Peter
RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
From: [EMAIL PROTECTED] on behalf of Michael Everson

> > to use the kinds of uppercase glyph models used in similar instances of after-the-fact uppercase inventions based on IPA or other phonetic alphabets and usages.
>
> A modified capital P would probably do.

[??!!]

Michael, you've seen what they are using. How will the community be served when type designers start creating fonts that have a cap-height glyph for 0294 supplemented by a modified capital P?

If a band of Rumple-stiltskin Latins from Caesar's administration suddenly awoke from their 2000-year slumber, reviewed the situation and then pronounced, "This 'w' is not acceptable to us; you shall be permitted to inscribe an additional sound from your barbaric northern tongue using an O split in two parts, and one size is adequate," how excited with their decision do you think we'd be?

Peter Constable
RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
From: [EMAIL PROTECTED] on behalf of Kenneth Whistler

> > Unicode doesn't prevent styling, of course. But having 'logical' order instead of 'visual' makes it a hard task for the application and the renderer. This is witnessed by the thin-spread support for this.
>
> Yes...

Ken conceded the claim too readily. Glyph re-ordering due to a logical encoding order that is different from visual order may mean that certain types of styling (of the re-ordered character) may not be supported in some implementations, but it does *not* mean that this is, in general, a hard task. Style information is applied to characters, and as long as there is a 1:m association between characters and glyphs and there is a path to transform the styling information to match the character/glyph transformations, styling is in principle possible. (There's a constraint that styling might not be possible if the styling differences require different fonts but the glyph transformations that occur require rule contexts to span such a style boundary.) (Expecting one component of a precomposed character to be styled differently from the rest, however, would be somewhat hard.)

In particular, for reordering this is easy to demonstrate by considering a hypothetical complex-script rendering implementation in which processing is divided into two stages: character re-ordering, and glyph transformation. In the first stage, all that happens is that a string is mapped to a temporary string used internally only, in which characters are reordered into visual order. (Surrounding characters with no decomposition would be mapped into multiple internal-use-only virtual characters.) Thus, a styled string such as <string>k<span color="red">e</span></string> would transform in the first stage to <string><span color="red">e</span>k</string>. There is nothing hard in such processing. (Of course, whether it is harder to get people to implement support for one thing rather than another is an entirely different question.)

Peter Constable
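[Peter's two-stage model is easy to make concrete. Below is a minimal sketch -- hypothetical types, and a toy reordering rule standing in for real script knowledge; this is not code from any actual renderer. Because the style attribute is attached to each character, it simply travels with the character through the reordering stage, and the glyph stage never needs to know.]

#include <cstdio>
#include <utility>
#include <vector>

struct StyledChar {
    char32_t ch;    // code point
    int      style; // opaque style id (e.g. index into a style-run table)
};

// Toy rule: a "prefixed vowel" is stored after its consonant in logical
// order but displayed before it, as with Tamil U+0BC6.
static bool isPrefixedVowel(char32_t c) { return c == U'\u0BC6'; }

// Stage one: logical order -> visual order, styles travelling along.
static std::vector<StyledChar> toVisualOrder(std::vector<StyledChar> s) {
    for (std::size_t i = 1; i < s.size(); ++i)
        if (isPrefixedVowel(s[i].ch))
            std::swap(s[i - 1], s[i]);
    return s;
}

int main() {
    // Logical order: Tamil KA (unstyled) then vowel sign E (styled "red" = 1).
    std::vector<StyledChar> logical = {{U'\u0B95', 0}, {U'\u0BC6', 1}};
    for (const StyledChar& sc : toVisualOrder(logical))
        std::printf("U+%04X style=%d\n", (unsigned)sc.ch, sc.style);
    // Prints the vowel sign (still style 1) before the consonant (style 0).
    return 0;
}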
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 08/12/2003 15:51, Philippe Verdy wrote:

> ... Peter Kirk writes:
> > Agreed. But now we are told that the latter is illegal XML because a combining mark is not permitted (by XML, not by Unicode) after <span>.
>
> It is not forbidden by XML. It's just that handling an XML file (which is not plain text) as if it were Unicode plain text when performing normalization of the file may produce unexpected composition of characters which are part of the XML syntax. ...

Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.

Anyone, please, is it or is it not true that XML forbids, or will forbid in future versions, combining characters immediately after markup?

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
On 08/12/2003 16:17, Kenneth Whistler wrote:

> > Having an 'invisible consonant' to call for rendering of the vowel sign in isolation (and without the dotted circle) would also help the limited number of cases where the styled single character is needed - but in a rather hackish way.
>
> That is what SPACE as a base character is for. If some renderers insist on rendering such combinations with a dotted circle glyph, that is an issue in the renderer -- it is not a defect in the encoding standard for not having a way to represent the vowel sign in isolation.

SPACE is unsuitable for this function for at least two good reasons: 1) because of its word and line breaking characteristics; 2) because in a case like this no extra spacing is required. The vowel sign is a spacing character in itself, although a combining mark. SPACE is expected to add its own spacing. In the absence of clearly defined rules to the contrary, renderers will render this combination of SPACE with a Tamil vowel with an extra space which is not wanted. (As for which side of the vowel the space will appear on, that is anyone's guess!)

This is yet another example to add to a number that I have identified showing that the reuse of SPACE and NBSP as carriers for diacritics is an undesirable overloading of character semantics. I propose again a new base character for carrying combining marks, with no glyph and a width just as wide as that required to display the combining marks. The mechanism already defined for using SPACE and NBSP for this should be deprecated, though not abolished.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Peter Kirk writes:

> Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.

I don't know how you interpreted what I may have said a few days before. I have certainly not said that XML forbids combining marks at the start of XML, just that the W3C does not _recommend_ it, as with any other defective combining sequences, as they are known to cause problems (for example when it's difficult to track the effective text file type). That's the same for NFC: it's just a recommendation, not a requirement, and for XML there are no such canonical equivalents, just distinct strings. It's up to the application using the _parsed_ XML document tree to do, if needed, the normalization steps. But this should occur only _after_ the document has been parsed and possibly validated according to its schema.

Generally, normalization of strings will only occur in the very last step, just before outputting the result (for example for font rendering), but even at this step the font may provide information which may require glyph processing or character substitutions that are not well performed with just a normalized NFC form. So in fact, the XML application can/should perform its own necessary normalizations only at steps where it has a benefit, but not at the file stream level, as the XML stream itself is not plain text.
Re: [OT]
On 08/12/2003 17:29, Philippe Verdy wrote:

> ... Nota: when speaking about alcohol in public areas, we have to add here in France a mandatory legal notice: "L'abus d'alcool est dangereux pour la santé; appréciez et consommez-le avec modération." ["Alcohol abuse is dangerous for the health; appreciate and consume it in moderation."] ...

Despite your French notice about danger to the health (not to the sanity, though that might be true, too), Guinness was actually introduced as a health drink. I think the problem was that too many Irish people were spending their money on whiskey and not eating well, so Arthur Guinness introduced a drink that was so full of nutrients that you could live on it, and so heavy that you can't drink enough of it to get drunk!

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 09/12/2003 03:41, Philippe Verdy wrote:

> Peter Kirk writes:
> > Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.
>
> I don't know how you interpreted what I may have said a few days before. I have certainly not said that XML forbids combining marks at the start of XML, just that the W3C does not _recommend_ it, as with any other defective combining sequences, as they are known to cause problems (for example when it's difficult to track the effective text file type)

So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write:

e<span class="red-text">{U+0301}</span>

where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?

If this is correct, then the Tamil problem which Peter J is concerned about has gone away completely, or at least it is reduced to a tricky rendering issue.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Peter Kirk scripsit:

> Anyone, please, is it or is it not true that XML forbids, or will forbid in future versions, combining characters immediately after markup?

XML 1.0 is silent on the subject. The W3C Character Model (which is not official yet) says that content developers SHOULD "avoid composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup." XML 1.1 (which is not official yet either) references the Character Model and states which constructs are significant.

The technical meaning of SHOULD is defined by RFC 2119: "This word [...] means that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course."

-- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan It's the old, old story. Droid meets droid. Droid becomes chameleon. Droid loses chameleon, chameleon becomes blob, droid gets blob back again. It's a classic tale. --Kryten, _Red Dwarf_
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> Anyone, please, is it or is it not true that XML forbids, or will forbid in future versions, combining characters immediately after markup?

XML does not forbid it, but it does recommend that you avoid it.

Charmod defines include-normalization and full-normalization, which go beyond Unicode normalisation in guaranteeing that normalisation will not be altered through the various concatenations and inclusions that may occur in the processing of XML data. These do forbid it, though I don't think Charmod insists on their being used. The specification of an application of XML could cite Charmod and insist on include- or full-normalisation. In some cases this would have no real effect (in some data-orientated rather than document-orientated uses of XML); in others it would be a restriction on what could be done in the application.

Not forbidding it causes problems, the most spectacular being the possibility of COMBINING LONG SOLIDUS OVERLAY causing a well-formed XML document to have a canonically equivalent (in both the Unicode and XML concepts of c14n, since the latter makes use of NFC) document that was not well-formed XML.

Colouring of diacritics can be performed through other means. http://www.w3.org/TR/charmod/benoit.svg is an SVG example. This seems a superior method for at least some of the use-cases cited anyway (I've missed some of this thread though).

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
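[Jon's "most spectacular" case is worth spelling out: U+003E '>' followed by U+0338 COMBINING LONG SOLIDUS OVERLAY composes under NFC into U+226F NOT GREATER-THAN, so normalizing a serialized document whose element content begins with U+0338 consumes the '>' that closed the start-tag. A toy demonstration -- the composer below knows only this one pair; a real implementation would use a full normalizer such as ICU's.]

#include <cstdio>
#include <string>

// Toy NFC step handling exactly one canonical composition:
// U+003E '>' + U+0338 -> U+226F NOT GREATER-THAN.
static std::u32string toyComposeNFC(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (!out.empty() && out.back() == U'>' && c == U'\u0338')
            out.back() = U'\u226F';
        else
            out.push_back(c);
    }
    return out;
}

int main() {
    std::u32string xml = U"<p>\u0338x</p>"; // content begins with U+0338
    std::u32string nfc = toyComposeNFC(xml);
    // The '>' that closed the start-tag has become part of U+226F;
    // the result is no longer well-formed XML.
    std::printf("start-tag survives: %s\n",
                nfc.find(U"<p>") != std::u32string::npos ? "yes" : "no");
    return 0;
}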
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Philippe Verdy scripsit:

> When in doubt, don't perform any normalization of XML _files_ as they are NOT plain text: you need an XML parser to do it safely, only in relevant sections of the file. All you could do safely is to possibly reencode XML files (for example from the UTF-8 to the UTF-16 encoding scheme).

This is wildly overstated. XML files most certainly are plain text, though they may be interpreted as fancy text in contexts that understand XML. With the insignificant exception of markup immediately followed by a U+0338 character, it is entirely safe to normalize XML files according to any normalization. (It is true that NFK* normalization forms may lose information, but XML document authors are discouraged from using compatibility decomposables in any case.)

What is not allowed, and this makes XML technically non-conformant to the Unicode Standard, is to make arbitrary and unsystematic replacements of one canonically equivalent form with another. For example, if an element name is "hétérogénéité" (a favorite word of mine), decomposing the start-tag while leaving the end-tag composed would make the document no longer well-formed XML. In my opinion, this is a corner case that may be safely ignored.

-- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] 'Tis the Linux rebellion / Let coders take their place, The Linux-nationale / Shall Microsoft outpace, We can write better programs / Our CPUs won't stall, So raise the penguin banner of / The Linux-nationale.
Re: [OT]
> Despite your French notice about danger to the health (not to the sanity, though that might be true, too), Guinness was actually introduced as a health drink. I think the problem was that too many Irish people were spending their money on whiskey and not eating well, so Arthur Guinness introduced a drink that was so full of nutrients that you could live on it, and so heavy that you can't drink enough of it to get drunk!

Alas, you can easily drink enough of it to get drunk, even if you don't like being drunk, since its delicious taste will lead you to exceed your limit.

Stout was indeed given as a health drink in small doses in certain cases; it's one of the few foods that are a good source of both iron and calcium. However, the only doctor I've heard of recommending it in recent years was a bone specialist who was trained in China; he professed a belief that Guinness was why the Irish had thicker bones than the Chinese in his experience. There are considerably more doctors who would say that if you were going to drink a beer it should be stout, without going so far as to actually recommend it in and of itself.

A pint of plain's your only man.

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write: e<span class="red-text">{U+0301}</span> where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?

You can; whether you should is another thing, and whether it would render correctly yet another.

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
[EMAIL PROTECTED] writes:

> What is not allowed, and this makes XML technically non-conformant to the Unicode Standard

Where did you see that XML files need to be conformant to the Unicode standard? XML files are definitely NOT plain text (if this were the case, then it would be forbidden to interpret < as a special markup character instead of as the standard Unicode base character with its associated glyph)... _Only_ fragments of XML files are plain text and fully conformant to the Unicode standard.
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
-----Original Message-----
From: Peter Kirk [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 9 December 2003 13:17
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

> On 09/12/2003 03:41, Philippe Verdy wrote:
> > Peter Kirk writes:
> > > Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.
> >
> > I don't know how you interpreted what I may have said a few days before. I have certainly not said that XML forbids combining marks at the start of XML, just that the W3C does not _recommend_ it, as with any other defective combining sequences, as they are known to cause problems (for example when it's difficult to track the effective text file type)
>
> So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write: e<span class="red-text">{U+0301}</span> where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?

That's right: the text element within <span> just contains the string with the isolated diacritic; it is already in NFC form even though it is defective. And it must not be parsed by creating a combining sequence that includes the terminating </span> tag (interpretation of combining sequences is only valid within plain text, and thus excludes syntactic characters used in XML).

Note that this is not specific to XML. Any text/* format that is not plain text (notably programming source files, shell scripts, HTML files, stylesheets, and JavaScript files) should be handled this way, where the syntax of the language governs the rules for parsing it, before even trying to apply Unicode definitions to the parsed tokens of that language. So normalization should never be performed on whole files that are not explicitly of file type text/plain (either with explicit metadata such as MIME headers during transmission, or locally with OS-specific conventions on file extensions such as .txt).

When in doubt, for example in CVS repositories or in diff/merge tools, normalization must not be performed, and the current encoding form of text files must be preserved, whenever the tool does not implement an accurate parser for the syntactic and lexical rules of the effective file type or language, which may or may not accept defective combining sequences as valid plain-text strings (this includes identifiers; however, Unicode recommends a list of characters that can be used to start an identifier, and this list excludes all non-starter combining characters).
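[Philippe's "normalize after parsing, not at the stream level" approach is straightforward to sketch: parse first, then normalize only the text nodes, so markup characters are never touched. This assumes libxml2 for parsing; nfc() is an identity stand-in for a real normalizer (e.g. ICU's), and the document string is hypothetical.]

#include <libxml/parser.h>
#include <cstring>
#include <string>

// Stand-in: replace with a real NFC implementation (e.g. ICU).
static std::string nfc(const std::string& s) { return s; }

// Walk the tree and normalize text nodes only; element and attribute
// names (the markup) are left exactly as parsed.
static void normalizeTextNodes(xmlNode* node) {
    for (xmlNode* cur = node; cur != nullptr; cur = cur->next) {
        if (cur->type == XML_TEXT_NODE && cur->content != nullptr) {
            std::string t = nfc(reinterpret_cast<const char*>(cur->content));
            // (Real code would escape '&' and '<' before setting content.)
            xmlNodeSetContent(cur, reinterpret_cast<const xmlChar*>(t.c_str()));
        }
        normalizeTextNodes(cur->children);
    }
}

int main() {
    const char* doc = "<p>caf\xC3\xA9</p>"; // UTF-8 for "café"
    xmlDocPtr d = xmlReadMemory(doc, (int)std::strlen(doc), "in.xml", nullptr, 0);
    if (d != nullptr) {
        normalizeTextNodes(xmlDocGetRootElement(d));
        xmlFreeDoc(d);
    }
    return 0;
}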
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Philippe Verdy scripsit:

> XML files are definitely NOT plain text (if this were the case, then it would be forbidden to interpret < as a special markup character instead of as the standard Unicode base character with its associated glyph)...

You might as well say that C code is not plain text because it too is subject to special canons of interpretation. But both XML/HTML/SGML and the various programming languages are plain text: they are written with plain-text editors, manipulated with plain-text tools, and can be rendered with plain-text renderers. The fact that other things can be done with them is neither here nor there.

-- John Cowan http://www.ccil.org/~cowan [EMAIL PROTECTED] http://www.reutershealth.com | "In my last lifetime, I believed in reincarnation; in this lifetime, I don't." --Thiagi
Re: New symbols (was Qumran Greek)
> > http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2676.pdf is a complete listing of new symbols to go into Unicode
>
> Thanks! -- how many Web sites do you all have?

http://www.evertype.com/formal.html is a good link to what Michael Everson is doing. http://www.dkuug.dk/jtc1/sc2/wg2/docs/ is an index page with all the proposals and documents, if you want to keep up with the new proposals.
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Hi Peter, All,

Peter Kirk [EMAIL PROTECTED] wrote:

> [...] [About e<span class="red-text">&#x0301;</span> being correct HTML] [...] If this is correct, then the Tamil problem which Peter J is concerned about has gone away completely, or at least it is reduced to a tricky rendering issue.

Jungshik and Martin have already voted that

<span style='color:#00f'>&#x0BB2;</span>&#x0BC6;

is perfectly valid HTML, and I assume the same holds for

&#x0BB2;<span style='color:#00f'>&#x0BC6;</span>

But, seeing real-life user agents mishandle this, and being confronted with the task of writing a converter from legacy Tamil encodings (in visual order), there is some temptation to mark this up as:

{INV}&#x0BC6;<span style='color:#00f'>&#x0BB2;</span>

or respectively

<span style='color:#00f'>{INV}&#x0BC6;</span>&#x0BB2;

with {INV} being the hypothetical, not-spacing-adding, invisible consonant. But a) {INV} doesn't exist (so far), and b) the user agents I tested render {SPACE}&#x0BC6; with the misguided dotted circle. So, I can easily withstand this temptation (for now).

Regards, Peter Jacobi
RE: [OT]
[EMAIL PROTECTED] wrote:

> Stout was indeed given as a health drink in small doses in certain cases; it's one of the few foods that are a good source of both iron and calcium. However, the only doctor I've heard of recommending it in recent years was a bone specialist who was trained in China; he professed a belief that Guinness was why the Irish had thicker bones than the Chinese in his experience. There are considerably more doctors who would say that if you were going to drink a beer it should be stout, without going so far as to actually recommend it in and of itself.

You'll also find interesting studies about why French people experience low levels of heart and vessel disease even though they eat above-average quantities of fat and sugar. One reason is that they drink wine. A very moderate absorption of alcohol is beneficial to the health, but this is true ONLY if you compare populations drinking NO alcohol with those that drink just a little. This is easy to see when comparing children that drink no alcohol, who experience more heart/vessel diseases than those that get a few millilitres of alcohol each day (a small and beneficial absorption of alcohol is possible just by eating fruit, or by taking it in a medical form as a food supplement).

A very small daily absorption of alcohol helps the body to thin the blood and clean its vessels of excess fat. You don't need a lot (a full bottle of beer is not needed, but you cannot keep beer drinkable for long once it has been opened). That's why beer is not a recommended form of absorption of alcohol, as it requires you to exceed the sufficient level to get its benefit. On the other hand, you can open a 75cl bottle of wine at 6° (which contains 45ml of pure alcohol) and drink it over three days to have 15ml of pure alcohol each day (a reasonable and beneficial level for adults of an average weight of about 80kg). For children, due to their reduced weight, and thus reduced volume of blood, this quantity must be reduced accordingly, and this is possible by using medical forms of alcohol, which you will find in pharmacies in products that also contain essential oils, vitamins, and mineral supplements. You should know that even babies are sometimes given tiny quantities of alcohol within curative medications to help them recover from infectious diseases: the needed quantity is less than 5 millilitres.
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
[EMAIL PROTECTED] writes:

> Philippe Verdy scripsit:
> > XML files are definitely NOT plain text (if this were the case, then it would be forbidden to interpret < as a special markup character instead of as the standard Unicode base character with its associated glyph)...
>
> You might as well say that C code is not plain text because it too is subject to special canons of interpretation. But both XML/HTML/SGML and the various programming languages are plain text: they are written with plain-text editors, manipulated with plain-text tools, and can be rendered with plain-text renderers. The fact that other things can be done with them is neither here nor there.

The fact that plain-text renderers are used is not relevant here, as any normalization the renderer would use is hidden in the background, and the renderer does not expose the transformations it makes to the editor itself. Also, nobody uses an editor that performs implicit normalization of text when saving a file. If there is such an editor that can do it on the fly, this option should be disabled for source files. It's best for editors to allow the user to select the parts of the text to normalize, and then apply normalization only in those selected parts. A simpler editor could implement a global normalization, but this should be an explicit editing action by the user. For various reasons, I would not like to use any Unicode plain-text editor that implicitly normalizes the text without asking me, to work on programming source files or XML or HTML files. But I will accept it, if the editor really understands the language or XML syntax (and exhibits it to the user with syntax coloring).
Re: [OT]
On 09/12/2003 04:44, [EMAIL PROTECTED] wrote:

> > Despite your French notice about danger to the health (not to the sanity, though that might be true, too), Guinness was actually introduced as a health drink. I think the problem was that too many Irish people were spending their money on whiskey and not eating well, so Arthur Guinness introduced a drink that was so full of nutrients that you could live on it, and so heavy that you can't drink enough of it to get drunk!
>
> Alas, you can easily drink enough of it to get drunk, even if you don't like being drunk, since its delicious taste will lead you to exceed your limit.

I think the current version has been watered down, and strengthened in alcohol, compared to the original. And I am not thinking of being slightly woozy and not safe to drive; I am thinking of being blind drunk and unable to crawl home, which (I am told!) is much easier with whiskey.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 09/12/2003 05:13, [EMAIL PROTECTED] wrote:

> > So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write: e<span class="red-text">{U+0301}</span> where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?
>
> You can; whether you should is another thing, and whether it would render correctly yet another.

Well, users need to know whether they should do this, or what else they should do, when this is the effect they require; and implementers need to know whether they should work towards making this render correctly, to meet the demands of users, including the Tamil users in question. It seems that this is the simple and meaningful way of specifying the effect that is required. Rendering this is of course a challenge, but at least the requirement is clear.

Your alternative suggestion using SVG seemed to require the user to handle the details of glyph positioning with specified horizontal advances, which is surely a very strange requirement. Or maybe I have misunderstood what was going on here.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> Your alternative suggestion using SVG seemed to require the user to handle the details of glyph positioning with specified horizontal advances, which is surely a very strange requirement. Or maybe I have misunderstood what was going on here.

Perhaps so does yours. It isn't clear whether the CSS for .red-text would have to override the default behaviour whereby an inline element like span is rendered by stacking it to the left or right (depending on text directionality) of the previous inline element or text node, or if the accent should go over the e by default.

Briefly testing on a Win2000 box, I found that IE6 ignored the styling on the accent, Mozilla 1.4 didn't show the accent, and Opera 7.2 displayed the red accent (tests had the same results with &#x0301; as with the combining character used directly). It isn't clear to me which, if any, of these are examples of conformant behaviour.

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
On Mon, 8 Dec 2003, Peter Jacobi wrote:

> It would be most interesting if someone can point out a word processor or even a rendering library (shouldn't Pango be the solution to everything?) which enables styling of individual Tamil letters.

I think Pango's attributed string ( http://developer.gnome.org/doc/API/2.0/pango/pango-Text-Attributes.html ) can be used for this. I believe that other layout/rendering libraries such as Uniscribe, ATSUI and the rendering/layout part of ICU have similar data types/APIs.

Jungshik
Re: [OT]
At 12:44 +0000 2003-12-09, [EMAIL PROTECTED] wrote:

> A pint of plain's your only man.

Yes, yes, yes; now will you people start talking about fragile-glass symbols or plate-and-cutlery symbols or something and drag this back into some semblance of topicality? Hm. We have a hot beverage symbol. Maybe we need a pint glass.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> You might as well say that C code is not plain text because it too is subject to special canons of interpretation.

C, C++ and Java source files are not plain text either (they have their own text/* MIME type, which is NOT text/plain, notably because of the rules associated with end-of-lines, notably in the presence of comments).

> But both XML/HTML/SGML and the various programming languages are plain text.

See the text/xml, text/html and text/sgml MIME types. They also aren't text/plain, so they have their own interpretation of Unicode characters which is not the one found in the Unicode standard.
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):

int n = wcslen(L"café");

(That's int n = wcslen(L"caf\u00E9"); for those without HTML email.) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment.)

So, should n equal four or five? The answer would appear to depend on whether the source file was saved in NFC or NFD form. There is more to consider than just how and whether a text editor normalizes.

If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. The question I posed in the previous paragraph should ideally be answerable by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. This implies that such a text editor should display NFD text as separate glyphs for each character. On the other hand, such a text editor must also acknowledge that "é" and "e + U+0301" are actually equivalent. The intention of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything. So in other contexts, it should display them the same. Yuk. That's a lot to think about for anyone considering writing a programmers' text editor with serious Unicode support.

Jill

-----Original Message-----
From: Philippe Verdy [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 09, 2003 2:04 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

> I would not like to use any Unicode plain-text editor that implicitly normalizes the text without asking me, to work on programming source files or XML or HTML files. But I will accept it, if the editor really understands the language or XML syntax (and exhibits it to the user with syntax coloring).
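[Jill's question is easy to make concrete on a platform where wchar_t holds Unicode code points, as her post assumes: the same word saved in the two normalization forms gives two different answers, because wcslen() counts code units, not canonical-equivalence classes.]

#include <cstdio>
#include <cwchar>

int main() {
    const wchar_t* nfc = L"caf\u00E9";  // e-acute precomposed: 4 code points
    const wchar_t* nfd = L"cafe\u0301"; // e + combining acute: 5 code points
    std::printf("NFC: %zu, NFD: %zu\n", std::wcslen(nfc), std::wcslen(nfd));
    return 0;
}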
Re: Ideographic Description Characters
On Dec 8, 2003, at 6:20 PM, Mark Davis wrote:

> John, I don't see why you are saying that it is a 'no-no'. There is no reason that someone couldn't do something like that.

Strictly speaking, it isn't in violation of TUS, which only says (p. 309), "Ideographic Description Sequences are not to be used to provide alternative graphic representations of encoded ideographs." Less formally, however, the discussion in The Book is focused on using them to represent unencoded ideographs, and we have consistently suggested (a) that IDSs should be as short as possible, and (b) that they shouldn't be used at all for encoded ideographs in text exchange. It may be a good idea to update the language in The Book to specifically state that they are also useful for pedagogy and structural analysis of existing ideographs.

John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/
Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
I agree strongly. Reordering of glyphs doesn't affect the ability to maintain styles. Every reasonable package has to retain the mappings back and forth between character and glyph to maintain styles and to map highlighting/mouse clicks/etc. The only issue is for combinations. That is, the character-to-glyph mappings can be arbitrary combinations of the following:

- reordering: easy to retain style
- 1:1 mapping: easy to retain style
- 1:n mapping: also easy to retain style
- n:1 mapping: this is the place where it gets tricky

Any time the n:1 mapping is involved, maintaining styles is difficult. For example, with <sample><red>f</red><green>i</green></sample>, if ligatures are used for fi, then you have some choices: (a) disallow the ligature, (b) color it all one or the other color, (c) if (and that's a big if) your font allows for the production of an fi ligature with two adjacent 'fitting' pieces, essentially contextual forms instead of a ligature, then you can do both the ligature and the color.

Mark

__ http://www.macchiato.com
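[A sketch of Mark's choice (a): refuse the ligature when its components do not share a style, so every glyph carries exactly one style. Hypothetical types; a real shaper makes this decision inside glyph substitution.]

#include <cstdio>
#include <vector>

struct StyledChar { char32_t ch; int style; };
struct Glyph { const char* name; int style; };

// An n:1 mapping (the ligature) is allowed only within a single style run.
static std::vector<Glyph> shape(const std::vector<StyledChar>& s) {
    std::vector<Glyph> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i + 1 < s.size() && s[i].ch == U'f' && s[i + 1].ch == U'i' &&
            s[i].style == s[i + 1].style) {
            out.push_back({"fi_ligature", s[i].style});
            ++i; // consumed two characters
        } else {
            // (This toy handles only 'f' and 'i'.)
            out.push_back({s[i].ch == U'f' ? "f" : "i", s[i].style});
        }
    }
    return out;
}

int main() {
    // red 'f' (style 1) + green 'i' (style 2): the ligature is refused.
    for (const Glyph& g : shape({{U'f', 1}, {U'i', 2}}))
        std::printf("%s (style %d)\n", g.name, g.style);
    return 0;
}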
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> > You might as well say that C code is not plain text because it too is subject to special canons of interpretation.
>
> C, C++ and Java source files are not plain text either (they have their own

C, C++ and Java source files are plain text.

> text/* MIME type, which is NOT text/plain notably because of the rules

I've seen text/cpp and text/java, but really there are no such types. I've also seen text/x-source-code, which is at least legal, if of little value to interoperability. The correct MIME type for C and C++ source files is text/plain. I'd be prepared to give good odds that that is the case with Java source files as well.

> associated with end-of-lines, notably in the presence of comments).

As source files (that is, at the stage in processing at which a human user can see the source and edit it), the only handling required for end-of-lines is conversion of new line function characters, the same as for any other use of plain text. The treatment of end-of-lines as significant when processed (for example following one-line // comments) is a matter of what an application chooses to do with a particular character. This is no different from an indexer deciding that a plain text file contains a particular word, or for that matter my putting coffee filters into my basket if I see "coffee filters" written on my shopping list.

> > But both XML/HTML/SGML and the various programming languages are plain text.
>
> See the text/xml, text/html and text/sgml MIME types. They also aren't text/plain, so they have their own interpretation of Unicode characters which is not the one found in the Unicode standard.

They have their own interpretation of the Unicode characters which is *in addition to* the one found in the Unicode standard. As with all but the simplest applications that use Unicode (as interesting as many of them are, characters are of little use on their own).
RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
At 23:40 -0800 2003-12-08, Peter Constable wrote:

> > > to use the kinds of uppercase glyph models used in similar instances of after-the-fact uppercase inventions based on IPA or other phonetic alphabets and usages.
> >
> > A modified capital P would probably do.
>
> [??!!]
>
> Michael, you've seen what they are using. How will the community be served when type designers start creating fonts that have a cap-height glyph for 0294 supplemented by a modified capital P?

I meant that it would do for the code charts, given Ken's model. It's not a very satisfactory model. I think language-specific font requirements for Latin are generally unsatisfactory, particularly where minorities are concerned. But I'm not in a position to fight this particular battle with Ken at the moment.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Arcane Jill wrote:

> The intention of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything.

The intention of canonical equivalence is that *all* operations that involve interpreting the text treat two canonically equivalent strings the same. This is by no means limited to display.

One of the first things that surprised me when I was first learning about Unicode in 1992 (from the big softcover 1.0 books) was how much attention was paid to processing issues. Topics like bidirectionality, backing store, sorting and searching, and what became known as the character-glyph model were all discussed. It was a real eye-opener for me to see a formal character standard that didn't just treat characters as something to be typed, displayed, and printed.

-Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
From: Philippe Verdy [mailto:[EMAIL PROTECTED]]

> > I see no particular value in this. The font rendering of <span>base diacritic</span> should be exactly the same as that for <span>base</span><span>diacritic</span> provided the font characteristics are the same or do not affect metrics.
>
> This is wrong here: there's no guarantee offered by HTML...

My comment was intended to refer to generic markup, not specifically HTML.

Peter

Peter Constable Globalization Infrastructure and Font Technologies Microsoft Windows Division
Re: [OT]
[EMAIL PROTECTED] wrote:

> Stout was indeed given as a health drink in small doses in certain cases; it's one of the few foods that are a good source of both iron and calcium. However, the only doctor I've heard of recommending it in recent years was...

I know of an (Irish) obstetrician in NYC who recommends it to his patients!

Not to prolong this tangent, but to warn any North Americans who are inspired by it to rush out and buy a bottle of Guinness -- don't bother. It's not real Guinness any more, since about 2 years ago. If you read the fine print, you'll see it's from Toronto. And it is AWFUL -- the consistency of Pepsi and the taste of toxic waste. It staggers the imagination to conceive of how this could happen. Real Irish Guinness was a constant in this world for centuries, and suddenly some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!)

Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-)

- Frank
plain text (was RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]

> XML files most certainly are plain text

XML *can* be interpreted as plain text, or it can be interpreted as something *other* than plain text (i.e. XML). This ambiguity exists for any plain-text-based markup format, such as RTF, PostScript, ...

Perhaps we need some new terminology here. It might be helpful to describe an XML file as a "plain-text-markup" file (PTM, for acronym lovers), but reserve the term "plain text file" for files that contain text with no markup. Note that the terms being defined are "xxx file", not simply "plain text". Thus, John can continue to say that XML is plain text, but in some contexts that wouldn't be as useful as saying XML files are plain-text-markup files.

Peter

Peter Constable Globalization Infrastructure and Font Technologies Microsoft Windows Division
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 07:00, Arcane Jill wrote:

> Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):
>
> int n = wcslen(L"café");
>
> (That's int n = wcslen(L"caf\u00E9"); for those without HTML email.) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment.)
>
> So, should n equal four or five? The answer would appear to depend on whether the source file was saved in NFC or NFD form.

No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. (One can imagine a second parameter specifying whether NFC or NFD is required.) This makes the issue one not for the text editor but for the programming language or its string handling library.

> There is more to consider than just how and whether a text editor normalizes. If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. The question I posed in the previous paragraph should ideally be answerable by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. This implies that such a text editor should display NFD text as separate glyphs for each character. On the other hand, such a text editor must also acknowledge that "é" and "e + U+0301" are actually equivalent. The /intention/ of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything. So in other contexts, it should display them the same.

The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible.

> Yuk. That's a lot to think about for anyone considering writing a programmers' text editor with /serious/ Unicode support.
>
> Jill

Simply allow the text editor to save as either NFC or NFD, and let the programming language sort out the rest.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
Doh! (It was late.)

-----Original Message-----
From: Curtis Clark [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 09, 2003 8:00 AM
To: Peter Constable
Subject: Re: Glottal stops (bis) (was RE: Missing African Latin letters (bis))

on 2003-12-08 23:40 Peter Constable wrote:

> If a band of Rumple-stiltskin Latins

I think you mean Rip van Winkle, but your point is well-made.

-- Curtis Clark http://www.csupomona.edu/~jcclark/ Mockingbird Font Works http://www.mockfont.com/
Re: [OT]
> It staggers the imagination to conceive of how this could happen. Real Irish Guinness was a constant in this world for centuries, and suddenly some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!)

There has always been variation in the way it was brewed internationally. The stuff we have in Ireland would have been too weak to last long in the African heat before refrigeration became so cheap. The sweeter, stronger African variety, as brewed in Nigeria, is now to go on sale here though, as immigrants from Africa are complaining that you can't get a proper Guinness in Ireland.

> Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-)

Bah, if we had any pull we could stop them making it increasingly colder and colder. They've already gone past the stage where you can't taste it (I understand heavily refrigerated beer is an American invention, and given the way American beer tastes this makes sense); soon it'll be served to you on a stick. I can't even remember if this thread was ever on topic. How did we get into this?

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
Overload (was Re: Text Editors and Canonical Equivalence (was Coloured diacritics))
> No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input.

No, that is not a requirement of Unicode conformance.

BTW, I must confess to an inability to keep up with the level of mail on this list. There are so many things in these mails that are simply wrong, and insufficient time for knowledgeable people to correct them. I would just caution people to first consult the materials on the Unicode site (Standard, TRs, FAQs, etc.), and take much of what is on this list with a quite sizable grain of salt.

Mark

__ http://www.macchiato.com
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk scripsit: No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. Not so. Remember, the conformance requirement is not that a process can't distinguish between canonically equivalent strings (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) but that a process can't assume that *other* processes will distinguish between canonically equivalent strings. Equally, it can't assume that the other process will fail to distinguish them, either. In an environment in which C wide characters are Unicode characters, wcslen returns the number of distinct characters in the literal string. How many characters it contains depends on how many were placed in the source file by the author and what, if anything, has happened to the source file since. -- As you read this, I don't want you to feel sorry for me, because, I believe everyone will die someday. -- From a Nigerian-type scam spam I got. John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 09/12/2003 06:36, [EMAIL PROTECTED] wrote: Perhaps so does yours. It isn't clear whether the CSS for .red-text would have to over-ride the default behaviour whereby an inline element like span is rendered by stacking it to the left or right (depending on text directionality) of the previous inline element or text node, or if the accent should go over the e by default. Well, I would put it like this. Consider the following: (1) <span class="black-text">{U+00E9}</span> (2) <span class="black-text">e{U+0301}</span> (3) <span class="black-text">e<span class="black-text">{U+0301}</span></span> (4) <span class="black-text">e<span class="red-text">{U+0301}</span></span> I would expect (1), (2) and (3) to be rendered identically, and (4) to differ only in the colour of the accent, just as it would be (apart from (1)) if U+0301 were replaced by a regular letter. I am assuming nothing special defined in the CSS - the behaviour should be the same with a simple colour attribute. And so I would expect the behaviour of an in-line span element to be subtly different from its normal behaviour when the text starts with a combining mark. I think this is what any naive user would expect in the circumstances, and is also what is sensible. Briefly testing on a Win2000 box I found that IE6 ignored the styling on the accent, Mozilla 1.4 didn't show the accent, and Opera 7.2 displayed the red accent (tests had the same results with &#x0301; as with the combining character used directly). It isn't clear to me which, if any, of these are examples of conformant behaviour. Looking at existing implementations is a very bad guide to what behaviour is actually conformant, sensible, or expected by users. We have four independent variables here! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: plain text (was RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
Peter Constable scripsit: Perhaps we need some new terminology here. It might be helpful to describe an XML file as a plain-text-markup file (PTM, for acronym lovers), but reserve the term "plain text file" for files that contain text with no markup. Note that the terms being defined are "xxx file", not simply "plain text". Thus, John can continue to say that XML is plain text, but in some contexts that wouldn't be as useful as saying XML files are plain-text-markup files. Fair enough, though technically even plain-text files typically mark either line ends or paragraph breaks with markup (= control) characters. -- My corporate data's a mess! / It's all semi-structured, no less. / But I'll be carefree / Using XSLT / In an XML DBMS. John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it) int n = wcslen(L"café"); (That's int n = wcslen(L"café"); for those without HTML email) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment). Even assuming that you can assume that wide characters are Unicode, you have not yet assumed in what kind of UTF they are. (Don't assume I'm deliberately making puns :-) The only thing that the C(++) standards say about type wchar_t is that it is not smaller than type char, so a wide character could well be a byte, and a wide character string could well be UTF-8, or even ASCII. So, should n equal four or five? Why not six? If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6. The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. The answer is: int n = wcslen(L"café"); That's why you take the burden of calling the wcslen library function rather than assuming a hard-coded value such as: int n = 4; // the length of string "café" There is more to consider than just how and whether a text editor normalizes. Whatever the editor does, what if the *compiler* then normalizes it? The source file and the compiled object file are not necessarily in the same encoding and/or normalization. A certain compiler could accept a certain range of input encodings (maybe declared with a command-line parameter) and convert them all into a certain internal representation in the compiled object file (e.g., Unicode expressed in a particular UTF and with a particular normalization). That's why library functions such as strlen or wcslen exist. You don't need to bother with what these functions will return in a particular compiler or environment, as long as the following code is guaranteed to work: const wchar_t * myText = L"café"; wchar_t * myBuffer = malloc(sizeof(wchar_t) * (wcslen(myText) + 1)); if (myBuffer != NULL) { wcscpy(myBuffer, myText); } If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. Again, this is neither possible nor desirable, because a text editor is not supposed to know how the compiler (or its runtime libraries) will transform string literals. The question I posed in the previous paragraph should ideally be obvious by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. Provided that you can define what a character is... After a few years reading this mailing list, I haven't seen a single acceptable definition of "character". Moreover, I have formed the impression that it is totally irrelevant to have such a definition: - as an end user, I am interested in a higher level kind of objects (let's call them "graphemes", i.e. those things I see on the screen and can interact with using my mouse); - as a programmer, I am interested in a lower level kind of objects (let's call them "encoding units", i.e. those things that I count when I have to allocate memory for a string, or the like). The term "character" is in a sort of conceptual limbo which makes it pretty useless for everybody, IMHO. _ Marco
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
I (Marco Cimarosti) wrote: So, should n equal four or five? Why not six? Erratum: "six" should read "seven". If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6. Erratum: "6" should read "7". Sorry. _ Marco
Re: Overload (was Re: Text Editors and Canonical Equivalence (was Coloured diacritics))
On 09/12/2003 10:01, Mark Davis wrote: No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. No, that is not a requirement of Unicode conformance. BTW, I must confess to an inability to keep up with the level of mail on this list. There are so many things in these mails that are simply wrong, and insufficient time for knowledgeable people to correct them. I would just caution people to first consult the materials on the Unicode site (Standard, TRs, FAQs, etc.), and take much of what is on this list with a quite sizable grain of salt. Mark, I understand your problem with the level of mail. But, in this case, I have read the appropriate section of TUS 4.0 and quote it here to prove it, from p.59: C9 A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct. ... Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. ... Perhaps my error is that I have raised (or is it lowered?) "ideally would" to "should". So let me rephrase what I said before: If the wcslen() function is fully Unicode conformant, ideally it would give the same output whatever the canonically equivalent form of its input. Surely that is what C9 is saying. Or is the issue about whether such a function is a "process"? I didn't say that conformance implies that a process should normalise its input (I accept that that is not true), but only that for this particular function, counting the length of a string, sensible results can be given only if the string is normalised, or at least transformed in some other way which removes distinctions between canonically equivalent forms (e.g. normalisation with some kinds of modified data). I am tacitly assuming at this point that the function is part of a general-purpose library for use by users who are not interested in the details of character coding etc. I can see that different considerations may apply for an internal function within a Unicode processing and rendering implementation. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk wrote: So, should n equal four or five? The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. Standards and fantasy are both good things, provided you don't mix them up. wcslen has nothing whatsoever to do with the Unicode standard, but everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop. (One can imagine a second parameter specifying whether NFC or NFD is required.) One can imagine whatever (s)he wants, but should avoid claiming that his/her imagination corresponds to some existing standard. This makes the issue one not for the text editor but for the programming language or its string handling library. This is correct. The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible. Can you please cite the passage where the Unicode standard would not allow this? _ Marco
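To make Marco's point concrete, here is a minimal sketch (an illustration added here, not from the thread; it assumes a compiler where each of these code points occupies a single wchar_t unit, e.g. a 32-bit wchar_t): wcslen simply counts code units, so the four-versus-five question is settled entirely by what was put in the literal, not by any notion of canonical equivalence:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Precomposed spelling: c a f U+00E9 -- four units */
        const wchar_t *nfc = L"caf\u00e9";
        /* Decomposed spelling: c a f e U+0301 -- five units */
        const wchar_t *nfd = L"cafe\u0301";

        printf("NFC: %zu\n", wcslen(nfc)); /* prints 4 */
        printf("NFD: %zu\n", wcslen(nfd)); /* prints 5 */
        return 0;
    }

The two literals are canonically equivalent, yet the counts differ; nothing in the C standard makes wcslen fold them together.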
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 10:16, [EMAIL PROTECTED] wrote: Peter Kirk scripsit: No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. Not so. Remember, the conformance requirement is not that a process can't distinguish between canonically equivalent strings ... Remembered. This is not a conformance requirement, just an "ideally". See C9 and the posting I just made. ... (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) ... Not so. Normalisation is idempotent, i.e. the result of normalising an already normalised string (with the same normalisation form) is identical to that of not normalising it. So the normaliser doesn't need to know in advance if the string is normalised. Now it may be more efficient to test for normalisation first; but the conformance clause says nothing to stop you making implementation shortcuts. ... but that a process can't assume that *other* processes will distinguish between canonically equivalent strings. Equally, it can't assume that the other process will fail to distinguish them, either. In an environment in which C wide characters are Unicode characters, wcslen returns the number of distinct characters in the literal string. How many characters it contains depends on how many were placed in the source file by the author and what, if anything, has happened to the source file since. This implies that wcslen is not doing what C9 says that it "ideally... would" always do. But see the caveats in my other posting. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
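The idempotence Peter describes is easy to demonstrate with a toy composer (again a sketch added for illustration, handling only the thread's single example pair rather than the full Unicode composition data):

    #include <wchar.h>

    /* Toy composer: rewrites e + U+0301 to U+00E9 in place.
       Idempotent: the output never contains the sequence e + U+0301,
       so applying the function a second time changes nothing. */
    static void compose_toy(wchar_t *s)
    {
        size_t r = 0, w = 0;
        while (s[r] != L'\0') {
            if (s[r] == L'e' && s[r + 1] == 0x0301) {
                s[w++] = 0x00E9; /* fold the pair into the precomposed letter */
                r += 2;
            } else {
                s[w++] = s[r++];
            }
        }
        s[w] = L'\0';
    }

So, exactly as stated above, a normaliser need not know in advance whether its input is already normalised.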
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk scripsit: ... (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) ... Not so. Normalisation is idempotent Quite right. I should have said that normalization *checking* would be impossible. -- Only do what only you can do. --Edsger W. Dijkstra, deceased 6 August 2002 John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
RE: [OT]
[...] some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!) Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-) Or simply flush Guinne$$ and drink Murphix. :-) Ciao. Marco
Re: [OT]
At 06:54 AM 12/9/2003, Michael Everson wrote: Hm. We have a hot beverage symbol. Maybe we need a pint glass ... and combining shamrock and harp marks. JH Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] What was venerated as style was nothing more than an imperfection or flaw that revealed the guilty hand. - Orhan Pamuk, _My name is red_
Re: [OT]
At 11:06 -0800 2003-12-09, John Hudson wrote: At 06:54 AM 12/9/2003, Michael Everson wrote: Hm. We have a hot beverage symbol. Maybe we need a pint glass ... and combining shamrock and harp marks. I did get the shamrock in. ;-) -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 10:22, Marco Cimarosti wrote: Peter Kirk wrote: So, should n equal four or five? The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. Standards and fantasy are both good things, provided you don't mix them up. wcslen has nothing whatsoever to do with the Unicode standard, but everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop. OK, as a C function handling wchar_t arrays it is not expected to conform to Unicode. But if it is presented as a function available to users for handling Unicode text, for determining how many characters (as defined by Unicode) are in a string, it should conform to Unicode, including C9. ... The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible. Can you please cite the passage where the Unicode standard would not allow this? TUS 4.0 p.60 (part of C9): Even processes that normally do not distinguish between canonical-equivalent character sequences can have reasonable exception behavior. Some examples of this behavior include ... Show Hidden Text modes that reveal memory representation structure; ... Somewhere else I think there is more detail. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
[EMAIL PROTECTED] writes: You might as well say that C code is not plain text because it too is subject to special canons of interpretation. C, C++ and Java source files are not plain text as well (they have their own text/* MIME type, which is NOT text/plain, notably because of the rules associated with end-of-lines, notably in presence of comments). C, C++ and Java source files are plain text. I've seen text/cpp and text/java, but really there are no such types. I've also seen text/x-source-code which is at least legal, if of little value to interoperability. The correct MIME type for C and C++ source files is text/plain. This is where I disagree: a plain text file makes no difference of interpretation between characters' meta-linguistic meaning for the programming language that uses and needs them, and the same characters used to create string constants or identifier names. Unicode cannot, and must not, specify how the meta-characters used in a programming language must combine with other actual strings that are treated by the language syntax itself as _separate tokens_. This means that the concept of combining sequences MUST NOT be used across language token boundaries. These boundaries are outside the scope of Unicode, but part of the spec for the language, and they must be respected at the first level even before trying to create other combining sequences within the _same_ token. So even if text/c, text/cpp, text/pascal or text/basic are not officially registered (but text/java and text/javascript are registered...) it is important to handle text sources that aren't plain text as another text/* type, for example text/x-other or text/x-source or text/x-c or text/x-cpp. I'd be prepared to give good odds that that is the case with Java source files as well. As I said, text/java is the appropriate MIME type for Java source files. As source files (that is, at the stage in processing at which a human user can see the source and edit it) the only handling required for end-of-lines is conversion of new line function characters, the same as for any other use of plain text. The treatment of end-of-lines as significant when processed (for example following one-line // comments) is a matter of what an application chooses to do with a particular character. This is no different than an indexer deciding that a plain text file contains a particular word, or for that matter my putting coffee filters into my basket if I see coffee filters written on my shopping list. Just imagine what would be created with your assumption with this source: const wchar_t c = L'?'; where ? is a combining character. Using the text/plain content type for this C source would imply that it combines with the previous single quote. This would create an opportunity for canonical composition, and thus would create an equivalent source file which would be: const wchar_t c = L§'; where this § character is a composed character. Now the source file contains a syntax error and does not compile, even though the previous source compiled and gave the c constant the value of the code point coding the ? diacritic... Of course the programmer could avoid this nightmare by using escape sequences as in: const wchar_t c = L'\u0309'; or maybe (but less portable, as it assumes the runtime encoding form used by wchar_t is UCS-4 or UTF-16 or UTF-32, when the source file may be coded in a non-Unicode charset): const wchar_t c = (wchar_t)0x0309ul; But both XML/HTML/SGML and the various programming languages are plain text.
See text/xml, text/html and text/sgml MIME types. They also aren't text/plain so they have their own interpretation of Unicode characters which is not the one found in the Unicode standard. They have their own interpretation of the Unicode characters which is *in addition to* the one found in the Unicode standard. As do all but the simplest applications that use Unicode (as interesting as many of them are, characters are of little use on their own). This is not *in addition* but *instead of*, and thus this breaks the rule of Unicode conformance at that level, as the code point does not match the meaning REQUIRED by conforming applications as being a code point, coding an abstract character with a well-defined representative glyph and REQUIRED composability with surrounding characters. Note that a simple text editor such as NotePad can safely be used to edit source files, simply because it does not attempt to perform any normalization of the loaded or saved files, even when editing (there's not even an edit menu option to normalise any area of the text in the edit buffer). Most editors for programming languages treat individual characters as really individual and completely unrelated to each other. This means that they won't attempt any normalization, so characters will not be reordered, or
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 10:41, [EMAIL PROTECTED] wrote: Peter Kirk scripsit: ... (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) ... Not so. Normalisation is idempotent Quite right. I should have said that normalization *checking* would be impossible. Agreed. C9 clearly specifies that a process cannot assume that another process will give a correct answer to the question "is this string normalised?", because that is to assume that another process will make a distinction between two different but canonically equivalent character sequences. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Glottal stops (bis)
Peter Constable posted: Michael, you've seen what they are using. How will the community be served when type designers start creating fonts that have a cap-height glyph for 0294 supplemented by a modified capital P? The characters can be seen on the web at http://www.wkss.nt.ca/HTML/08_ProjectsReports/PDF/DogribPlaceCaribouHabitat2002.pdf Search on "Small Clear Lake", "Jackfish", "Moosenose", and "glottal stop" in the file for a few of many examples. These characters are obviously being used. See also http://members.tripod.com/~DeneFont/win_char.htm There appear to me to be two possibilities for Unicode: 1. Encode a new character for the lowercase glottal stop and recategorize U+0294 as uppercase. 2. Encode two new characters and leave U+0294 as is. The second suggestion has the advantage that a font designer would be more free than otherwise to render the uppercase glottal stop to match more closely other uppercase characters in a particular lettering style. Jim Allan
RE: unification (CJKV history) ; Alphabetic Aramaic+ ...
I'm working on unification and would like to know more about the earliest CJKV work--was it from the RLG? The history of unification is laid out pretty clearly in Appendix A of TUS. I read a book on computerizing languages by a Sproat from Bell Labs--not as satisfying as I had hoped, although he had the good taste to mention Hebrew accents. Which book? A Computational Theory of Writing Systems or Morphology and Computation? Neither is really related to the topic of Han unification. What exactly are you looking for? Tree
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Just imagine what would be created with your assumption with this source: const wchar_t c = L'?'; where ? is a combining character. The programmer would get bit. At best, there's no reason to assume that every compiler accepts UTF-8, beside the fact that you can't assume that the compiler or any intermediary step doesn't normalize. That's why Unicode escapes exist, and partially why Java as a general rule translates source into a form that uses Unicode escapes for non-ASCII characters. Even if you assume the compiler can accept Unicode text in whatever UTF you choose, it still seems needlessly dangerous to use a bare combining character instead of a Unicode escape or a numeric entity. Despite your distinction, there's no clear line between programming editors and non-programming editors. Any editor that gives you variable names in Hindi or Arabic is likely to have the sophistication needed to combine that ? with that ', and I see no reason they won't; quite possibly, the underlying system won't give them the option to handle Hindi or Arabic without combining that ? with that '. Emacs, for one notorious programming editor, fully plans to have that sophistication.
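The safe practice alluded to above can be sketched in C (assuming a C99 compiler with universal character names and a wchar_t of at least 16 bits; the variable names here are invented for illustration): write the combining mark as an escape, so that no editor, transcoder, or normalising tool ever sees a bare mark sitting next to a quote character:

    #include <wchar.h>

    /* A bare combining mark inside a literal invites recombination by
       any tool that normalizes the file; an escape cannot be touched. */
    const wchar_t combining_acute     = L'\u0301';           /* universal character name */
    const wchar_t combining_acute_num = (wchar_t)0x0301;     /* plain numeric value */

This is the same reasoning the poster attributes to Java's convention of escaping non-ASCII characters in portable source.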
RE: unification (CJKV history) ; Alphabetic Aramaic+ ...
Elaine Keown still in Texas Dear Tom Emerson: The history of unification is laid out pretty clearly in Appendix A of TUS. I hope this is online--and does it go all the way back to the 200 previous suggestions, some from the Chinese Language Computer Society? A Computational Theory of Writing Systems or I'm looking for previous thought on properties of scripts that affect how they are encoded. Sproat did that only a little--that was the disappointment. I think that encoding standards are actually the technical end of what they call sociolinguistics in linguistics departments, plus discussing a script's computational properties. Elaine
XML based mapping files.
Hello, I am trying to implement Unicode Technical Report #22 and I have a few questions about this specification. Since this specification is normative, XML must be the way to go when including local-encoding-to-Unicode mapping files in your application; this requires conversion of existing mapping files to XML form. Have other applications performed this conversion on their mapping files? Is there a tool which could be useful in doing the conversion, or can it only be done manually? ICU has an extensive repository of XML-based mapping files, but does any other reference source for XML-formatted mapping files or the alias table exist, apart from ICU? Thank you in advance. Regards, Shubhagam Gupta [EMAIL PROTECTED]
RE: Qumran Greek
Michael Everson wrote: At 13:34 -0800 2003-12-08, Elaine Keown wrote: I include 2 Qumran symbols that are probably Greek. Obviously it's impossible to tell from two tiny gifs I'm looking for help with the large 'X'. I would guess that the first of your symbols, if Greek, is a PARAGRAPHOS or a FORKED PARAGRAPHOS. It's also used in Coptic. The X looks like a CHI of course. I had the same feeling when I replied to Elaine that this may be an annotation added by a Coptic scribe within the Hebrew text. But it was hard to guess if this was the case. Coptic religious have made extensive studies in Egypt related to ancient texts in Hebrew, and it's quite natural that they may have mixed their own annotations in Coptic into the margins of the original Hebrew texts. It's exactly similar to annotating today a Han text with notes in English. So I'm not sure it needs a specific encoding, as this may just be a shift from one script to another. Elaine could look within her copy of the whole text to see whether there are occurrences other than just single symbols, i.e. added words, in the margins of the text.
RE: Qumran Greek
I have no problem with Qumran scribes being multilingual or using Greek symbols in either Coptic or Hebrew or Aramaic text. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Qumran Greek
At 00:27 +0100 2003-12-10, Philippe Verdy wrote: It's exactly similar to annotating today a Han text with notes in English. So I'm not sure it needs a specific encoding, as this may just be a shift from one script to another. Of course the PARAGRAPHOS characters are to be encoded in the Supplemental Punctuation block where they can be used for many scripts. There's a FORKED PARAGRAPHOS and a REVERSED FORKED PARAGRAPHOS, though, whose names may not be all that good if they can be used in a bidirectional context. Ken? -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Qumran Greek
Elaine in central Texas Hi, I would guess that the first of your symbols, if Greek, is a PARAGRAPHOS or a FORKED PARAGRAPHOS. It's also used in Coptic. Yes, both of those seem to be at Qumran. In Coptic, do you know what period of time they are from? The X looks like a CHI of course. Even though it's sort of curvy and oversized?--though of course you can't tell the size from this. I had been assuming that it was something else, since Emanuel Tov didn't name it as such and he mostly did Septuagint. Elaine
Re: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
On 12/09/03 02:26, Peter Constable wrote: From: [EMAIL PROTECTED] on behalf of Kenneth Whistler Nobody is agitating for an uppercase apostrophe. Not in Canada, that I know of. (I've seen indication of languages in Russia that have a case distinction for ' and possibly also .) Early versions of Volapük used ʻ (U+02BB, I think) for the sound /h/, and specified that the uppercase apostrophe-shape was a boldface one. I can provide a scan, I think, if people think it matters. ~mark
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk peterkirk at qaya dot org wrote: wcslen has nothing whatsoever to do with the Unicode standard, but everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop. OK, as a C function handling wchar_t arrays it is not expected to conform to Unicode. But if it is presented as a function available to users for handling Unicode text, for determining how many characters (as defined by Unicode) are in a string, it should conform to Unicode, including C9. wcslen() is very definitely presented as a function for counting _code_units_. You can't even rely on it to count Unicode characters accurately, if a wchar_t is 16 bits long, because supplementary characters will require 2 code units (a high and a low surrogate). Programmers rely on primitive functions like wcslen() to do what they do very rapidly, and not to change their meaning in new versions of the language standard. It would be very handy to have a suite of C functions that normalize their input string to any of NFK*[CD], or to compare strings or measure their length taking normalization into account, but those would have to be all-new functions. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
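What such an all-new function might look like, as a minimal sketch (the name wcslen_canonical is invented here; it handles only the e + U+0301 pair from the thread's example, where a real version would consult the full Unicode decomposition data):

    #include <wchar.h>

    /* Toy canonically-aware length: counts e + U+0301 as one character,
       so the NFC and NFD spellings of "café" both measure 4. */
    static size_t wcslen_canonical(const wchar_t *s)
    {
        size_t i = 0, n = 0;
        while (s[i] != L'\0') {
            if (s[i] == L'e' && s[i + 1] == 0x0301)
                i += 2; /* base letter plus combining acute */
            else
                i += 1;
            n++;
        }
        return n;
    }

Crucially, this is a new function with new semantics; the meaning of wcslen itself stays fixed, just as Doug argues it must.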
Re: XML based mapping files.
Shubhagam Gupta wrote: I am trying to implement Unicode Technical Report #22 and I have a few questions about this specification. Since this specification is normative, XML must be the way to go when including local-encoding-to-Unicode mapping files in your application; this requires conversion of existing mapping files to XML form. No Unicode Technical Report is normative in the sense that one must follow it in order to conform to the Unicode Standard. (If it were, it would be a Unicode Standard Annex.) UTRs contain additional information on the use of Unicode in certain environments, or guidelines for the use of Unicode with other standards. In particular, UTR #22 does *not* require existing mapping tables to be converted to XML. It provides an appropriate XML-based format for mapping tables, along with other suitable guidelines for things like fallback assignments. But there is no requirement to convert existing tables, and in fact the official tables available on the Unicode FTP site continue to be available in the plain-text Format A. Have other applications performed this conversion on their mapping files? Is there a tool which could be useful in doing the conversion, or can it only be done manually? ICU has an extensive repository of XML-based mapping files, but does any other reference source for XML-formatted mapping files or the alias table exist, apart from ICU? I don't know of any such tools, but there is a possibility that something could be put together using ICU. Indeed, while UTR #22 contains plenty of good material, I tend to think of it as public documentation of a format used by ICU and probably few others. BTW, anyone catch the error in section 4.2.1 of this UTR? -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/