Re: Discrepancy between Names List Code Charts?

2002-08-16 Thread John Hudson

At 06:19 PM 15-08-02, James Kass wrote:

Does anyone know of a writing system which actually uses the
Latin letter t with a bona-fide cedilla?

The newish Gagauz Turkish Latin-script orthography derives from both 
Turkish and Romanian models. This has led to a peculiar hybrid, in which 
the cedilla is used for the s and the commaaccent is used for the t. If the 
Gagauz Turks became interested in stressing their Turkishness, they might 
decide that both s and t should use the cedilla, but I've not seen any 
examples of this yet. I don't know of any other languages for which the 
t-cedilla form might be appropriate, so I've always mapped both U+0163 and 
U+021B to the same t-commaaccent glyph.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

Language must belong to the Other -- to my linguistic community
as a whole -- before it can belong to me, so that the self comes to its
unique articulation in a medium which is always at some level
indifferent to it.  - Terry Eagleton





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread William Overington

Kenneth Whistler wrote as follows about my idea.

 It occurs to me that it is possible to introduce a convention, either as a
 matter included in the Unicode specification, or as just a known about
 thing, that if one has a plain text Unicode file with a file name that has
 some particular extension (any ideas for something like .uof for Unicode
 object file)

...or to pick an extension, more or less at random, say .html

Well, that could produce confusion with a .html file used for Hyper Text
Markup Language, HTML.

I suggested .uof so that a .uof file would be known as being for this
purpose.


 that accompanies another plain text Unicode file which has a
 file name extension such as .txt, or indeed other choices except .uof (or
 whatever is chosen after discussion) then the convention could be that the
 .uof file has on lines of text, in order, the name of the text file then the
 names of the files which contains each object to which a U+FFFC character
 provides the anchor.

 For example, a file with a name such as story7.uof might have the
 following lines of text as its contents.

 story7.txt
 horse.gif
 dog.gif
 painting.jpg

This is a shaggy dog story, right?

No, it is a story about an artist who wanted to paint a picture of a horse
and a picture of a dog and, since he knew that the horse and the dog were
great friends and liked to be together and also that he only had one canvas
upon which to paint, the artist painted a picture of a landscape with the
horse and the dog in the foreground, thereby, as the saying goes, painting
two birds on one canvas, http://www.users.globalnet.co.uk/~ngo/bird0001.htm
in that he achieved two results by one activity.  In addition the picture
has various interesting details in the background, such as a windmill in a
plain (or is that a windmill in a plain text file).  :-)

 The file story7.uof could thus be used with a file named story.txt so as to
 indicate which objects were intended to be used for three uses of U+FFFC in
 the file story7.txt, in the order in which they are to be used.

Or we could go even further, and specify that in the story7.html file,
the three uses of those objects could be introduced with a very specific
syntax that would not only indicate the order that they occur in, but
could indicate the *exact* location one could obtain the objects -- either
on one's own machine or even anywhere around the world via the Internet!
And we could even include a mechanism for specifying the exact size that
the object should be displayed. For example, we could use something like:

<img src="http://www.coteindustries.com/dogs/images/dogs4.jpg" width=380
 height=260 border=1>

or

<img src="http://www.artofeurope.com/velasquez/vel2.jpg">

Now that is a good idea.  In a .uof file specifically for the purpose, a
line beginning with a < character could be used to indicate a web based
reference, or a local reference, for the object, using exactly the same
format as is used in an HTML file.

If the line does not start with a < character, then it is simply a file name
in the same directory as the .uof file, as I suggested originally.  This
would mean that where, say, a .uof file were broadcast upon a telesoftware
service that the Java program (also broadcast) analysing the file names in
the .uof file need not necessarily be able to decode lines starting with a <
character so that the Java program does not need to have the software for
that decoding in it, yet the same .uof file specification could be used,
both in a telesoftware service and on the web, where a more comprehensive
method of referencing objects were needed.
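
For concreteness, a minimal Java sketch of the convention as described so far
might look like the following: the first line of the .uof file names the
accompanying text file, each further line names the object anchored by the
corresponding U+FFFC, and a line beginning with a < character is taken as a
web-style reference.  The file names, class name and output are illustrative
only; nothing here is a defined format.

import java.io.*;
import java.util.*;

public class UofSketch {
    public static void main(String[] args) throws IOException {
        // Read the .uof file: first line = text file, remaining lines = objects.
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("story7.uof"), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        String textFileName = lines.get(0);            // e.g. story7.txt
        List<String> objects = lines.subList(1, lines.size());

        // Read the plain text file and report which object each U+FFFC anchors.
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(textFileName), "UTF-8"))) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        }
        int anchor = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFC' && anchor < objects.size()) {
                String ref = objects.get(anchor++);
                if (ref.startsWith("<")) {
                    System.out.println("Anchor " + anchor + ": web-style reference " + ref);
                } else {
                    System.out.println("Anchor " + anchor + ": local file " + ref);
                }
            }
        }
    }
}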

 I can imagine that such a widely used practice might be helpful in
 bridging the gap between being able to use a plain text file or maybe
 having to use some expensive wordprocessing package.

And maybe someone will write cheaper software -- we could call it a
browser -- that could even be distributed for free, so that people could
make use of this convention for viewing objects correctly distributed with
respect to the text they are embedded in.

Indeed, except not call it a browser as the name is already in widespread
use for HTML browsers and might cause confusion.  Analysing a .uof file
would be a much less computational task than analysing the complete syntax
of HTML files.

Yes, yes, I think this is an idea which could fly.

--Ken


Good.  It is a solution which could be very useful for people writing
programs in Java, Pascal and C and so on which programs take in plain text
files and process them for such purposes as producing a desktop publishing
package.

Hopefully the Unicode Technical Committee will be pleased to add a .uof
format file specification into the set of Unicode documents so that the
U+FFFC code can be used in an effective manner.  The idea could be that if a
.uof file is processed then the rules of .uof files apply in that situation,
so that if a .uof file is not being processed, then the rules for .uof files
do not apply, therefore 

Any day can be April 1st? (was: An idea for keeping U+FFFC usable)

2002-08-16 Thread Michael \(michka\) Kaplan

From: William Overington [EMAIL PROTECTED]

 Could this be discussed at the Unicode Technical Committee
 meeting next week please?

<whoosh>

William,

Please read Ken's message again. He was *talking* about HTML, and pointing
out how all of these things are supported in browsers already.

You will likely be kicking yourself when you see what the message was
actually saying. :-)


MichKa

Michael Kaplan
Trigeminal Software, Inc.  -- http://www.trigeminal.com/





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread Barry Caplan



Yes, yes, I think this is an idea which could fly.

--Ken


Good.  It is a solution which could be very useful for people writing
programs in Java, Pascal and C and so on which programs take in plain text
files and process them for such purposes as producing a desktop publishing
package.


Uhh, I think Ken's message was entirely sarcasm or some higher form of rhetorical 
humor whose obscure name slips my mind right now.

The suggestion to use html as an extension was the giveaway - I was laughing
out loud from that point on - his point was that the technology to do what
you want already exists: it is called HTML, and it is displayed by browsers
and so forth.

Barry Caplan
www.i18n.com





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread James Kass


William Overington wrote,

 
 No, it is a story about an artist who wanted to paint a picture of a horse
 and a picture of a dog and, since he knew that the horse and the dog were
 great friends and liked to be together and also that he only had one canvas
 upon which to paint, the artist painted a picture of a landscape with the
 horse and the dog in the foreground, thereby, as the saying goes, painting
 two birds on one canvas, http://www.users.globalnet.co.uk/~ngo/bird0001.htm
 in that he achieved two results by one activity.  In addition the picture
 has various interesting details in the background, such as a windmill in a
 plain (or is that a windmill in a plain text file).  :-)
 

1)  It's gif file format rather than plain text.*
2)  There isn't any windmill.

Best regards,

James Kass,

* P.S. - But, it's a nice gif file.  In fact, aside from the absence of
the windmill, it exceeded my expectations.  -JK.








Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread Tex Texin

William,

So let me see if I understand this correctly.

Let's take 2 perfectly good standards, Unicode and HTML, and make some
very minor tweaks to them, such as changing the meaning of U+FFFC and a
special format for filenames in the beginning of the file and a new
extension, so we have something new.

Now the big benefit of this completely new thing, is that programs that
do desktop publishing can use plain text files which are not quite plain
text because they have some special formatting, but now they can publish
them in better manner than before. For example, plain text with
pictures. This is great. (It is true that it is less capable than if we
had just used enough html to do the same thing, but .uof is more like
plain text than html is.) Programmers will be happy because now they can
support plain text with just a few tweaks. Oh I almost forgot, they also
have to support Unicode, but slightly tweaked. And they can also support
HTML, with some minor tweaks for .uof. Of course programmers don't mind
supporting lots of variations of the same thing. Customer support
personnel also don't mind.
Oh, the plain text programmers will now need to support pictures and
other aspects of full publishing, but at least they won't have a complex
file format to work with. I guess it doesn't matter that a more complex
format is also more expressive and therefore can leverage all of the
publishing features. It probably doesn't matter that a desktop
publishing product probably already supports more complex formats, and
probably also supports html, it will be beneficial to add this slight
difference from plain text.

I like this very much. It is very much like when the magician slides the
knot in the string and makes it disappear.

I imagine that over time we will have some more wonderful inventions and
add further tweaks and further improve the publishing of plain text.

There are a few other things I would like to improve in Unicode, so I
hope it will be ok to make some other suggestions. We can change the
extension to indicate which tweaks we are talking about: .uo1, .uo2. Just a
few small changes to characters and plain text format variations.
Stability of the meaning of the file isn't important.

However, I think my first suggestion will be to make the benefits of
.uof available to XML. We can call this .uo1.

I am a little disconcerted that html already can do everything that .uof
does plus more, and is also supported by all of the publishers that are
likely to support .uof. Also, as there are more than a million characters
in Unicode, most are unused so far, so changing the meaning of just FFFC
in this one context doesn't seem like a big win, considering also every
line of code that might work with FFFC now needs to consider the context
to determine its semantics.
But every invention deserves to be implemented; we need not look at
whether the invention satisfies some demand of its customers.

I like the 2 birds picture and I assume it was a metaphor for the idea --
one bird was HTML, the other Unicode. I was a little disappointed that
you used html instead of .uof format though. 

Maybe it's the lateness of the hour here. I hope the idea looks as good
in the morning.

Oh I almost forgot. I was having difficulty discerning when you and Ken
might be joking. The mails read as very serious. I would like to suggest we
make a new format .uo2. We can indicate line numbers and emotions with
plain text characters that look like facial expressions. It would help
me know when you both were serious and when you might be joking.
Sometimes it is hard to tell. I am going to create a list of facial
expressions and assign them in the PUA so we can all have a standard to
follow. See my next mail with a list of facial expressions and
assignments.
tex



William Overington wrote:
 
 Kenneth Whistler wrote as follows about my idea.
 
  It occurs to me that it is possible to introduce a convention, either as a
  matter included in the Unicode specification, or as just a known about
  thing, that if one has a plain text Unicode file with a file name that has
  some particular extension (any ideas for something like .uof for Unicode
  object file)
 
 ...or to pick an extension, more or less at random, say .html
 
 Well, that could produce confusion with a .html file used for Hyper Text
 Markup Language, HTML.
 
 I suggested .uof so that a .uof file would be known as being for this
 purpose.
 
 
  that accompanies another plain text Unicode file which has a
  file name extension such as .txt, or indeed other choices except .uof (or
  whatever is chosen after discussion) then the convention could be that the
  .uof file has on lines of text, in order, the name of the text file then the
  names of the files which contains each object to which a U+FFFC character
  provides the anchor.
 
  For example, a file with a name such as story7.uof might have the
  following lines of text as its contents.
 
  story7.txt
  horse.gif
  dog.gif
  painting.jpg
 

Re: Furigana

2002-08-16 Thread Peter_Constable

On 08/14/2002 05:53:58 AM James Kass wrote:

Once a meaning like
INTERLINEAR ANNOTATION ANCHOR has been assigned to
a code point, any application which chooses to use that code
point for any other purpose would be at fault.

Since it's for internal use only, nobody would ever know. Unicode 
conformance must always be understood in terms of what happens externally, 
between two processes, or between a process and a user. What goes on 
inside doesn't matter as long as it is conformant on the outside. If my 
program includes a portion of code that interprets all USVs as jelly-bean 
flavours but doesn't let any symptoms of that leak outside, I haven't 
violated any conformance requirement.



In other words, if these characters are to be used internally for
Japanese Ruby (furigana), etc., then they ought to be able to
be used externally, as well.

They simply aren't adequate for anything more than the simplest of cases. 
Moreover, the recommendations of TR#20 / the W3C character model clearly
indicate that markup is to be preferred for applications like this.



Because it seems to be an oxymoron.

I think most would agree that that's clear now, but it wasn't always 
understood so clearly.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Tildes on vowels

2002-08-16 Thread Peter_Constable

On 08/13/2002 10:08:00 AM William Overington wrote:

I've been ignoring the list for a few days, but come back to find that not 
much has changed.


2) Superscript, subscript, combining above, and other forms of
identifying placement of characters, are better left to markup or other
rendering systems and file formats (and not for a vehicle intended for
plain text.)

Why?  This call for markup seems to be some deeply held belief that is
treated as if it is a law of nature.  So, some people somewhere decided 
to
think in terms of layers, so, that is up to them:  the fact of the matter 
is
that using individual Private Use Area characters for matters which are
otherwise performable by a sequence of characters starting with a 
character used to mean ENTER MARKUP BUBBLE rather than its specified 
meaning
in the Unicode standard is perfectly reasonable.

While you make comments disparaging layers and markup, you don't seem to 
realise that your own solutions are actually equivalent. They simply 
replace the industry-standard and widely-adopted conventions of XML using 
multi-character sequences like <sup>...</sup> with single-character
sequences using non-standard, *private*-use characters. Both solutions 
involve layers; both solutions use markup. The only differences are

- one uses character sequences with start and end delimiters, while the 
other uses single characters with point-like effect (their scope is 
implicitly delimited)

- one is a widely-adopted industry standard that has a large number of 
implementations, and the other is merely a proposal entertained by a few 
individuals. 

I'm sure someone has pointed this out already some while ago. 



I am not knocking markup, I am simply saying that there is a choice of 
ways
to do things and that sometimes a direct Private Use Area encoding is a 
good
choice.

You'd better not be knocking markup since you're simply introducing a 
different markup convention. Please recognise and acknowledge this.




then Stefan's suggested characters might be very useful,
particularly if they happen to be in a part of the Private Use Area not 
used
for anything else 

This sounds to me like complete nonsense! Everyone must assume that the 
*entire* PUA is used for something else by somebody. That's the rules of 
the PUA. Case in point: I have a use of the PUA that involves every single 
PUA codepoint, and it is entirely different from Stefan's suggested 
character and any other character you or anybody else on this list has 
ever (to my recollection) suggested for the PUA. It involves using PUA 
codepoints to stand for rational numbers in the sequence 0.5, 0.25, 
0.125... 2^-125068. Name any PUA codepoint, and I can tell you what it 
represents in this private system of mine. (Valid use? Yes. Good use? 
Perhaps in some specific -- but as yet unidentified -- processing 
contexts, but generally, not really. Worth adopting by others? No.)
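
Purely by way of illustration of the sort of private mapping being described,
a sketch might enumerate the PUA code points in some fixed order and give the
n-th one the value 2^-(n+1).  The enumeration order and the use of all three
PUA ranges below are assumptions of the sketch, not a statement of the actual
private scheme.

import java.math.BigDecimal;

public class PuaRationals {
    // BMP PUA, plane 15 PUA and plane 16 PUA, taken in code point order
    // (an assumption made only for this sketch).
    static final int[][] RANGES = {
        {0xE000, 0xF8FF}, {0xF0000, 0xFFFFD}, {0x100000, 0x10FFFD}
    };

    // Returns 2^-(n+1), where n is the zero-based index of cp in the ranges
    // above, or null if cp is not a PUA code point.
    static BigDecimal valueOf(int cp) {
        int index = 0;
        for (int[] r : RANGES) {
            if (cp >= r[0] && cp <= r[1]) {
                index += cp - r[0];
                return BigDecimal.ONE.divide(new BigDecimal(2).pow(index + 1));
            }
            index += r[1] - r[0] + 1;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(valueOf(0xE000));  // 0.5
        System.out.println(valueOf(0xE002));  // 0.125
    }
}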

Or perhaps you mean, in a part of the PUA *I* haven't yet used. If 
you're meaning your own use of the PUA, then please say so, and don't 
speak in general terms that sound like there's one common use for the PUA.



That is true, yet I was not suggesting that.  I am suggesting that within 
a
specialised area of activity, namely transcribing documents and sharing 
the
transcriptions with others who are aware of the technique being used, 
that
such a Private Use Area usage could be of value.

That's valid. But the discussion of specific uses of the PUA for those 
purposes really should be addressed specifically to a group of people that 
have such a need and wish to use a common convention so that they can 
interchange data amongst themselves. 



In short, the proposals do not solve existing problems(1,2,3), conflict
with the current architecture (4,5), have problems themselves (5) and so
are not enticing.

Well, perhaps this needs to be reconsidered in the light of the above
comments.

Reconsidering, the proposals are valid within a *private* group of users 
needing such a solution and needing to interchange data amongst 
themselves. If you try to expand the target group of users beyond that, 
then it is not good for the reasons that were presented. So, both points 
of view are valid in relation to different contexts (one specific, the 
other more general) -- and only those contexts.



Indeed, in relation to the declared aims of this mailing list, I feel 
that
discussion of Private Use Area uses in this list is directly on-topic.

The only problem is that when you talk about the PUA, you tend to express 
things in a way that makes it sound to others as though you mean for those 
proposed uses to apply to a wide group of users. Perhaps that's not what 
you intend, but I believe that's the way many perceive it. The very fact 
that you offer to the list to assign PUA characters and publish details 
when others suggest some idea for a private character contributes to this: 
such an offer isn't necessary since not everyone on this list is 

Re: Double Macrons on gh (was Re: Tildes on Vowels)

2002-08-16 Thread Peter_Constable

On 08/14/2002 02:36:37 PM William Overington wrote:

 U+0360 COMBINING DOUBLE TILDE

 U+035D COMBINING DOUBLE BREVE
 U+035E COMBINING DOUBLE MACRON
 U+035F COMBINING DOUBLE LOW LINE

I also note U+0361 COMBINING DOUBLE INVERTED BREVE and U+0362 COMBINING
DOUBLE RIGHTWARDS ARROW BELOW in the code chart.

I wonder if someone could please clarify how an advanced format font 
would
be expected to use such codes.

In a dumb font, support for these characters can be implemented by having
a glyph that has zero advance width, with the outline extending beyond 
both side-bearings. 

In a smart font, one could position the glyph for one of these combining 
marks using attachment points (i.e. the outline of the glyph for the base 
character includes a target point, and the outline for the combining mark
includes a specific point that the layout engine aligns over the target 
point), or one could look for certain base + combining mark combinations 
and substitute the sequence of glyphs for a single composite glyph.

The latter approach has limitations in that you have to choose ahead of 
time exactly which combinations you will support, and there can only be a 
limited number of such combinations. Attachment points, in general, have 
the advantage that they can be designed to work with arbitrary 
combinations -- any possible combination. With the double-width combining 
marks, though, things are rather trickier. First, you may need to 
substitute a variant glyph for the combining mark that has a width to 
match the particular pair of base characters -- potentially quite messy; 
and then you have to deal with positioning in relation to two base 
characters at once, which has additional complexity. For instance, when 
positioning a double macron over (say) "la", you need to adjust the height
to the taller of the two glyphs; but you need to make the same adjustment
for "al". One of my co-workers implemented such behaviour in a font using
Graphite a couple of years ago; my recollection is that there isn't an 
easy way to accomplish this with OT, but I haven't worked with OT enough 
to know for sure.
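
As a very rough sketch of the precomposed-lookup approach (not of any
particular shaping engine), the following scans a character sequence for
base + double diacritic + base and substitutes a single composite glyph when
the combination is in a table prepared ahead of time, falling back to plain
glyphs otherwise.  The glyph names and table contents are hypothetical; real
engines such as OpenType or Graphite work on glyph IDs and font tables, not
strings.

import java.util.*;

public class DoubleMarkSubstitution {
    // Combinations the (hypothetical) font designer chose to support.
    static final Map<String, String> COMPOSITES = new HashMap<>();
    static {
        // base + COMBINING DOUBLE MACRON (U+035E) + base
        COMPOSITES.put("g\u035Eh", "g_h.doublemacron");
        // base + COMBINING DOUBLE TILDE (U+0360) + base
        COMPOSITES.put("a\u0360e", "a_e.doubletilde");
    }

    static List<String> shape(String text) {
        List<String> glyphs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (i + 2 < text.length()) {
                String composite = COMPOSITES.get(text.substring(i, i + 3));
                if (composite != null) {
                    glyphs.add(composite);                  // known combination
                    i += 3;
                    continue;
                }
            }
            glyphs.add(String.valueOf(text.charAt(i)));     // fallback: plain glyph
            i++;
        }
        return glyphs;
    }

    public static void main(String[] args) {
        // first letter, then combining double macron, then second letter
        System.out.println(shape("g\u035Ehost"));  // [g_h.doublemacron, o, s, t]
    }
}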



I understand from an earlier posting in this thread that the format to 
use
in a Unicode plain text file would be as follows.

first letter then combining double accent then second letter

Yes.


As first letter and second letter could be theoretically almost any other
Unicode characters, would the approach be to just place all three glyphs
superimposed onto the screen and hope that the visual effect is 
reasonable

That's one possibility, what I would refer to as the dumb rendering 
implementation.


or would a font have a special glyph within it for each of the 
permutations
of three characters which the font designer thought might reasonably 
occur
yet default to a superimposing of three glyphs for any unexpected
permutation which arises?

This is a possible implementation in a smart-font rendering context.



As a matter of interest, how many characters are there where such double
accents are likely to be used please?  Is it just a few or lots?

This really isn't easy to answer. Someone could tell you "these 29
combinations...", but they might not -- probably do not -- know what
every user in the world might ever have needed or will ever need.



While in this general area, could someone possibly say something about 
how
and why U+034F COMBINING GRAPHEME JOINER is used please?

Please read the relevant portions of the standard (see section 13.2 in
clause IV of TR#28), and then come back with questions for clarification, 
if needed.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





RE: Furigana

2002-08-16 Thread Peter_Constable

On 08/14/2002 10:52:32 AM Michael Everson wrote:

I'm saying I WANT to use these characters. They solve an apparent
need of mine

They only *appear* to you to solve that need, but in fact do not offer a 
good solution. Markup is recommended for your need.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread Peter_Constable

On 08/14/2002 02:04:50 PM William Overington wrote:

As this concerns the U+FFFC character and the Unicode Technical Committee 
is
due to meet next week, I think it might be helpful if this idea is 
discussed
before the meeting as a straightforward idea like this might mean that 
the
possibility to exchange U+FFFC characters at all if people want to do so 
is
not lost.

This does not solve any problems not already solved. This is not plain 
text; it is a form of interchange markup and a higher-level protocol. 
There are already higher-level markup protocols that accomplish this. The 
standard already specifies that FFFC should not be exported from an 
application or interchanged. There is no reason to change this.


Everybody will welcome the new conventional, graphical-type characters
and scripts that are coming with Unicode 4.0.

What are those please?

See the Proposed characters section of the Unicode site.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Double Macrons on gh (was Re: Tildes on Vowels)

2002-08-16 Thread Peter_Constable

On 08/14/2002 04:34:27 PM Doug Ewell wrote:

Broad ranges of Planes 0 and 1 have been tentatively blocked out on the
Roadmap for RTL scripts. 

Oh? I was somewhat sharply rebuked a few years ago for suggesting that such a
thing be done. References to relevant documentation, please?


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: RE: Furigana

2002-08-16 Thread Peter_Constable

On 08/14/2002 01:16:29 AM starner wrote:

That seems to be basically what William Overington is proposing,
except these characters only handle furigana, instead all markup.

Not quite. WO has proposed characters to be used in interchange. These are 
only intended for internal use by programmers. They are exactly like the 
non-characters at FDD0..FDEF except that these were named to a specific 
function (as was FFFC -- also an internal-use code with a 
specifically-named function).



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: Double Macrons on gh (was Re: Tildes on Vowels)

2002-08-16 Thread Michael Everson

At 09:38 +0100 2002-08-16, [EMAIL PROTECTED] wrote:
On 08/14/2002 04:34:27 PM Doug Ewell wrote:

Broad ranges of Planes 0 and 1 have been tentatively blocked out on the
Roadmap for RTL scripts.

Oh? I was somewhat sharply rebuked a few years for suggesting that such a
thing be done. References to relevant documentation, please?

We kept like with like in the Roadmap. Nobody rebuked us.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Gutenberg's ligatures (spins off from Re: Tildes on vowels)

2002-08-16 Thread James Kass


Michael Everson wrote,

 Appropriate font technology for Latin ligature display exists,
 but it isn't enabled yet in Microsoft's Uniscribe.*
 
 That doesn't mean that this particular cataloguing of ligatures in 
 the PUA is a good idea.
 
 The Golden Ligatures Collection simply offers font developers
 and end users an opportunity to make use of some rather
 interesting ligatures in a consistent, although non-standard,
 fashion.
 
 That doesn't make it a good idea.


From the Adobe Glyph List at
http://partners.adobe.com/asn/developer/type/glyphlist.txt

<quote>
#   1.0  [17 Jul 1997]  Original version
#
0041;A;LATIN CAPITAL LETTER A
00C6;AE;LATIN CAPITAL LETTER AE
01FC;AEacute;LATIN CAPITAL LETTER AE WITH ACUTE
F7E6;AEsmall;LATIN SMALL CAPITAL LETTER AE
00C1;Aacute;LATIN CAPITAL LETTER A WITH ACUTE
F7E1;Aacutesmall;LATIN SMALL CAPITAL LETTER A WITH ACUTE
...
</quote>

Small caps get assigned in the PUA in published lists; why not other
presentation forms, too?

Plenty of precedent exists.  This may not be a good idea from an
encoding standpoint, but right now this is a display issue.
OpenType technology should eventually enable variants to display
even when correct text encoding is used, but it doesn't work yet.

The Cardo font has presentation forms in the PUA area and so do
the Junicode and Code2000 fonts.  Lots of fonts do.  As a font
designer, you probably can understand a desire to be able to display 
a glyph once it is drawn.  If a designer puts a glyph in a font 
without providing a user with any way to display the glyph;
the designer might as well not have troubled.

Best regards,

James Kass.






Re: New version of TR29:

2002-08-16 Thread Samphan Raruenrom

Mark Davis wrote:
 There is a new version of Unicode Technical Report #29: Text Boundaries on
 http://www.unicode.org/reports/tr29/, covering grapheme-cluster, word and
 sentence boundaries. There are significant modifications to this version;
 for a summary, see http://www.unicode.org/reports/tr29/#Modifications.
 This is a draft version, not a final version. There are a number of open
 issues remaining. Feedback is welcome
 Feedback that is received before the UTC meeting (starting August 20) can be
 made available for the discussion of TR29 at that meeting.

FYI:
There is an open issue regarding grapheme-cluster boundaries in Thai.

* SARA AM as an Other_Grapheme_Extend?

The question is whether 0E33;THAI CHARACTER SARA AM should be a
GraphemeExtend character or not.

By Unicode definition, SARA AM is an Lo, not a combining
character. But many Thai applications (MS Office/ Windows/
OpenOffice.org) treat SARA AM like a combining character (unlike SARA
AA), i.e. the cursor always jumps over it. Whether this is right or not is
controversial, but the fact is that Windows users are used to it.

My personal question is: if it is favorable for Thai to treat
SARA AM as part of the previous grapheme cluster, is it possible for
the UTC to consider adding SARA AM as an Other_Grapheme_Extend?
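
For illustration only, a minimal sketch of the cursor behaviour described
above might look like this in Java; a real implementation would consult the
full Grapheme_Extend / Other_Grapheme_Extend data of TR29 rather than
special-casing one character.

public class ThaiCursor {
    // Assumption of this sketch: treat SARA AM as extending the previous cluster.
    static boolean extendsCluster(char c) {
        return c == '\u0E33';   // THAI CHARACTER SARA AM
    }

    static int nextBoundary(String text, int pos) {
        if (pos >= text.length()) return pos;
        pos++;
        while (pos < text.length() && extendsCluster(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String word = "\u0E19\u0E33";                // NO NU + SARA AM
        System.out.println(nextBoundary(word, 0));   // 2: the cursor jumps over SARA AM
    }
}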


---
I also notice that Grapheme_Link is removed from the grapheme-cluster
definition. This is appropriate for Thai because PHINTHU should not
cause two grapheme clusters to be linked together.

-- 
Feel free to disclose the contents of this message.

Regards,
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html





Re: Furigana

2002-08-16 Thread Tex Texin



[EMAIL PROTECTED] wrote:
 
 On 08/14/2002 12:45:22 AM Kenneth Whistler wrote:
 
 But even at the time, as the record of the deliberations would
 show, if we had a more perfect record, the proponents were clear
 that the interlinear annotation characters were to solve an
 internal anchor point representation problem.
 
 I recall at the UTC meeting in Jan 2000 (I think it was 2000) there was
 discussion of adding non-character code points for internal use by
 programmers, and I remember Tex suggesting that it might be better to
 identify the specific functions for which internal-use codepoints might be
 needed, as had been done in the case of things like the IA characters. In
 other words, at that time, it seems that they were understood by everyone
 present to be intended for internal use by programmers only.

Peter's made the point that "for internal use" was understood, which is
fine.

Let me add that my concern with internal-use code points not having
specific functions is that we now live in a world where software
applications often use third-party components (various drivers, shared
libraries, OCXs, DLLs, etc.) internally. Internal-use code points may
not be treated with the right semantics by third-party components that
have been integrated internally, which is problematic. You should be
careful to avoid passing these internal-use code points to third
parties, but this greatly inhibits their use, or makes for an awkward
and not easily extensible architecture.

At the time (in the discussion), I don't think we had many examples of
what the uses would be, and it wasn't clear that many were needed, since
the functionality could be arrived at with higher level protocols.

So to be clear, when internal-use code points are used, not only do they
need to be filtered from external exchanges, you need to be very clear
about your internal architecture and make sure you don't call a system
function or third-party function that might mistreat the internal-use code
point or, worse, barf at it.
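
As a small illustration of that precaution, a sketch that strips noncharacter
code points from a string before handing it to a third-party component might
look like this; which code points count as internal-use is, of course,
application-specific.

public class InternalUseFilter {
    // Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    static boolean isInternalUse(int cp) {
        if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
        int low = cp & 0xFFFF;
        return low == 0xFFFE || low == 0xFFFF;
    }

    static String stripInternalUse(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isInternalUse(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String internal = "Temp\uFDD0erature";           // U+FDD0 used internally
        System.out.println(stripInternalUse(internal));  // Temperature
    }
}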

(Anyway, I think that's what I was thinking at the time. I have trouble
remembering what I said yesterday much less the last millennium.)
tex

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)

2002-08-16 Thread William Overington

Kenneth Whistler replied to my posting as follows.

 An interesting point for consideration is as to whether the following
 sequence is permitted in interchanged documents.

 U+FFF9 U+FFFC U+FFFA Temperature variation with time. U+FFFB

 That is, the annotated text is an object replacement character and the
 annotation is a caption for a graphic.

Yes, permitted.

Great.  That may well be useful for free to the end user distance education
using telesoftware upon digital television channels.  A .uof file (as in the
thread An idea for keeping U+FFFC usable. ) could be used with a Unicode
plain text file of some learning material over the broadcast link and a Java
program (also broadcast) could place the pictures with their captions in the
correct place in the text.

As would also be:

U+FFF9 U+FFFC U+FFFC U+FFFA U+FFF9 Temperature U+FFFA a measure of
hotness, related to the U+FFF9 kinetic energy U+FFFA energy of motion
U+FFFB
of molecules of a substance U+FFFB U+FFF9 variation U+FFFA rate of change
U+FFFB with time U+FFFC . U+FFFB

Where the first U+FFFC is associated with a URL with a realtime data feed,
the second U+FFFC is a jar file for a 3-dimensional dynamic display
algorithm,
and the third U+FFFC is a banner ad for Swatch watches.


Thank you for this example.  I have analysed it thoroughly using Notepad by
going to a new line and indenting at each occurrence of U+FFF9 and going to
a new line and indenting at each occurrence of U+FFFA, and going to a new
line and placing each U+FFFB beneath the corresponding U+FFFA.  For each
U+FFFC I went to a new line, and placed the U+FFFC beneath the most recent
U+FFF9 or U+FFFA character.

In addition, after each U+FFF. character, for ordinary text, I went to a
new line and indented so that the next ordinary text character was beneath
the U of the most recently entered U+FFF. character, except that after a
U+FFFB the indentation went back two indentation levels.

After each U+FFFC character, and on the same line, I added the details of
the object within parentheses.

This gave the following.

U+FFF9
U+FFFC (URL with a realtime data feed)
U+FFFC (jar file for a 3-dimensional dynamic display algorithm)
U+FFFA
U+FFF9
Temperature
U+FFFA
a measure of hotness, related to the
U+FFF9
kinetic energy
U+FFFA
energy of motion
U+FFFB
of molecules of a substance
U+FFFB
U+FFF9
variation
U+FFFA
rate of change
U+FFFB
with time
U+FFFC (banner ad for Swatch watches)
.
U+FFFB

This took me quite some time to figure out, and was indeed an interesting
challenge.
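
For anyone wanting to repeat the exercise, the layout can be roughly
mechanised.  The following Java sketch starts a new indented line at each
U+FFF9 and U+FFFA and steps back out at each U+FFFB; it only approximates the
hand-made layout above and is meant as a reading aid, not a rendering
algorithm.

public class AnnotationPrinter {
    public static void print(String text) {
        int depth = 0;
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\uFFF9':                       // annotation anchor
                    flush(line, depth);
                    emit("U+FFF9", depth);
                    depth++;
                    break;
                case '\uFFFA':                       // annotation separator
                    flush(line, depth);
                    emit("U+FFFA", depth - 1);
                    break;
                case '\uFFFB':                       // annotation terminator
                    flush(line, depth);
                    depth--;
                    emit("U+FFFB", depth);
                    break;
                case '\uFFFC':                       // object replacement
                    flush(line, depth);
                    emit("U+FFFC (object)", depth);
                    break;
                default:
                    line.append(c);
            }
        }
        flush(line, depth);
    }

    static void flush(StringBuilder line, int depth) {
        if (line.toString().trim().length() > 0) {
            emit(line.toString().trim(), depth);
        }
        line.setLength(0);
    }

    static void emit(String s, int depth) {
        for (int d = 0; d < depth; d++) System.out.print("    ");
        System.out.println(s);
    }

    public static void main(String[] args) {
        print("\uFFF9Temperature\uFFFAa measure of hotness\uFFFB");
    }
}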

 It seems to me that if that is indeed permissible that it could
potentially
 be a useful facility.

I was referring to my original example, not to your example!  :-)


Permissible does not imply useful, however, in this case.

That's referring to your example when you refer to "this case", is it?  :-)

It is
unlikely that you are going to have access to software that would
unscramble such layering in purported plain text, even if you
had agreements with your receivers.

Hmm?  Yet, it is not the example to which I referred.  The example to which
I referred has not been commented upon as to its practical feasibility has
it?

However, is your example that difficult if someone set his or her mind to
it?  Consider for example that the software which does the unscrambling were
to have its own internal list of annotation facilitating characters so that
it assigned, for each page of the final rendered text, the characters in the
list of annotation facilitating characters in order for each U+FFF9 U+FFFA
pairing wherever the U+FFF9 item to be annotated were other than just one or
more U+FFFC characters.  The list of annotation facilitating characters
could be something like U+002A, U+2020, U+2021, U+2051, that is, asterisk,
dagger, double dagger, two asterisks aligned vertically.  The annotation
facilitating character is then placed both after the annotated item and
before the annotation, wherever that may be on the page, such as in a
footnote.  I am not suggesting that an algorithm for such is quickly
programmable, yet it does not seem on the face of it to be as unlikely to be
possible as your comment might perhaps seem to imply.
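
A rough Java sketch of that marker-assignment idea, for non-nested
annotations only, might look like the following; the marker list follows the
one suggested above, and the input text and everything else is purely
illustrative.

import java.util.*;

public class FootnoteMarkers {
    // asterisk, dagger, double dagger, two asterisks aligned vertically
    static final String[] MARKERS = {"\u002A", "\u2020", "\u2021", "\u2051"};

    public static void main(String[] args) {
        String text = "The \uFFF9temperature\uFFFAa measure of hotness\uFFFB rose.";
        StringBuilder body = new StringBuilder();
        List<String> footnotes = new ArrayList<>();
        int next = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\uFFF9') {
                int sep = text.indexOf('\uFFFA', i);
                int end = text.indexOf('\uFFFB', sep);
                String item = text.substring(i + 1, sep);
                String note = text.substring(sep + 1, end);
                if (item.replace("\uFFFC", "").isEmpty()) {
                    body.append(item);    // item is just object(s): no marker here
                } else {
                    String marker = MARKERS[next % MARKERS.length];
                    next++;
                    body.append(item).append(marker);
                    footnotes.add(marker + " " + note);
                }
                i = end + 1;
            } else {
                body.append(c);
                i++;
            }
        }
        System.out.println(body);                          // The temperature* rose.
        for (String f : footnotes) System.out.println(f);  // * a measure of hotness
    }
}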

That is what markup and rich text formats are for.

Well, maybe for your example, yet for my example a plain text file for the
main text together with a .uof file to state 

Re: Discrepancy between Names List Code Charts?

2002-08-16 Thread John Cowan

John Hudson scripsit:

 The newish Gagauz Turkish Latin-script orthography derives from both 
 Turkish and Romanian models. This has led to a peculiar hybrid, in which 
 the cedilla is used for the s and the commaaccent is used for the t.  

ME's remarks in _The Alphabets of Europe_ seem downright bizarre to me:

# Note that in
# Romania, Gagauz uses the characters S WITH COMMA BELOW and T WITH COMMA BELOW.
# In inferior Gagauz typography, the glyphs for these characters are sometimes
# drawn with CEDILLAs, but it is strongly recommended to avoid this practice.
# However, because Gagauz is a Turkic language, it may be left to the user to
# decide whether S WITH COMMA BELOW (as in Romanian) or S WITH CEDILLA (as in
# Turkish) is preferred.

It seems that the last two sentences say that it may be left to the user
to decide whether inferior or superior typography is preferred.

-- 
De plichten van een docent zijn divers, John Cowan
die van het gehoor ook. [EMAIL PROTECTED]
  --Edsger Dijkstra http://www.ccil.org/~cowan




Re: Furigana

2002-08-16 Thread John Cowan

Tex Texin scripsit:

 At the time (in the discussion), I don't think we had many examples of
 what the uses would be, and it wan't clear that many were needed, since
 the functionality could be arrived at with higher level protocols.

One application that has always seemed obvious to me is regular expressions:
a compiled regular expression can be represented by a Unicode string,
with non-characters representing things like any character, zero or more,
one or more, beginning of string, end of string, etc. etc.
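
A small Java sketch of that idea might look like this; the particular
noncharacter assignments (U+FDD0 for "any character" and so on) are invented
for the illustration, and the point is only that such a compiled pattern
stays internal and is never interchanged.

public class CompiledRegex {
    static final char ANY   = '\uFDD0';  // any character
    static final char STAR  = '\uFDD1';  // zero or more of the preceding
    static final char BEGIN = '\uFDD2';  // beginning of string
    static final char END   = '\uFDD3';  // end of string

    // Decode a compiled pattern back to a conventional notation, for display.
    static String describe(String compiled) {
        StringBuilder out = new StringBuilder();
        for (char c : compiled.toCharArray()) {
            switch (c) {
                case ANY:   out.append('.');  break;
                case STAR:  out.append('*');  break;
                case BEGIN: out.append('^');  break;
                case END:   out.append('$');  break;
                default:    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "begins with 'ab', then anything, then ends with 'c'"
        String compiled = "" + BEGIN + "ab" + ANY + STAR + 'c' + END;
        System.out.println(describe(compiled));  // ^ab.*c$
    }
}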

-- 
John Cowan   [EMAIL PROTECTED]   http://www.ccil.org/~cowan
One time I called in to the central system and started working on a big
thick 'sed' and 'awk' heavy duty data bashing script.  One of the geologists
came by, looked over my shoulder and said 'Oh, that happens to me too.
Try hanging up and phoning in again.'  --Beverly Erlebacher




RE: OCR characters

2002-08-16 Thread Winkler, Arnold F

I believe Eric is talking about the characters on the attached page 8 of
the OCR standard.

Regards
Arnold

 -Original Message-
 From: Eric Muller [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, August 15, 2002 7:44 PM
 To: [EMAIL PROTECTED]
 Subject: OCR characters
 
 
 In our OCR fonts, we have two glyphs named erase (looks like a black
 square) and grouperase (looks like a long dash). I don't have a copy
 of the OCR standards, but I suspect those are mandated by these
 standards. On the other hand, I can't find traces of those in
 Unicode, so I suspect they have been unified. But with which
 characters?
 More generally, are there other things like that we should be aware of?
 
 Thanks,
 Eric.
 
 
 




Page-8-OCR-B.pdf
Description: Binary data


Re: Discrepancy between Names List Code Charts?

2002-08-16 Thread James E. Agenbroad

On Fri, 16 Aug 2002, John Cowan wrote:

 John Hudson scripsit:
 
  The newish Gagauz Turkish Latin-script orthography derives from both 
  Turkish and Romanian models. This has led to a peculiar hybrid, in which 
  the cedilla is used for the s and the commaaccent is used for the t.  
 
 ME's remarks in _The Alphabets of Europe_ seem downright bizarre to me:
 
 # Note that in
 # Romania, Gagauz uses the characters S WITH COMMA BELOW and T WITH COMMA BELOW.
 # In inferior Gagauz typography, the glyphs for these characters are sometimes
 # drawn with CEDILLAs, but it is strongly recommended to avoid this practice.
 # However, because Gagauz is a Turkic language, it may be left to the user to
 # decide whether S WITH COMMA BELOW (as in Romanian) or S WITH CEDILLA (as in
 # Turkish) is preferred.
 
 It seems that the last two sentences say that it may be left to the user
 to decide whether inferior or superior typography is preferred.
 
 -- 
 De plichten van een docent zijn divers, John Cowan
 die van het gehoor ook. [EMAIL PROTECTED]
   --Edsger Dijkstra http://www.ccil.org/~cowan
 
 
   Friday, August 16, 2002
If fools such as I who know no Gagauz may rush in:  It seems to me that
reading is a learned habit.  When different people learned to read Gagauz
they may have learned to expect different forms of glyphs because that's
what they were taught. Assuming teaching different conventions isn't based
on an evil intent to pervert the minds of children, differing conventions
are not bad, only different. It may be that such different conventions
will gradually evolve to one but I think Unicode would be wise to avoid
attempting to impose standards on how written text appears and should
instead aim to facilitate presentation of text legible to the conventions
of current readers.  
 We all live with two forms of lower case t (with and without the
curved bottom) and lower case g (with and without the closed descender). 
It's possible these different conventions will disappear but until they do
some will want one and some will want the other and I would hope Unicode
could permit rendering software to provide either. 
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





some cedillas

2002-08-16 Thread Michael Everson

The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla 
in transcriptions of Yemen placenames.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: OCR characters

2002-08-16 Thread Otto Stolz

Eric Muller had written:

 In our OCR fonts, we have two glyphs named erase [...]

 and grouperase [...] I suspect those are mandated by these 
 standards. On the other hand, and I can't find traces of those in 
 Unicode,


Arnold F. Winkler wrote:
  I believe, Eric is talking about the characters on the attached page 8 of
  the OCR standard.

I don't have ISO 1073 at hand, only the German
- DIN 66 008 (Jan 1978), which is essentially identical with ISO 1073/I-1976,
  and
- DIN 66 009 (Sept. 1977), which is based on, but not identical with,
  ISO 1073/II-1976.

DIN 66 008 contains the figure reported by Arnold Winkler. This standard
does not specify the intended usage of these characters -- not beyond their
expressive names.

DIN 66 009 says about the equivalent OCR-B characters (my translation):
  In case of a typo, a keyboard-driven device will print the Character Erase
  on top of an erroneous character. This will cause the OCR reading device
  to ignore this position.
  The Group Erase may be either drawn by hand, or printed as discussed in
  the previous paragraph. It will cause the OCR reading device to ignore
  this position.

So, these characters would never be read by an OCR device. They would be
printed only in response to a function key (such as Erase Backwards), but
never sent (encoded as characters) to a device. This means that they will
not normally be encoded, hence there will probably be no need to assign
Unicodes to them.

The only exception could be a text discussing these characters, and
their usage. I think this sort of text would use figures rather than
characters, to show the effect of overprinting in several variants.
(The Erase, and the erased, character's positions may slightly differ.)

So I guess these characters are deliberately left out of Unicode.

Best wishes,
   Otto Stolz





Re: some cedillas

2002-08-16 Thread John Cowan

Michael Everson scripsit:

 The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla 
 in transcriptions of Yemen placenames.

But is it correct?  The National Geographic map on my wall uses s-cedilla
in Romanian place names, and that's definitely wrong.

-- 
Knowledge studies others / Wisdom is self-known;  John Cowan
Muscle masters brothers / Self-mastery is bone;   [EMAIL PROTECTED]
Content need never borrow / Ambition wanders blind;   www.ccil.org/~cowan
Vitality cleaves to the marrow / Leaving death behind.--Tao 33 (Bynner)




Re: some cedillas

2002-08-16 Thread Michael Everson

At 10:58 -0400 2002-08-16, John Cowan wrote:
Michael Everson scripsit:

  The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla
  in transcriptions of Yemen placenames.

But is it correct?  The National Geographic map on my wall uses s-cedilla
in Romanian place names, and that's definitely wrong.

The Times Atlas does use t-comma-below with Romanian placenames. 
Whether Times practice is correct for transliterating Arabic I 
couldn't say, but it's what they are doing.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread William Overington

James Kass wrote as follows.

William Overington wrote,


 No, it is a story about an artist who wanted to paint a picture of a
 horse and a picture of a dog and, since he knew that the horse and the
 dog were great friends and liked to be together and also that he only
 had one canvas upon which to paint, the artist painted a picture of a
 landscape with the horse and the dog in the foreground, thereby, as the
 saying goes, painting two birds on one canvas,
 http://www.users.globalnet.co.uk/~ngo/bird0001.htm
 in that he achieved two results by one activity.  In addition the picture
 has various interesting details in the background, such as a windmill in
 a plain (or is that a windmill in a plain text file).  :-)


1)  It's gif file format rather than plain text.*
2)  There isn't any windmill.

The picture of the birds has been in our family webspace since 1998 as an
illustration for the saying Painting two birds on one canvas.  That
saying, originated by me, is a peaceful saying meaning to achieve two
results by one activity.  I made the picture from clip art as a learning
exercise.

The picture of the birds is referenced as a way of illustrating the saying
Painting two birds on one canvas.  It is not the picture in the story
about which Ken asked.  I may well have a go at constructing such a picture,
perhaps using clip art.  The reference to a windmill is meant as a humorous
aside to Don Quixote tilting at windmills.

I am interested in creative writing, so when Ken asked about the story, I
just thought of something to put in my response.  Part of the training in,
and the fun of, creative writing is to be able to write something promptly
to a topic.

William Overington

16 August 2002







Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)

2002-08-16 Thread William Overington

Tex Texin wrote as follows.

William,

So let me see if I understand this correctly.

Let's take 2 perfectly good standards, Unicode and HTML,

Yes.

and make some
very minor tweaks to them,

No.

such as changing the meaning of U+FFFC and a
special format for filenames in the beginning of the file and a new
extension, so we have something new.

I have suggested no changes whatsoever to HTML.

The only thing which I have suggested in relation to Unicode in this thread
is that, in relation to the fact that information about the object to which
any particular use of U+FFFC refers is kept outside the character data
stream, it could be a good idea to define a file format .uof so that
details of the names of the files for which the U+FFFC codes are anchors
could be provided in a known format, if and only if end users chose to use a
.uof file for that purpose on that occasion and not otherwise.  This was in
the context of seeking to protect the use of U+FFFC as a character which
could be used in the interchange of documents, following from the discussion of
U+FFFC and annotation characters in the thread from which I spun this
thread, which discussion, by Ken and Doug, is repeated in the first posting
of this present thread.

I thought it a good idea that the Unicode Technical Committee might like to
make such a .uof file format an official Unicode document so as to offer one
possible way to use U+FFFC codes.  That is now a matter for discussion.  If
the Unicode Consortium wishes to do that, then fine.  If the Unicode
Consortium chooses not to do that, then I can write it up myself and publish
it, which is not such a good solution, yet is adequate for my own needs and
might be useful for some other people if they choose to use the same format
for .uof files.

Hopefully I have now managed to raise the issue of protecting the fact that
the U+FFFC character can be used in document interchange, and that it will
not become deprecated to the status of a noncharacter.

There is a practical reason for this, which is, from my own perspective,
quite important.  This is as follows.

The DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system
(details at http://www.mhp.org ) implements my telesoftware invention.
A Java program which has been broadcast can read a Unicode plain text file
and act upon the characters within it, and can read other file formats, such
as .png files (Portable Network Graphics) and act upon the information in
those files, so as to produce a display.

So, a collection of files, namely a .uof file in the format that I suggested,
a Unicode plain text file with one or more U+FFFC characters in it and
the appropriate graphics files in .png format as a package of free to the
end user distance education learning material being broadcast from a direct
broadcasting satellite or a terrestrial transmitter could be a very useful
facility as the way to carry text with illustrations.

Using HTML and a browser is just not the way to proceed in that situation.
HTML and a browser is a very useful technique for the web and indeed is an
option for the DVB-MHP system, yet the basic software system is Java based.
It is as if the television set is acting as a computer which has a slow read
only access disc drive in the sky from which it may gather information,
including software.  The system is interactive with no return information
link to the central broadcasting computer, by means of the telesoftware
invention.  Overlays and virtual running with programs bigger than the local
storage being able to be run using chaining techniques are possible.  Please
do not think of this as downloading as no uplink request is made!

Now the big benefit of this completely new thing,

Well, it's only a way of sender and receiver being able to have information
in a file with the suffix .uof about what objects are being anchored by
U+FFFC codes in a Unicode plain text file which it accompanies.

is that programs that
do desktop publishing can use plain text files which are not quite plain
text because they have some special formatting,

Well, the plain text files are only Unicode plain text which might contain
one or more U+FFFC characters and some of the other Unicode control
characters such as CARRIAGE RETURN.

but now they can publish
them in better manner than before.

Well, my thinking is that it would help to have a well known way to express
the meaning of the anchors encoded by U+FFFC in a file rather than having
only a vague specification that all other information about the object is
kept outside the data stream.  I am saying that, yes, all other information
about the object is kept outside the data stream and, if, and only if, end
users choose to use a .uof file in a standard format to convey that
information for some particular use of a U+FFFC code, then that format could
be considered for definition and publication by the Unicode Consortium.
That does not seem unreasonable to me.  

Re: Double Macrons on gh (was Re: Tildes on Vowels)

2002-08-16 Thread Doug Ewell

Peter_Constable at sil dot org wrote:

 Broad ranges of Planes 0 and 1 have been tentatively blocked out on
 the Roadmap for RTL scripts.

 Oh? I was somewhat sharply rebuked a few years for suggesting that
 such a thing be done. References to relevant documentation, please?

It looks like the dog ate my homework.  The Roadmap pages I was
referring to:

http://www.unicode.org/roadmaps/bmp-3-7.html  (for Plane 0)
http://www.unicode.org/roadmaps/smp-3-3.html  (for Plane 1)

no longer contain the gray-shaded areas indicating where RTL scripts
are, or were, *TENTATIVELY* blocked out.  The BMP page does still
contain a note explaining this convention:

Areas containing RTL scripts, as well as the Surrogates Zone and the
Private Use Zone are shaded grey here informatively.

but they aren't any more.

Also, the links to PDF versions are broken, so I can't tell whether the
PDF files still contain gray blocks or not.

I would guess a claim that we could absolutely, positively guarantee
that characters in a particular range would always be RTL would earn a
rebuke.  I was just going by what the Roadmaps (used to) say, and that's
why I referred specifically to the Roadmaps and used the word
tentatively.  Should've double-checked first, though.

-Doug Ewell
 Fullerton, California





Re: Furigana

2002-08-16 Thread Tex Texin

John,
Why would you want them to be for internal-use only? When you exchange
regular expressions wouldn't you want operators such as any character
to be passed as well, and standardized so that there is agreement on the
meaning of the expression?

It is also not clear to me that it is desirable to encode operators of
regular expressions as individual characters, because then you get into
the slippery slope of encoding operators for every function that someone
might want, and that is what started this thread isn't it...
(But a Unicode APL operator set would be nice. ;-) )

tex

John Cowan wrote:
 
 Tex Texin scripsit:
 
  At the time (in the discussion), I don't think we had many examples of
  what the uses would be, and it wan't clear that many were needed, since
  the functionality could be arrived at with higher level protocols.
 
 One application that has always seemed obvious to me is regular expressions:
 a compiled regular expression can be represented by a Unicode string,
 with non-characters representing things like any character, zero or more,
 one or more, beginning of string, end of string, etc. etc.
 
 --
 John Cowan   [EMAIL PROTECTED]   http://www.ccil.org/~cowan
 One time I called in to the central system and started working on a big
 thick 'sed' and 'awk' heavy duty data bashing script.  One of the geologists
 came by, looked over my shoulder and said 'Oh, that happens to me too.
 Try hanging up and phoning in again.'  --Beverly Erlebacher

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Mac OS X Keyboard Layouts (was Re: new version of Keyman)

2002-08-16 Thread Marion Gunn

Arsa Deborah Goldsmith [EMAIL PROTECTED]:
There is lots of good news about keyboards in Mac OS X 10.2, none of

Thank you for that rapid, if intriguing response, Deborah.

which I'm allowed to discuss until August 24, unfortunately. If you
have signed an Apple non-disclosure agreement, write me privately and

I have (signed many an Apple non-disclosure agreement), the first of which,
over a decade ago, established EGT's symbiotic relationship with Apple and
enabled my series of translations of generations of Apple Mac operating
systems into Irish - not to mention several much-loved Claris products in
the interim - here's a big 'hi' to any ex-Claris people reading this.:-)
Please write to me privately, as one always bound by those agreements,
Deborah.

If you could answer, as well, another question of great importance to my
local community, I'd appreciate that, Deborah - the question is (given that
EGT fostered/financed the development and distributed free-of-charge via
its own site for so many years the keyboards made in-house here to serve
many small linguistic communities), will Apple's new keyboards (including
those for the 'Celtic' languages) be free of charge to users (that is, will
EGT's policy of not charging end-users a penny for their use be continued)?

I hope it will,
mg


I'll blab about all of it. :-) I will be discussing all this and more
at the San Jose Unicode conference, which, thankfully, is after August
24.

I will try to post something on August 24 giving the basics.

Deborah Goldsmith
Manager, Fonts & Unicode
Apple Computer, Inc.
[EMAIL PROTECTED]


--
Marion Gunn * EGT (Estab.1991) * http://www.egt.ie *
fiosruithe/enquiries: [EMAIL PROTECTED] * [EMAIL PROTECTED] *






Re: some cedillas

2002-08-16 Thread John Hudson

At 06:57 AM 16-08-02, Michael Everson wrote:

The Times Atlas of the World uses t-cedilla, d-cedilla, and h-cedilla in 
transcriptions of Yemen placenames.

I would expect those cedillas to be dots below the letters for standard 
Arabic transliteration.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

Language must belong to the Other -- to my linguistic community
as a whole -- before it can belong to me, so that the self comes to its
unique articulation in a medium which is always at some level
indifferent to it.  - Terry Eagleton





Re: Furigana

2002-08-16 Thread John Cowan

Tex Texin scripsit:

 Why would you want them to be for internal-use only? When you exchange
 regular expressions wouldn't you want operators such as any character
 to be passed as well, and standardized so that there is agreement on the
 meaning of the expression?

Regular expressions are usually interchanged using (some approximation of)
Posix syntax, so as "abc.*\*", not "abcANYSTAR*".  Note the phrase
"compiled form" in my posting.
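
A minimal sketch of such a compiled form (an editorial illustration, not part
of the original posts), assuming, purely hypothetically, the noncharacters
U+FDD0 and U+FDD1 as internal ANY and STAR operators:

    # Internal-use noncharacters stand in for regex operators inside a plain
    # Python string.  The code point choices and names are assumptions.
    ANY  = "\uFDD0"   # matches any single character (Posix ".")
    STAR = "\uFDD1"   # zero or more of the preceding item (Posix "*")

    def compile_pattern(posix):
        """Translate a tiny subset of Posix syntax ('.', '*', literals and
        backslash escapes) into the internal 'compiled' string form."""
        out, i = [], 0
        while i < len(posix):
            c = posix[i]
            if c == "\\" and i + 1 < len(posix):
                out.append(posix[i + 1])   # escaped literal, e.g. \* -> "*"
                i += 2
            elif c == ".":
                out.append(ANY)
                i += 1
            elif c == "*":
                out.append(STAR)
                i += 1
            else:
                out.append(c)
                i += 1
        return "".join(out)

    # The interchange form "abc.*\*" compiles to "abc" + ANY + STAR + "*".
    assert compile_pattern(r"abc.*\*") == "abc" + ANY + STAR + "*"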

 It is also not clear to me that it is desirable to encode operators of
 regular expressions as individual characters, because then you get into
 the slippery slope of encoding operators for every function that someone
 might want, and that is what started this thread isn't it...

Ah, but for internal use you can do what you want with the 66 non-characters
and the 4 pseudo-non-characters.

 (But a Unicode APL operator set would be nice. ;-) )

Um, we have one of those, don't we?

-- 
John Cowan
[EMAIL PROTECTED]
I am a member of a civilization. --David Brin




22nd Unicode Conference, Sep 2002, San Jose, CA -- Just 3 weeks to go!

2002-08-16 Thread Misha . Wolf

***
Register now!  Just 3 weeks to go  Register now!  Just 3 weeks to go
***

 Twenty-second International Unicode Conference (IUC22)
 Unicode and the Web: Evolution or Revolution?
http://www.unicode.org/iuc/iuc22
  September 9-13, 2002
  San Jose, California

***
Full program now live!  Five days of 3 tracks!  Check the Web site!
***

NEWS

  Visit the Conference Web site ( http://www.unicode.org/iuc/iuc22 )
   to check the Conference program and register.  To help you choose
   Conference sessions, we've included abstracts of talks and speakers'
   biographies.

  Guest rooms at the DoubleTree Hotel San Jose still available at the
   conference rate.

  Early bird registration rate extended to 23 August.

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Microsoft Corporation
   Netscape Communications
   Oracle Corporation
   Reuters Ltd.
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.
   For details, visit the Conference Web site.

CONFERENCE VENUE

The Conference will take place at:

   DoubleTree Hotel San Jose
   2050 Gateway Place
   San Jose, CA 95110
   USA

   Tel: +1 408 453 4000
   Fax: +1 408 437 2898

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
+1 858 638 0504 (fax)

   Email: [EMAIL PROTECTED]
  or: [EMAIL PROTECTED]

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding. The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646. In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations. Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail [EMAIL PROTECTED]

   *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.





- ---
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.




Re: Furigana

2002-08-16 Thread Tex Texin



John Cowan wrote:
 
 Tex Texin scripsit:
 
  Why would you want them to be for internal-use only? When you exchange
  regular expressions wouldn't you want operators such as "any character"
  to be passed as well, and standardized so that there is agreement on the
  meaning of the expression?
 
 Regular expressions are usually interchanged using (some approximation of)
 Posix syntax, so as "abc.*\*", not "abcANYSTAR*".  Note the phrase
 "compiled form" in my posting.

Seems like a very minor optimization then. (I am not saying it is undesirable,
just that it is a small benefit.)
 
  It is also not clear to me that it is desirable to encode operators of
  regular expressions as individual characters, because then you get into
  the slippery slope of encoding operators for every function that someone
  might want, and that is what started this thread, isn't it...
 
 Ah, but for internal use you can do what you want with the 66 non-characters
 and the 4 pseudo-non-characters.

Yes. Same thing is true for higher level protocols.
 
  (But a Unicode APL operator set would be nice. ;-) )
 
 Um, we have one of those, don't we?

Sorry, I was unclear. I meant this in the context of encoding a set of
APL-like operators for working on Unicode text and manipulating it in
regular expressions, going well beyond the "any character" and "zero or
more" operators.

tex

 
 --
 John Cowan
 [EMAIL PROTECTED]
 I am a member of a civilization. --David Brin

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Revised proposal for Missing character glyph

2002-08-16 Thread Carl W. Brown

Proposed unknown and missing character representation.  This would be an
alternative to the method currently described in Section 5.3.

The missing or unknown character would be represented as a series of
vertical hex digit pairs for each byte of the character.  BMP characters
would be represented with 4 hex digits or two pairs of hex digits.  Plane
1-16 characters would be represented as 6 digits or 3 pairs of digits.
Garbage data with non-zero bits 24-31 may require 8 digits or 4 pairs of
digits.
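
A rough sketch in Python of how a renderer might derive those hex-digit pairs
from a scalar (or garbage 32-bit) value; the function name and the exact width
rules are assumptions for illustration, not part of the proposal:

    def hex_digit_pairs(value):
        """Return the two-hex-digit pairs to be stacked vertically,
        one pair per byte of the character value."""
        if value <= 0xFFFF:
            width = 4          # BMP: two pairs
        elif value <= 0x10FFFF:
            width = 6          # Planes 1-16: three pairs
        else:
            width = 8          # garbage with non-zero bits 24-31: four pairs
        digits = format(value, "0{}X".format(width))
        return [digits[i:i + 2] for i in range(0, len(digits), 2)]

    # Examples: U+0041 -> ['00', '41']; U+1D11E -> ['01', 'D1', '1E'];
    # the garbage value 0xFFFF0041 -> ['FF', 'FF', '00', '41'].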

This representation would be recognized by untrained people as unrenderable
data or garbage.  It would thus serve the same function as a missing-glyph
character, except that it would look different from normal glyphs, so readers
would know that something was wrong and that the text did not just happen to
contain odd characters.

It would aid people in finding the problem, and for people with Unicode books
the text would be decipherable.  If the information were truly critical, they
could have the text deciphered.

The missing character glyphs will be best rendered as a series of glyphs by
a font engine capable of glyph positioning.  If that is not possible it
could also be rendered by displaying a fractional space followed by a set of
two to three hex-pair glyphs for each character byte, followed by another
fractional space.  This would require 256 glyphs for the vertical hex pairs
and a fractional space glyph.

This proposal would provide a standardized approach that vendors could adopt
to clarify missing-character rendering and reduce support costs.  Including it
in the standard would give a consistent, cross-vendor solution.




RE: OCR characters

2002-08-16 Thread Winkler, Arnold F

Otto,

I am looking at ISO 1073/II-1976:

The two erase characters are the only members of set #5; reference numbers
are 120 and 121.  The Remarks column is empty.  Clause 6.4 says: "Application
advice is given in the column Remarks, where it is indicated, inter alia,
which characters are included for general purpose use only and should not be
used for OCR purposes."  (I guess an empty column means that the character
can be used for OCR.)

I have not found any more information in ISO 1073/II:1976.  Sorry.

Arnold

 -Original Message-
 From: Otto Stolz [mailto:[EMAIL PROTECTED]]
 Sent: Friday, August 16, 2002 10:30 AM
 To: Winkler, Arnold F
 Cc: Eric Muller; [EMAIL PROTECTED]
 Subject: Re: OCR characters
 
 
 Eric Muller had written:
 
  In our OCR fonts, we have two glyphs named erase [...]
 
  and grouperase [...] I suspect those are mandated by these 
  standards. On the other hand, and I can't find traces of those in 
  Unicode,
 
 
 Arnold F. Winkler wrote:
   I believe, Eric is talking about the characters on the 
 attached page 8 of
   the OCR standard.
 
 I don't have ISO 1073 at hand, only the German
 - DIN 66 008 (Jan 1978), which is essentially identical with ISO 1073/I-1976,
   and
 - DIN 66 009 (Sept. 1977), which is based on, but not identical with,
   ISO 1073/II-1976.
 
 DIN 66 008 contains the figure reported by Arnold Winkler.  This standard
 does not specify the intended usage of these characters -- not beyond their
 expressive names.
 
 DIN 66 009 says about the equivalent OCR-B characters (my translation):
   In case of a typo, a keyboard-driven device will print the Character Erase
   on top of an erroneous character.  This will cause the OCR reading device
   to ignore this position.
   The Group Erase may be either drawn by hand, or printed as discussed in
   the previous paragraph.  It will cause the OCR reading device to ignore
   this position.
 
 So, these characters would never be read by an OCR device.  They would be
 printed only in response to a function key (such as Erase Backwards), but
 never sent (encoded as characters) to a device.  This means that they will
 not normally be encoded, hence there will probably be no need to assign
 Unicodes to them.
 
 The only exception could be a text discussing these characters, and
 their usage.  I think this sort of text would use figures rather than
 characters, to show the effect of overprinting in several variants.
 (The Erase, and the erased, character's positions may slightly differ.)
 
 So I guess these characters are deliberately left off Unicode.
 
 Best wishes,
    Otto Stolz
 




RE: OCR characters

2002-08-16 Thread Winkler, Arnold F

Folks, that is my VERY LAST post on this VERY OLD subject:

In the L2 document register I found L2/98-397
(http://www.unicode.org/L2/L2/98396.pdf), which is a proposal for ISO/IEC TR
15907, a Type 3 TR for the revision of ISO 1073/II:1976.

On page 18 is a note that says:

NOTE – The glyphs previously defined with reference numbers 120 (CHARACTER
ERASE) and 121 (GROUP ERASE) have been deleted.

That's the end of my digging in older documents.

And have a nice weekend too !

Arnold



 -Original Message-
 From: Otto Stolz [mailto:[EMAIL PROTECTED]]
 Sent: Friday, August 16, 2002 10:30 AM
 To: Winkler, Arnold F
 Cc: Eric Muller; [EMAIL PROTECTED]
 Subject: Re: OCR characters
 
 
 Eric Muller had written:
 
  In our OCR fonts, we have two glyphs named erase [...]
 
  and grouperase [...] I suspect those are mandated by these 
  standards. On the other hand, and I can't find traces of those in 
  Unicode,
 




Unicode.org downtime reminder

2002-08-16 Thread Sarasvati

This is a reminder.

The Unicode.ORG system (web services, ftp, and mail lists) will be taken
off-line sometime today for maintenance and upgrades. We will keep the
downtime as short as possible. 

You will receive another note when the system comes back up, but
it may not be possible to warn you again before the system is taken
off-line, due to scheduling with our service provider.

Regards, 
-- Sarasvati 




Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)

2002-08-16 Thread Peter_Constable

On 08/15/2002 06:41:59 AM William Overington wrote:

In essence, though not formally, U+FFF9..U+FFFC are non-characters as
well, and the Unicode semantics just tells what programs *may* find 
them
useful for.  Unicode 4.0 editors: it might be a good idea to emphasize
the close relationship of this small repertoire with the non-characters.

That is not what the specification says.

William, John knows what he is talking about, and is exactly correct: in 
essence, though not formally, FFF9..FFFC are non-characters. No, the 
Standard doesn't say that; that's why he said "not formally". The use 
intended by the Standard is, however, exactly comparable to the 
non-characters at FDD0..FDEF. If they had been defined in the Standard as 
non-characters, the world would not be different in any meaningful way.



It appears to me that the use of the annotation characters in document
interchange is never forbidden and is strongly discouraged only where 
there
is no prior agreement between the sender and the receiver, and that that
strong discouragement is because the content may be misinterpreted
otherwise.  So, if there is a prior agreement, then there is no problem
about using them in interchanged documents.

There appears to be nothing that suggests that U+FFFC cannot be used in 
an
interchanged document.

Well, you've missed the intent of the authors of the Standard, and appear 
not to grasp the mindset. When it says interchange of IA characters may be 
OK given prior agreement, what's really in mind is that e.g. I've written 
code library A that handles some aspects of interlinear annotation, you've 
written code library B that handles different aspects of interlinear 
annotation, and we agree on certain interfaces so that my library can call 
yours or vice versa, and agree that strings passed by those interfaces can 
contain IA characters. That's the kind of thing that's in mind. It does 
*not* imply that anyone should consider creating a document containing IA 
characters. 
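
To make that kind of agreement concrete, here is a minimal Python sketch,
purely illustrative and not anything defined by the Standard or by the posts
above: two hypothetical cooperating libraries pass annotation in-band using
U+FFF9 (anchor), U+FFFA (separator) and U+FFFB (terminator). The function
names and the library split are invented; only the character semantics come
from the Standard.

    FFF9, FFFA, FFFB = "\uFFF9", "\uFFFA", "\uFFFB"

    def pack_annotation(base, annotation):
        """'Library A': embed an annotation (e.g. furigana) with its base text."""
        return FFF9 + base + FFFA + annotation + FFFB

    def strip_annotations(s):
        """'Library B': recover plain displayable text, dropping annotations."""
        out, keep = [], True
        for ch in s:
            if ch == FFF9:
                continue          # start of annotated run: keep the base text
            elif ch == FFFA:
                keep = False      # annotation text follows: suppress it
            elif ch == FFFB:
                keep = True       # end of annotated run
            elif keep:
                out.append(ch)
        return "".join(out)

    ruby = "This is " + pack_annotation("kanji", "reading") + " text."
    assert strip_annotations(ruby) == "This is kanji text."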



I know little about Bliss symbols, though I have seen a few of them and 
have
read a brief introduction to them, yet it seems to me that annotating 
Bliss
symbols with English or Swedish is entirely within the specification
absolutely and would be no more than strongly discouraged even if there 
is
no prior agreement between the sender and the receiver.

Of course the Standard doesn't discourage anyone from annotating Bliss 
symbols with English or Swedish; it only discourages the use of IA 
characters as markup in documents.



Further, it seems to me from the published rules that these annotation
characters could possibly be used to provide a footnote annotation 
facility
within a plain text file

That would not be a proposal worth pursuing; in fact, I'd say it's a very 
bad idea. The reason you DO NOT want to use IA characters in a document is 
that you do not know what someone's software will do with them. The 
characters have always been intended for use by software programmers, not 
by content authors. (Ditto for the object replacement character.)



An interesting point for consideration is as to whether the following
sequence is permitted in interchanged documents...

It seems to me that if that is indeed permissible that it could 
potentially
be a useful facility.

On the whole, it would be very unwise to use these characters in documents 
for reasons I explained above. If two people agree to do this, nobody's 
going to send the Unicode police to stop them. But very few of us on this 
list are particularly interested in what is hypothetically possible for 
some pair of us to do. We're far more interested in how widely-used 
implementations should and do work, and in such implementations, 
FFF9..FFFC are assumed not to be used in content.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]