RE: Several BOMs in the same file

2003-03-25 Thread Marco Cimarosti
Stefan Persson wrote:
> Let's say that I have two files, namely file1 & file2, in any Unicode 
> encoding, both starting with a BOM, and I compile them into 
> one by using
> 
> cat file1 file2 > file3
> 
> in Unix or
> 
> copy file1 + file2 file3
> 
> in MS-DOS, file3 will have the following contents:
> 
> BOM
> contents from file1
> BOM
> contents from file2
> 
> Is this in accordance with the Unicode standard, or do I have 
> to remove the second BOM?

IMHO, Unicode should not specify such a behavior. Deciding what a shell
command is supposed to do is a decision of the operating system, not of text
encoding standards.

BTW, consider that both Unix "cat" and DOS "copy" are not limited to Unicode
text files. Actually, they are not even limited to text files at all: you
could use them to concatenate a bitmap with a font with an HTML document
with a spreadsheet... whether the result makes sense or not is up to you
and/or to the applications that will process the resulting file.

Probably, there should be two separate commands (or different options of the
same command): to do a raw byte-by-byte concatenation, and to do an
encoding-aware concatenation of text files.

E.g., imagine a "cat" command with these extensions:

Synopsis
cat [ -... ] [ -R encoding ] { [ -F encoding ] file }
Description:
...
If neither -R nor any -F is specified, the concatenation is
done byte by byte.
Options:
...
-R  specifies the encoding of the resulting *text* file;
-F  specifies the encoding of the following *text* file.

Your command above would now expand to something like this:

cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3

Provided with information about the input encodings and the expected output
encoding, "cat" could now correctly handle BOM's, endianness, new-line
conventions, and even perform character set conversions. Without this extra
info, "cat" would retain its good ol' byte-by-byte functionality.

Similar options could be added to any Unix command potentially dealing with
text files ("cp", "head", "tail", etc.), as well as to their equivalents in
DOS or other operating systems.

_ Marco



RE: List of ligatures for languages of the Indian subcontinent.

2003-03-18 Thread Marco Cimarosti
Kenneth Whistler wrote:
> Dream on. The information needed exists in books and other
> reference source in libraries, book shops, and other collections
> across India -- and, for that matter, around the world. It is
> "merely" a matter of collecting the relevant information and
> distilling it into succinct, yet complete, statements of the
> relevant information needed for proper typographic practice
> for each script, for each style of each script, for each local
> typographic tradition for each style, and so on.

A couple of hints for William and other people interested in this issue:

-   Akira Nakanishi, "Writing Systems of the World -- Alphabets,
Syllabaries, Pictograms", Tuttle 1980(1999), ISBN 0804816549.
This charming little book explores all the scripts used in the
world today, giving for each of them a table of all the signs (apart from
Chinese, of course) and an explanation of how the script works. For each
script, it also reproduces a page from a daily newspaper written in that
script. The information is not always 100% accurate; however, the book
remains an invaluable introduction to the scripts of the world, and a
perfect complement to the reading of the Unicode Standard.

-   The grammars in the "National Integration Series" by Balaji
Publications, Madras, India.
Each grammar in this series is a small A5-format book bearing a
title like "Learn <language> in 30 Days through English". The grammars
are not very sound from a linguistic point of view (it's unlikely that the
reader will actually learn an Indian language in one month!), but they all
have a very interesting introduction to the script used by each language,
which normally also includes a table of all the combinations of
consonant+vowel, a table of the essential consonant clusters, and a table
of the half or subjoined consonants. If you compare the grammars of
languages sharing the same script (such as Sanskrit, Hindi, and Marathi,
all written with the Devanagari script), you can verify how the list of
required "ligatures" varies from one language to another. Notice that
these books too are far from being 100% accurate.

All the above books are inexpensive and easily found in bookshops in the
UK and elsewhere.

Another good source for making a list of required glyphs is the existing
non-Unicode fonts for Indic languages. The nicest free collection I have
seen so far is the Akruti GNU TrueType fonts, which contain a set of glyphs
appropriate for most modern usages:

http://www.akruti.com/freedom/

_ Marco



RE: Need encoding conversion routines

2003-03-14 Thread Marco Cimarosti
askq1 askq1 wrote:
> >From: "Pim Blokland" <[EMAIL PROTECTED]>
> 
> >However, you have said this is not what you want!
> >So what is it that you do want?
> 
> I want c/c++ code that will give me UTF8 byte sequence 
> representing a given code-point,
> UTF16 16 bits sequence reppresenting a given 
> code-point, UTF32 
> 32 bits sequence representing a given code-point.
> 
> e.g.
> 
> UTF8_Sequence CodePointToUTF8(Unichar codePoint)
> {
> //I need this code
> }
> 
> UTF16_Sequence CodePointToUTF16(Unichar codePoint)
> {
> //I need this code
> }
> 
> UCS2_Sequence CodePointToUCS2(Unichar codePoint)
> {
> //I need this code
> }

Hint:

#include "ConvertUTF.h"
typedef UTF32 Unichar;
typedef UTF8  UTF8_Sequence  [4 + 1];
typedef UTF16 UTF16_Sequence [2 + 1];
typedef UTF16 UCS2_Sequence  [1 + 1];
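
And here is a minimal sketch of the first of those functions, built on the
typedefs above (it assumes the code point has already been validated: a
real implementation, like the reference code in ConvertUTF.c, must also
reject surrogates and out-of-range values):

/* Sketch only: encode one valid code point as a NUL-terminated
   UTF-8 sequence (no surrogate or range checking). */
void CodePointToUTF8(Unichar cp, UTF8_Sequence out)
{
    if (cp < 0x80) {
        out[0] = (UTF8)cp;
        out[1] = 0;
    } else if (cp < 0x800) {
        out[0] = (UTF8)(0xC0 | (cp >> 6));
        out[1] = (UTF8)(0x80 | (cp & 0x3F));
        out[2] = 0;
    } else if (cp < 0x10000) {
        out[0] = (UTF8)(0xE0 | (cp >> 12));
        out[1] = (UTF8)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (UTF8)(0x80 | (cp & 0x3F));
        out[3] = 0;
    } else {
        out[0] = (UTF8)(0xF0 | (cp >> 18));
        out[1] = (UTF8)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (UTF8)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (UTF8)(0x80 | (cp & 0x3F));
        out[4] = 0;
    }
}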

_ Marco



RE: Need encoding conversion routines

2003-03-12 Thread Marco Cimarosti
askq1 askq1 wrote:
> I want c/c++ functions/routines that will convert Unicode to 
> UTF8/UTF16/UCS2 encodings and vice-versa. Can some-one point
> me where can I get these code routines?

Unicode's reference implementation is here, but I don't know how up to
date it is with respect to some recent tiny changes in UTF-8:

http://www.unicode.org/Public/PROGRAMS/CVTUTF/

You can also see the UTF functions provided by ICU, an open source Unicode
library:

http://oss.software.ibm.com/icu/

_ Marco






RE: Encoding: Unicode Quarterly Newsletter

2003-03-11 Thread Marco Cimarosti
Otto Stolz wrote:
> Beware: When the book is thrown at a large speed, the relativistic
> effects must be taken into account. I hope that the editors took
> pains to find a wording that will not upset anybody to the extend
> that he would throw the book away at a considerable fraction of
> the speed of light...

Yes, I think the Unicode Consortium should take steps to make sure that people
purchasing The Unicode Standard 4.0 are completely aware of all the possible
physical implications of such an action. E.g.:

"HEALTH WARNING: Care should be taken when lifting this book, since its
mass, and thus its weight, is dependent on its velocity relative to the
reader."

"WARNING: This book attracts every other piece of matter in the universe,
including books published by other publishers, with a force proportional to
the product of the masses and inversely proportional to the distance between
them."

"CAUTION: This book contains minute electrically charged particles moving at
velocities in excess of five hundred million miles per hour."

"WARNING: This book warps space and time in its vicinity."

"DISCLAIMER: Because of the Uncertainty Principle, it is impossible to find
out at the same time both precisely where this book is and how fast it is
moving."

"ATTENTION: There is an extremely small but non-zero chance that, through a
process known as 'tunneling', this book may spontaneously disappear from its
present location and reappear at any random place in the universe, including
your neighbor's bookshelf. The manufacturer will not be responsible for any
damages or inconvenience that may result."

"IMPORTANT NOTICE TO READERS: The entire physical universe, including this
book, may one day collapse back into an infinitesimally small space. Should
another universe subsequently emerge, the existence of this book in that
universe cannot be guaranteed."

"PLEASE NOTE: Some quantum physics theories suggest that when the consumer
is not directly observing this book, it may cease to exist or will exist
only in a vague and undetermined state."

"THIS IS A 100% MATTER PRODUCT: In the unlikely event that this book would
contact antimatter in any form, a catastrophic explosion will result."

_ Marco

(* Paraphrases from
http://www.milk.com/random-humor/quantum_physics_product_warnings.html)



RE: Encoding: Unicode Quarterly Newsletter

2003-03-11 Thread Marco Cimarosti
Kenneth Whistler wrote:
> [...]
> Of course, further weight corrections need to be applied if reading
> the standard *below* sea level or in a deep cave.

I hope it will not be considered pedantic to observe that the mass and
weight of a book do not change depending on whether someone is reading it or
not. Consequently, the same weight corrections need to be applied also if
someone *throws* the standard into a deep cave.

_ Marco



RE: Need program to convert UTF-8 -> Hex sequences

2003-03-04 Thread Marco Cimarosti
Doug Ewell wrote:
> David Oftedal  wrote:
> 
> > Hm yes, so I see, but I should have been more specific, I actually
> > need an app that can do this automatically, either in ansi C, Perl,
> > or a Linux binary. I need to call it from a script, so it's got to
> > happen automatically. The find-replace to add the 0x I can do, so
> > it's just a matter of converting glyphs to codes.
> 
> From ANSI C, simply write:
> 
> printf("0x%04X ", c);

And, assuming a Unicode implementation of the wide-character library, the
conversion program is a dozen lines of ANSI C:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wint_t c;
    setlocale(LC_ALL, "");  /* select the locale's multibyte encoding */
    while ((c = getwchar()) != WEOF)
        if (c >= 0x00 && c <= 0x7F)
            putchar((int)c);
        else
            printf("0x%04lX ", (unsigned long)c);
    return 0;
}
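
For example, on a system whose locale uses UTF-8 (the file and program
names below are just illustrative):

$ cc -o hexify hexify.c
$ echo 'æøå' | ./hexify
0x00E6 0x00F8 0x00E5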

_ Marco
 



RE: Need program to convert UTF-8 -> Hex sequences

2003-03-04 Thread Marco Cimarosti
David Oftedal wrote:
> [...] I need a program to convert UTF-8 to hex sequences.
> [...]
> For example, a file with the content æøå would yield the 
> output "0x00E6 0X00F8 0X00E5", and the Japanese
> expression あの人 would yield "0x3042 0x306E 0x4EBA".

SC UniPad, a Unicode editor, can do this for you:

http://www.unipad.org/main/

UniPad uses the Java-like format "\u3042", but you can then substitute every
instance of "\u" with "0x".

_ Marco





Khmer encoding model (had no subject)

2003-03-04 Thread Marco Cimarosti
Mijan wrote:
> [...]
> > >3. There are no other cases of a Vowel+Virama combination in the
> > >Unicode encoding model.
> > 
> > Yes, there are. Khmer.
> 
> I do not understand Khmer but I see that it does not use the 
> same 'encoding model'. Please look, you will see that you
> were wrong to use Khmer as an example.

What do you mean by not using the same "encoding model"?

There are actually three Indic scripts that have been encoded with a
different model: Tibetan (subscript letters are encoded separately, rather
than as combinations of virama + consonant), and Thai/Lao (reordrant vowel
marks are encoded in visual order, rather than in phonetic order).

But, AFAIK, this is not the case of Unicode Khmer, which is encoded in the
same way as the scripts of India.

_ Marco



RE: Unicode 4.0 beta characters.

2003-02-24 Thread Marco Cimarosti
William Overington wrote:
> [...]
> In the same U40-2600.pdf document are six Yijing monogram and digram
> symbols.  I wonder if someone could please say something about these
> characters as to their meaning.
> [...]

The Yijing (also spelled "Yi Jing", "I Ching", "I-Ching", etc.) is a very
famous book of ancient China. You can find it and information about it on
the Web or in any bookshop.

The Yi Jing divination system is based on tracing combinations of two
elements called Yin and Yang:

__ __   Yin  (an interrupted horizontal line)

_____   Yang (an uninterrupted horizontal line).

These are the two monograms. The four digrams (also seen on the flag of
South Korea), the eight trigrams and the sixty-four hexagrams are,
respectively, combinations of two, three, and six monograms stacked on top
of each other.

_ Marco



RE: accented cyrillic characters

2003-02-24 Thread Marco Cimarosti
Barnie De Los Angeles wrote:
> Even after studying the Unicode web site for a while I am not able to 
> find a solution for this issue.
> 
> The task is to include accented cyrillic characters (vowels 
> only) into 
> russian html. (Vowels are accented or "stressmarked" in Russian for 
> educational purpose.)
> 
> My html pages are always utf-8 encoded.
> 
> "Pre-accented" Russian vowels obviously do not exist as Unicode 
> characters of their own.
> 
> I only need one kind of accent. Its Unicode number is 
> probably 0301 and it is called "accent" or sometimes
> "stressmark".
> 
> The remaining question is how to "combine" this accent with a 
> vowel, or: how to get that dammed stress mark 0301 on top of a
> character?

You simply put the accent character *after* the letter character. Either
character can be encoded directly (e.g. in UTF-8) or with a numerical
reference:

а́
а&#769;
а&#x301;
&#1072;&#769;

The fourth notation should work independently of the page encoding, while
the other three require a charset declaration sent by your server, or one
inserted in the <head>...</head> section of your file:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The visual result depends on the fonts installed on the computers of the
people reading your page. The typical results are:

1. Two rectangles: no font supports Cyrillic or combining marks;

2. A rectangle with an accent on top of it: the font supports
combining marks but no Cyrillic;

3. A Cyrillic "a" followed by a rectangle:  the font supports
Cyrillic but no combining marks;

4. A Cyrillic "a" with an accent too high on top of it: the font
supports Cyrillic and combining marks, but it is not a "smart font" (the
accent is so high in order to also fit on a capital letter);

5. A Cyrillic "a" with an accent on top of it, placed at a correct
height: the font supports Cyrillic, combining marks, and it is a "smart
font".

As an author, what you can do to try and force result 5 (or 4, at least) is:

- Specifying that the piece of text should use one of the commonly
available fonts that fit your needs, in order of preference. You can do
this with a Cascading Style Sheet or with the <font> tag. E.g.:
<font face="Arial Unicode MS, Lucida Sans Unicode">а&#769;</font>
To do this, you must make some assumptions about the kind of
operating system(s) used by your users, and know which fonts are commonly
available on those computers.

- Adding a link to a help page (written in English and/or with
Russian text included as a picture) which explains to users how they can set
up their computers to have the proper font support.

- Doing nothing. You have done your part, encoding the page
correctly, so let the users do their homework too.

_ Marco



RE: [REPOST, LONG] XML and tags (LONG)

2003-02-21 Thread Marco Cimarosti
Doug Ewell wrote:
> William took exception to being "reduced" to a company in 
> this way, but I think it makes the scenario a bit more
> realistic. [...]

Indeed. I absolutely did not mean to be offensive or anything. Simply,
"Xyz Inc." sounded more like something that could pay my rent than
"William Xyz". (But, of course, anybody should feel welcome to pay my
rent :-)

> > 1. The text MUST be transmitted in UTF-8 (because the CEO of
> > Overington Inc. thinks that UTF-8 is cute).
> 
> That's a perfectly legitimate requirement.  BTW, I think UTF-8 is cute
> too.  :-)

And I as well. In my real life, I am selling UTF-8 as the key to migrate
applications to Unicode.

> Too bad the customer in this scenario didn't think SCSU was cute.

BTW, would it be possible to encode XML in SCSU?

> > a. An XML file is human readable and may be edited with any text
> > editor; although the Plain-14 file claims to be "plain text", each
> > language tag character appears as a three black boxes in any UTF-8
> > editor (and as a random twelve "accented" characters in a non-UTF-8
> > editor).
> 
> While I'm no longer in the business of defending Plane 14 tags, it
> should be mentioned that rendering engines are *not* supposed 
> to display tag characters as black boxes (although they all do).
> [...]

That's fine, but I must add that this ubiquitous bug is the "feature" that
allowed me to do minimal editing on my sample text file.

> As for the non-UTF-8 editor, well, UTF-8 was a customer 
> requirement, so not only will the tags display badly,
> so will every other character outside the Basic Latin
> range.

It was a requirement for the host system, not necessarily for every
developer's computer. In real life, many colleagues of mine are still using
DOS editors or old versions of VI, but they are still able to edit source
code in UTF-8, as long as they are just interested in the ASCII part (i.e.,
the commands, tags, statements, etc.).

_ Marco




RE: XML and tags (LONG) (derives from Re: Plane 14 Tag Deprecation Issue)

2003-02-21 Thread Marco Cimarosti
William Overington wrote:
>
> [... PageDown, Delete ... PageDown, Delete ... PageDown, Delete ...
>  PageDown, Delete ... PageDown, Delete ... PageDown, Delete ...]
>
> > 4. The text files being transmitted MUST be  small (bandwidth is
> >limited!).
> 
> Yes, keep the text file size down, bandwidth is limited.
> 
> > 5. The processing program must be  small (on-board memory is
> > limited!).
> 
> No, for DVB-MHP the on-board memory is fairly large.  The 
> transmission link is the key issue.

Also!

In this case, the solution is *already* out there, if you just accept to
adhere to standards (such as Unicode, Java, *XML*, and Plane-14 tags
deprecation).

You can throw away my silly toy parsers, and upload on your DVB-MHP a
*full-fledged* XML browser written in *Java*, such as this one, designed
exactly for the kind of embedded usage that you described:

http://www.x-smiles.org/xsmiles_objectives.html

When the browser is there, just feed it whatever kind of XML file you need.
You don't even have to decide the file format and encoding in advance:
among other things, X-Smiles understands XSL Formatting Objects (a kind of
style sheet even smarter than the Cascading Style Sheets that I used in my
example), so you just specify these details directly in the text file.

_ Marco




[REPOST, LONG] XML and tags (LONG) (derives from Re: Plane 14 Tag Deprecation Issue)

2003-02-21 Thread Marco Cimarosti
I sent this message yesterday but I didn't see it on the Unicode list.
Possibly, this was because the ZIP contained two executable programs: now I
removed them; anyway, the ZIP contains the source code.

BTW, I took the occasion to correct a few grammar errors...

_ Marco

-

(Warning: I have probably succeeded in the impossible task of being more
verbose than Mr. Overington. Please start reading only if you have some
free time... :-)

William Overington wrote:
>
> [... an interesting bibliography about XML ...]
>
> The more I read about XML the less reason there seems to be to use XML
> instead of tags!
>
> [... many interesting arguments ...]
>
> In particular, for the DVB-MHP (Digital Video Broadcasting -
> Multimedia Home Platform) there is a need to keep the programs
> as small as possible and to keep text files as small as
> possible.
>
> [... more interesting arguments ...]
>
> How would that be done using XML? Would it be done better 
> using XML than using tags? Why, or why not?
>
> [... even more interesting arguments and polite greetings ...]

I confess that I have not been patient enough to read *all* of Mr.
Overington's post. So, I apologize in advance if I have missed part or all
of William's point.

My job is to implement software based on written specifications which
represent my bosses' understanding of the requirements of our customers.
Unfortunately, the specifications that I receive are often verbose and fuzzy
like Mr. Overington's posts... :-) So I had to develop a survival strategy,
which is to quickly pass through the specification documents in search of
wording which might represent the core of what the customer actually wants.
Sometimes this works, sometimes not...

I will be pretending that William is "Overington Inc.", one of the key
customers of the company I work with, and that they are asking me to
implement a protocol to send text over the famous "Overington Multimedia
Broadcasting (OMB)", with the following requirements:

1. The text MUST be transmitted in UTF-8 (because the CEO of
Overington Inc. thinks that UTF-8 is cute).

2. The transmission protocol MUST implement some form of language
tagging (the details of the protocol are up to me). Particularly, the system
needs to distinguish English text from Italian text, because the two
languages will be displayed in different colors (green and red,
respectively).

3. The OveringtonHomeBox(tm) can only accept UTF-8 plain text
interspersed with escape sequences to change color. The escape sequences
have the form "{{color=1}}", where "1" is the id of a color (blue, in this
case).

4. The text files being transmitted MUST be darn small (bandwidth is
limited!).

5. The processing program MUST be darn small (on-board memory is
limited!).

6. A working prototype must be ready by tomorrow.

What I am asked to do is to define the protocol in point 2, and to implement
a software filter to produce the plain-text stream in point 3. As
development time is very tight, I cannot lose much time thinking about
it, so I have to choose one of the two solutions at the top of my mind:

P. Plane-14 language tags.

X. XML.

I instinctively decide for solution P (because I assume that it would be
simpler and yield smaller files) and start defining my language tagging
protocol:

P.1. According to the intended usage of plane-14 tags, each language
tag will be introduced by a u0E0001 (LANGUAGE TAG) and will terminate with a
u0E007F (CANCEL TAG).

P.2. Within each begin and end tag, I will use a single tag to
identify languages, in order to save space (point 4):
 - u0E0065 (TAG LATIN SMALL LETTER E) switches to English;
 - u0E0069 (TAG LATIN SMALL LETTER I) switches to Italian;
 - u0E005E (TAG CIRCUMFLEX ACCENT) switches back to the previous
language.

Equipped with this simple protocol, I produce a sample text file: see
 in the attached ZIP file, containing the following text:
 
"Let's learn the week days in Italian: 'Monday' is 'lunedì', 'Tuesday'
is" (...omitted...)

The English sentence is surrounded by tags u0E0001+u0E0065+u0E007F ...
u0E0001+u0E005E+u0E007F, while each embedded Italian word is surrounded by
tags u0E0001+u0E0069+u0E007F ... u0E0001+u0E005E+u0E007F.

Now I need to write a program that converts this file into a file containing
color switching commands, such as:

"{{color=2}}Let's learn the week days in Italian: 'Monday' is
{{color=4}}'lunedì'{{color=2}}, 'Tuesday' is"...

I begin writing a few utility functions to read and write UTF-8, to write
the color escape sequences, and to handle a simple stack data structure,
needed to implement tag u0E0001+u0E005E+u0E007F. See , in the
attached ZIP file.

Then, I implement my converter as a little program that reads the incoming
language-tagged file from standard input and writes on standard output the
plain text file containing the color escape sequences.
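
Just to give the flavor of it, here is a minimal sketch of the core of such
a converter in ANSI C (a simplification of the real thing: the input is
assumed to be already decoded from UTF-8 into an array of code points, the
language stack has a fixed depth, and all error handling is omitted):

#include <stdio.h>

enum { COLOR_ENGLISH = 2, COLOR_ITALIAN = 4 };

/* Write one code point to stdout as UTF-8. */
static void put_utf8(unsigned long cp)
{
    if (cp < 0x80)
        putchar((int)cp);
    else if (cp < 0x800) {
        putchar((int)(0xC0 | (cp >> 6)));
        putchar((int)(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        putchar((int)(0xE0 | (cp >> 12)));
        putchar((int)(0x80 | ((cp >> 6) & 0x3F)));
        putchar((int)(0x80 | (cp & 0x3F)));
    } else {
        putchar((int)(0xF0 | (cp >> 18)));
        putchar((int)(0x80 | ((cp >> 12) & 0x3F)));
        putchar((int)(0x80 | ((cp >> 6) & 0x3F)));
        putchar((int)(0x80 | (cp & 0x3F)));
    }
}

/* Translate plane-14 language tags into {{color=N}} escapes. */
static void convert(const unsigned long *text, size_t len)
{
    int stack[32];            /* stack of active colors */
    size_t sp = 0, i = 0;
    stack[0] = COLOR_ENGLISH; /* default language */

    while (i < len) {
        if (text[i] == 0xE0001UL && i + 2 < len
                && text[i + 2] == 0xE007FUL) {
            switch (text[i + 1]) {
            case 0xE0065UL:                     /* TAG 'e': English */
                if (sp < 31) stack[++sp] = COLOR_ENGLISH;
                break;
            case 0xE0069UL:                     /* TAG 'i': Italian */
                if (sp < 31) stack[++sp] = COLOR_ITALIAN;
                break;
            case 0xE005EUL:                     /* TAG '^': back */
                if (sp > 0) --sp;
                break;
            }
            printf("{{color=%d}}", stack[sp]);
            i += 3;
        } else
            put_utf8(text[i++]);
    }
}

int main(void)
{
    /* English "Hi ", then Italian "ciao", back to English, "!" */
    static const unsigned long demo[] = {
        0xE0001UL, 0xE0065UL, 0xE007FUL, 'H', 'i', ' ',
        0xE0001UL, 0xE0069UL, 0xE007FUL, 'c', 'i', 'a', 'o',
        0xE0001UL, 0xE005EUL, 0xE007FUL, '!'
    };
    convert(demo, sizeof demo / sizeof demo[0]);
    return 0;
}

Compiled and run, the sketch prints:

{{color=2}}Hi {{color=4}}ciao{{color=2}}!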

RE: traditional vs simplified chinese

2003-02-13 Thread Marco Cimarosti
Paul wrote:
> To: Edward H Trager
> > Marco Cimarosti has questioned, why do you need to classify 
> > text as being simplified or traditional?
> 
> if i understand their needs correctly, its to implement a 
> search system with search phrases of either "type" of 
> chinese--content would be in both types.

Still, I don't see the purpose of "classifying" the user input. What
they really need is a special collation algorithm that *ignores* the
difference between corresponding traditional and simplified characters for
the purpose of searching. This is somewhat analogous to making a "caseless"
search.

The easiest way to do it is "folding" both the user's query and the content
being sought to the same form (either traditional or simplified, it doesn't
matter). It may also help to "fold" other kinds of variants besides the
simplified and traditional ones.

This "folding" is much easy that implementing a full-fledged
simplified<->traditional conversion (which needs to be context sensitive and
dictionary-driven), because the result is just in a temporary buffer used
for comparison, and no one is going to see it.
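
A minimal sketch of the idea in C (the three trad/simp pairs below are just
the ones quoted elsewhere in this thread; a real table, generated e.g. from
Unihan.txt, would cover thousands of characters):

#include <stdio.h>
#include <stddef.h>

/* A few traditional -> simplified pairs; a real table would be
   generated from the Unihan variant data. */
static const struct { unsigned long trad, simp; } fold_table[] = {
    { 0x500BUL, 0x4E2AUL },  /* "ge"   */
    { 0x70BAUL, 0x4E3AUL },  /* "wei"  */
    { 0x8AACUL, 0x8BF4UL }   /* "shuo" */
};

/* Fold one code point to its simplified form, if it has one. */
static unsigned long fold(unsigned long cp)
{
    size_t i;
    for (i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++)
        if (fold_table[i].trad == cp)
            return fold_table[i].simp;
    return cp;
}

/* "Caseless"-style comparison: equal up to trad/simp differences. */
static int folded_equal(const unsigned long *a, const unsigned long *b,
                        size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (fold(a[i]) != fold(b[i]))
            return 0;
    return 1;
}

int main(void)
{
    static const unsigned long q[] = { 0x8AACUL, 0x500BUL }; /* traditional */
    static const unsigned long c[] = { 0x8BF4UL, 0x4E2AUL }; /* simplified  */
    printf("%s\n", folded_equal(q, c, 2) ? "match" : "no match");
    return 0;
}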

_ Marco




RE: Indic Vowel/Consonant combinations

2003-02-13 Thread Marco Cimarosti
Andy White wrote:
> > The Unicode Standard disagrees. TUS3.0, Chapter 9, page 214, 
> > Figure 9-3 ("Conjunct Formations"), example (4) [...]
> 
> In the light of Jim Agenbroads information and references, I 
> think this sentence is wrong.

Yes, in *that* light it is, of course... :-)

Just I think that the manual of an encoding standard is a better reference
for understanding that encoding standard than a dictionary of a dead
language.

> I'm sorry Marco, but you have got your rows and coulombs muddled!
> [...]
> I think you are looking at the third row which depicts 
> 'RA+UU' (U+0930 U+0942)
> [...]
> The fourth row up shows the combination of Ha + Vocali R 
> (U+0939 U+0943)

Oooops!

I stand corrected. I have mixed up two different rows.

_ Marco




RE: traditional vs simplified chinese

2003-02-13 Thread Marco Cimarosti
Edward H Trager wrote:
> [...]
> If I were going to write such an algorithm, I would:
> 
>  * First, insure that the incoming text stream to be classified was
>sufficiently long to be probabilistically classifiable.  In other
>words, what's the shortest stream of Hanzi characters needed, on
>average, in a typical Chinese text (on the web, for  example) in
>order to encounter at least one "ge" u+500B or u+4E2A? One "wei" 
>u+70BA or u+4E3A? One "shuo" u+8AAC or u+8BF4? It wouldn't take
>long to figure this out.

Lucky man! I was discussing a similar subject just yesterday, and
someone came up with this link:

http://lingua.mtsu.edu/chinese-computing/statistics/

The figures in the file make it easy to answer your question: in a typical
text, "ge" accounts for 3.54%, "wei" for 1.96%, "shuo" for 2.58%, etc.

_ Marco




RE: traditional vs simplified chinese

2003-02-13 Thread Marco Cimarosti
Zhang Weiwu wrote:
> Take it easy, if you find one 500B (the measure word)  it is 
> usually enough to say it is traditional Chinese, one 4E2A 
> (measure word)  is in simplified Chinese. They never happen 
> together in a logically correct document.

A few examples of perfectly "logically correct documents" that could contain
both characters:

- a bibliography containing books published in Mainland China and in Taiwan;

- an article about the Chinese writing system;

- the table of traditional vs. simplified Chinese characters;

- discussions on a Chinese newsgroup or mailing list attended by both
Mainland and Taiwan people;

- The following quotation (slightly edited): "Take it easy, if you find one
"個" (the measure word) it is usually enough to say it is traditional
Chinese, one "个" (measure word) is in simplified Chinese. They never happen
together in a logically correct document." :-)

_ Marco




RE: Indic Vowel/Consonant combinations

2003-02-13 Thread Marco Cimarosti
Andy White wrote:
> I think that Jim Agenbroad seems to have neatly come up with the
> solution, and if no one disagrees, this needs to be documented in TUCS
> or at least the Indic FAQ.

The Unicode Standard disagrees. TUS3.0, Chapter 9, page 214, Figure 9-3
("Conjunct Formations"), example (4) says that it should be encoded as
:

"RAd + RIn -> RIn + RAsup"

That's absolutely intentional, as explained in the following paragraph:

"A number of types of conjunct formations appear in these examples:
[...] and (4) a rare conjunct formed with an independent vowel letter, in
this case the vowel letter RI (also known as vocalic r). Note that in
example (4) in Figure 9-3, the dead consonant RAd is depicted with the
nonspacing combining mark RAsup (repha)."

> He said that Devanagri Letter Vocalic R with Superscript Letter Ra
> (Vowel R with reph) should be encoded as "Ra + Vowelsign Vocaliic R"
> (u+0930, u+0943)

The sequence <U+0930, U+0943> has indeed the same meaning (i.e.
pronunciation) as the sequence above, but a different visual
representation. See it in
TUS3.0, Chapter 9, page 222, Table 9-2 ("Sample Ligatures (Continued)"),
right-hand column, 4th row from bottom.

In this ligature, both U+0930 and U+0943 have their normal glyphs, but the
matra is joined in a unusual location (on the middle of the right side of
the letter, rather than below it)-

This visual representation actually exists (I have seen it often in
Sanskrit grammars), and is much more common than the other one.

> The answer to my original question "How then would you encode a visual
> U+0930, U+094D, U+090B" wil then be: "U+0930, U+094D, U+090B 
> of course!"

That would be <U+0930, U+094D, ZWNJ, U+090B>, of course!! When you want to
force a visible virama, you insert a ZWNJ; why clutter this simple rule
with meaningless exceptions?

_ Marco




RE: traditional vs simplified chinese

2003-02-13 Thread Marco Cimarosti
Paul Hastings wrote:
> i suppose this is a really simple minded question but is 
> there any way of telling if an incoming chunk of text
> (say from a browser form) is traditional or simplified
> chinese?

Please notice that the classification you want is not always meaningful.
E.g., what if the incoming text is in Spanish? Would you classify it as
traditional or simplified Chinese?...

Anyway. You can obtain the base data for each Chinese character from the
file http://www.unicode.org/Public/UNIDATA/Unihan.txt, by checking the
existence of the fields kSimplifiedVariant and kTraditionalVariant.

Any Unicode character falls into one of these four categories:

0) All characters not listed in Unihan.txt (i.e., non-Chinese
characters) are *neither* "Traditional" nor "Simplified";

1) All characters having a kSimplifiedVariant but *no*
kTraditionalVariant are "Traditional";

2) All characters having a kTraditionalVariant but *no*
kSimplifiedVariant are "Simplified";

3) All other characters listed in Unihan.txt are *both*
"Traditional" and "Simplified".

From these character-level categories, you can assign a category to the
input stream:

If at least one character has category 1 AND at least one character
has category 2, then:

stream is both "Traditional" and "Simplified" (category 3);

Else, if at least one character has category 1, then:

stream is "Traditional" (category 1);

Else, if at least one character has category 2, then:

stream is "Simplified" (category 2);

Else, if at least one character has category 3, then:

stream is both "Traditional" and "Simplified" (category 3
again);

Else (all characters have category 0):

stream is neither "Traditional" nor "Simplified" (category
0);

End.
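
In C, the stream-level rule could look like this sketch (char_category
here is a toy stand-in: a real classifier would be generated from the
Unihan fields named above):

#include <stdio.h>
#include <stddef.h>

enum cat { NEITHER = 0, TRAD = 1, SIMP = 2, BOTH = 3 };

/* Toy per-character classifier; a real one would be generated from
   the kSimplifiedVariant/kTraditionalVariant fields of Unihan.txt. */
static enum cat char_category(unsigned long cp)
{
    switch (cp) {
    case 0x500BUL: return TRAD;   /* has a simplified variant only  */
    case 0x4E2AUL: return SIMP;   /* has a traditional variant only */
    default:       return NEITHER;
    }
}

/* Apply the stream-level rule described above. */
static enum cat stream_category(const unsigned long *text, size_t len)
{
    int has_trad = 0, has_simp = 0, has_both = 0;
    size_t i;
    for (i = 0; i < len; i++)
        switch (char_category(text[i])) {
        case TRAD:    has_trad = 1; break;
        case SIMP:    has_simp = 1; break;
        case BOTH:    has_both = 1; break;
        case NEITHER: break;
        }
    if (has_trad && has_simp) return BOTH;
    if (has_trad) return TRAD;
    if (has_simp) return SIMP;
    return has_both ? BOTH : NEITHER;
}

int main(void)
{
    static const unsigned long text[] = { 'A', 0x500BUL };
    printf("%d\n", (int)stream_category(text, 2));  /* prints 1 = TRAD */
    return 0;
}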

Anyway, I don't see how this information could be of any use for any
purpose...

_ Marco




RE: Never say never

2003-02-12 Thread Marco Cimarosti
Kenneth Whistler wrote:
> > Marco Cimarosti wrote:
> > > It has been repeated a lot of times that no more 
> precomposed character
> > will
> > > never ever ever ever be added. ...
> 
> I trust the clarification from John Cowan helped on this -- there
> is no prohibition against adding characters with *compatibility*
> decomposition mappings, because compatibility decompositions do
> not recompose under normalization.

Yes, sorry for having misused the term "precomposed"; I should have said
"composed".

I did notice that the new character just has a compatibility decomposition:
if it had a canonical composition, I would have posted a formal error report
to "[EMAIL PROTECTED]", rather than just a lazy comment on the Unicode
List.

I am not arguing that the "FAX" character poses any technical problem, but
rather pointing out what looks like a case of disregarded policies. People
asking why Unicode doesn't contain a character for a given symbol are
routinely answered: "The symbol would not be added as a character because
it is just a sequence of the existing characters". After a few such sound
answers, one wonders why this is not also true for "FAX" and the other new
composed characters in 4.0.

In a private mail someone called this an HMS-Pinafore-policy, after a song
in a comic opera, "HMS Pinafore", in which the captain of the ship sings
about his exploits and the things that he has never done:

Chorus: "What, never?"
Captain: "No, never."
Chorus: "What, never?"
Captain: "Well... hardly ever."

Shouldn't these lyrics be added somewhere in the FAQ? :-)

_ Marco




Never say never

2003-02-11 Thread Marco Cimarosti
Unicode's "(n)ever"'s can sometimes be puzzling.

It has been repeated a lot of times that no more precomposed characters
will ever, ever, ever be added. But now I see from
http://www.unicode.org/charts/PDF/U40-2100.pdf that the following new
character will be added in 4.0:

- code: U+213B
- name: FACSIMILE SIGN
- compatibility decomposition: U+0046 U+0041 U+0058
- translation in plain English: a character to encode the word "FAX", in all
capitals.

_ Marco






RE: Handwritten EURO sign

2003-02-07 Thread Marco Cimarosti
Marion Gunn wrote:
> I wonder if any Unicoders have seen the handwritten EURO sign 
> which differs substantially from the usual computer-generated
> kind?

In Italy, it is becoming common to see a sort of left parenthesis crossed by
a small Z.

Notice that this is very similar to a common handwritten form of the lira
symbol ("£"): a vertical line crossed by a small Z.

_ Marco




RE: discovering code points with embedded nulls

2003-02-06 Thread Marco Cimarosti
Stefan Persson wrote:
> What is that strange file (winmail.dat) attached to your 
> mail?  I really hope that it isn't a virus.

http://support.microsoft.com/default.aspx?scid=KB;en-us;q241538

(Whether MS Outlook is a virus or not, is still a debated issue. :-)

_ Marco





RE: discovering code points with embedded nulls

2003-02-06 Thread Marco Cimarosti
Doug Ewell wrote:
> Kent Karlsson  wrote:
> 
> >> From what I'm hearing from you all is that a null in UTF-8 is
> >> for termination and termination only.
> >> Is this correct?
> >
> > No, NULL is a character (actually a control character) among many
> > others. However, many C/C++ APIs (mis)use NULL as a string 
> terminator
> > since NULL isn't very useful for other things.
> 
> The use of NULL to terminate strings is a basic part of the Standard C
> library, not just certain APIs.  As such, it doesn't seem 
> right to call this a "misuse" of the character.

Moreover, I always thought that serving as a string terminator or data-end
sentinel is exactly the function NUL was designed for.

_ Marco




RE: discovering code points with embedded nulls

2003-02-05 Thread Marco Cimarosti
Erik Ostermueller wrote:
> I'm dealing with an API that claims it doesn't support 
> unicode characters with embedded nulls.
> I'm trying to figure out how much of a liability this is.

If by "embedded nulls" they mean bytes of value zero, that library can
*only* work with UTF-8. The other two UTF's cannot be supported in this way.

But are you sure you understood clearly? Didn't they perhaps write "Unicode
*strings* with embedded nulls? In that case they could have meant that null
*characters* inside strings. I.e., they don't support strings containing the
Unicode character U+, because that code is used as a string terminator.
In this case, it would be a common and accepted limitation.

> What is my best plan of attack for discovering precisely 
> which code points have embedded nulls
> given a particular encoding?  Didn't find it in the maillist archive.
> I've googled for quite a while with no luck.  

The question doesn't make sense. However:

UTF-8: Only one character is affected (U+0000 itself);

UTF-16: In the range U+0000..U+FFFF (Basic Multilingual Plane), there are of
course exactly 511 characters affected (all those of the form U+00xx or
U+xx00), 484 of which are actually assigned. However, a few of these code
points are high or low surrogates, which means that many characters in the
range U+010000..U+10FFFF are affected as well.

UTF-32: All characters are affected, because the high byte of a UTF-32 unit
is always 0x00.
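
(The 511 figure is easy to verify with a few lines of C; this brute-force
count does not bother to exclude surrogates or unassigned code points:)

#include <stdio.h>

int main(void)
{
    unsigned long cp, count = 0;
    for (cp = 0x0000UL; cp <= 0xFFFFUL; cp++)
        if ((cp & 0xFF00UL) == 0 || (cp & 0x00FFUL) == 0)
            count++;  /* one of the two UTF-16 bytes is 0x00 */
    printf("%lu\n", count);  /* prints 511 */
    return 0;
}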

> I'll want to do this for a few different versions of unicode 
> and a few different encodings.

Most single and double-byte encodings behave like UTF-8 (i.e., a single
zero byte is only needed to encode U+0000 itself).

> What if I write a program using some of the data files 
> available at unicode.org?
> Am I crazy (I'm new at this stuff) or am I getting warm?
> Perhaps this data file: 
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
> 
> Algorithm:
> INPUT: Name of unicode code point file
> INPUT: Name of encoding (perhaps UTF-8)
> 
> Read code point from file.
> Expand code point to encoded format for the given encoding.
> Test all constituent bytes for 0x00.
> Goto next code point from file.

That would be totally useless, I am afraid.

The only UTF for which this count makes sense is UTF-8, and the result is
"one".

_ Marco




Tailored normalization (was RE: Public Review Issues update)

2003-02-04 Thread Marco Cimarosti
Rick McGowan wrote:
> Please note that the Issues for Public Review have been 
> updated with a new review item regarding tailoring of
> normalization. Please see issue number 7 on this page:

"The UTC is considering allowing limited tailoring of normalization forms."

My €0.02 worth comment:


Issue 7 is to be rejected because it is useless. It is trying to allow what
is already allowed and could not possibly be forbidden.


It has always been possible to invent alternative "normalization" schemes,
similar in principle, but not identical to any of the four Unicode standard
Normalization Forms. This is part of the processing that an application is
allowed to do to text, and that a user may expect a certain application to
perform.

E.g., the purpose of a certain program can be to convert traditional CJK
ideographs to simplified ones, or to transliterate one script into another,
or to change all uppercase letters to lowercase.

Some of these character-level operations can work in a very similar way to
the standard normalization forms (and, maybe, even reuse the same library
functions) but, IMHO, there is no need that the Unicode Standard explicitly
authorizes, endorses or even just acknowledges the existence of these
private normalization schemes.

IMHO, if you need to do such a non-standard normalization scheme, just do
it. But invent your own name for it: don't call it "tailored Unicode NFxx".

_ Marco








RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Marco Cimarosti
Keyur Shroff wrote:
> > However, I totally agree with Kent that this funny 
> rendering is *not* a
> > requirement of the Unicode standard, as Keyur Shroff seems 
> to suggest. It
> > is just an example of many "several methods [that] are 
> available to deal
> > with" strange sequences.
> 
> A sequence should not be treated as "strange" sequence if it has been
> written intentionally. It may have some contextual meaning.

I said "strange" in the sense of character sequences that are not part of
the ordinary spelling of any language. In fact, a thing like a matra
floating in the air or on a dotted circle is something that you'd only see
in a text (not necessarily *in* an Indian language) which talks about
spelling, character sets, and the like.

> Also, what is good or bad is also subjective. It may also 
> vary from one script to another.

Yes, but what is mandatory and what is not in Unicode should not be too
subjective, or else we could not call it a "standard".

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Marco Cimarosti
Kent Karlsson wrote:
> Keyur Shroff wrote
> [...]
> > In Indic scripts any sign that appear in text not in 
> > conjunction with a
> > valid consonant base may be rendered with dotted circle as fallback
> > mechanism (Section 5.14 "Rendering Nonspacing Marks"
> > http://www.unicode.org/uni2book/ch05.pdf).
> 
> I don't know where you find support for that position in that text.
> Can you please quote?  There are no "invalid base consonants" for
> any dependent vowel (for Indic scripts; similarly for any 
> other script).

Actually, there is a mention of displaying combining marks on dotted
circles:

"Several methods are available to deal with an unknown composed
character sequence that is outside of a fixed, renderable set [...]. One
method (Show Hidden) indicates the inability to draw the sequence by drawing
the base character first and then rendering the nonspacing mark as an
individual unit - with the nonspacing mark positioned on a dotted circle."
(The Unicode Standard 3.0, page 120 - 5.14 Rendering Nonspacing Marks -
Fallback Rendering)

I add that this is a good way of displaying a combining mark that has no
base character, i.e. one occurring at the beginning of a line or paragraph.

However, I totally agree with Kent that this funny rendering is *not* a
requirement of the Unicode standard, as Keyur Shroff seems to suggest. It is
just an example of many "several methods [that] are available to deal with"
strange sequences.

> > Any system implementing this as
> > default behaviour should not be considered buggy.
> 
> Indeed they are.  And it should certainly not be default behaviour.

In this case, I disagree with Kent: displaying these dotted circles is not
mandatory, but certainly not a bug.

> Any combining characters can be placed on any base characters without
> there being any dotted circles displayed.

True. But notice that Kent (against his own opinion) correctly wrote "can",
not "must".

> [...]

_ Marco




RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Christopher John Fynn wrote:
> I had thought that the argument for including KSSA as a seperate
> character in the Tibetan block (rather than only having U+0F40 and 
> U+0FB5) was originally for compatibility / cross mapping with 
> Devanagari and other Indic scripts.  

Which is not a valid reason either, considering that U+0F69 and the
combination U+0F40 U+0FB5 are *canonically* equivalent. This means that
normalizing applications are not allowed to treat U+0F69 differently from
U+0F40 U+0FB5, including displaying them differently or mapping them
differently to something else.

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Marco Cimarosti
Keyur Shroff wrote:
> But sometimes a user may want visual representation of these 
> symbols in two different ways: with dotted circle and
> without dotted circle.

Why not use a dotted circle character explicitly, when you want to see one?

> Example of
> this could be RAsup on top of dotted circle and RAsup on top of space
> character. Current use of space character to eliminate dotted 
> circle is really painful and may create problems in determining 
> language and syllable boundaries.

Languages or syllable boundaries have nothing to do with this. These special
sequences should *never* be part of any syllable or word in any language:
they are just a way of showing the shape of a glyph, to be used when, e.g.,
talking about typography or spelling.

> The main problem with space character is that unlike
> ZWJ/ZWNJ/Dotted Circle, it falls within the range of other 
> important script "Latin". 

Plain wrong! White-space characters and punctuation do not belong to any
script: characters such as " ", "!" and "?" are used for many scripts and
languages. Even the "danda" punctuation, which is in the Devanagari range,
does not belong to Devanagari: it is also used for other Indic scripts.

> Use of INV character in one shot can solve all these
> problems. We can put it in "consonant" class which
> can help text processing applications. [...]

How can calling a "consonant" something which has nothing to do with
consonants help anybody doing anything?

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Marco Cimarosti
Keyur Shroff wrote:
> In the FAQ
>http://www.unicode.org/faq/indic.html#16
> 
> It is mentioned that following are equivalent
> 
> ISCII Unicode
> KA halant INV KA virama ZWJ
> RA halant INV RAsup (i.e., repha)

The last line is really bizarre! I would agree that it is plain wrong...

What is supposed to appear in column "Unicode" is the Unicode *encoding*
equivalent to the sequence in the "ISCII" column. But "RAsup (i.e.,
repha)" is the description of a *glyph*.

> In fact there is no way in Unicode to produce RAsup directly, 
> i.e., without using base consonant. [...]

I agree. This issue has been raised several times, and several viable
solutions have been proposed, but I don't remember that Unicode "officials"
ever showed to even acknowledge the problem.

But probably this has been noted down and discussed. I hope to see an
official solution in TUS 4.0.

> SUGGESTION-3:
> 
> Use of SPACE character as consonant may create problem for 
> state machine which finds language/syllable boundary.
> In fact we need a codepoint for one invisible consonant
> (similar to INV in ISCII) in Unicode which can solve
> this problem with Unicode.
> 
> After inclusion of INV character the following can be recommended.
> 
> ISCII Unicode
> KA halant INV KA virama INV
> RA halant INV RA virama INV (i.e., repha)
> INV halant RA INV virama RA (RAsub)

Why not represent INV with a double ZWJ? E.g.:

ISCII Unicode
KA halant INV KA virama ZWJ ZWJ
RA halant INV RA virama ZWJ ZWJ (i.e., repha)
INV halant RA ZWJ ZWJ virama RA (RAsub)

This has the advantage that the most common sequences will work OK also on
old display engines implemented *before* the double-ZWJ convention is
introduced.

E.g., sequence "KA virama ZWJ ZWJ" works well also on an old engine, for the
simple reason that the first ZWJ is enough to do the work, and  the second
ZWJ is invisible.

Of course, an old engine will still display a  for , but that is not worse than displaying  followed by a
white box, which is what would happen with your new INV character.

_ Marco




RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Aditya Gokhale wrote:
> Hello Everybody,
> I had few query regarding representation of Devanagari 
> script in Unicode

All your questions are FAQs, so I'll just reference the entries which
answer them.

> (Code page - 0x0900 - 0x097F). Devanagari is a writing 
> script, is used in Hindi, Marathi and Sanskrit languages. I 
> have following questions - 

Unicode has no code pages:
http://www.unicode.org/faq/basic_q.html#18

> 1. In Marathi and Sanskrit language two characters glyphs of 
> 'la' and 'sha' are represented differently as shown in the 
> image below - 
>  (First glyph is 'la' and second one is 'sha')
> as compared to Hindi where these character glyphs are 
> represented as shown in the image below - 
> (First glyph is 'la' and second one is 'sha')

Unicode encodes (abstract) characters, not glyphs:
http://www.unicode.org/faq/han_cjk.html#3

(This FAQ is in the Chinese/Japanese/Korean section because it is more often
raised for Chinese ideograms.)

> In the same script code page, how do I use these two 
> different Glyphs, to represent the same character ? Is there 
> any way by which I can do it in an Open type font and Free 
> type font implementation ?

Unicode's requirements for fonts:
http://www.unicode.org/faq/font_keyboard.html#1

A few links to OpenType stuff:
http://www.unicode.org/faq/font_keyboard.html#4

> 2. Implementation Query - 
> In an implementation where I need to send / process 
> Hindi, Marathi and Sanskrit data, how do I differentiate 
> between languages (Hindi, Marathi and Sanskrit). Say for 
> example, I am writing a translation engine, and I want to 
> translate a document having Hindi, Marathi and Sanskrit Text 
> in it, how do I know from the code points between 0x0900 and 
> 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

What you need here is some sort of language tagging:
http://www.unicode.org/faq/languagetagging.html

> I would suggest that we should give different code pages 
> for Marathi, Hindi and Sanskrit. May be current code page of 
> Devanagari can be traded as Hindi and two new code pages for 
> Marathi and Sanskrit be added. This could solve these issues. 
> If there is any better way of solving this, any one suggest.

Characters are encoded "per script", not "per language":
http://www.unicode.org/faq/basic_q.html#17

> 3. Character codes for jna, shra, ksh - 
> 
> In Sanskrit and Marathi jna, shra and ksh are considered as 
> separate characters and not ligatures. How do we take care of 
> this ? Can I get over all views on the matter from the group 
> ? In my opinion they should be given different code points in 
> the specific language code page.
> Please find below the character glyphs - 

Unicode encodes Indic analytically:
http://www.unicode.org/faq/indic.html#17

> thanks,

For more details about Devanagari in Unicode, see Chapter 9 of the Standard:
http://www.unicode.org/uni2book/ch09.pdf

_ Marco




RE: unicode for Japanese/Chinese web sites w/forms

2003-01-23 Thread Marco Cimarosti
Eric wrote:
> [...] The sites utilize forms and my current
> programmers cannot code these two sites for form
> uploads with Japanese and Chinese text.

This is a bit generic, and I can't imagine how Unicode could possibly
conflict with HTML forms.

Can't you put on-line a sample Chinese or Japanese page with Unicode and
forms and explain what exactly is wrong with it?

_ Marco





RE: Unicode Standards for Indic Scripts

2003-01-08 Thread Marco Cimarosti
Michael Everson wrote:
> At 06:43 + 2003-01-08, Manoj Jain wrote:
> >Dear Friends,
> >
> >The existing Unicode Standards for Indic scripts have some 
> discrepancies.
> 
> What does that mean? What are "discrepancies"? Can you summarize? I 
> will download the very large files, which will take some time, [...]

As you probably discovered by now, the three files are just three issues of
a *NEWSLETTER* by the Ministry of Communications & Information Technology
of India.

I downloaded the first two issues, but I see no mention of any formal
proposal to Unicode. Perhaps it's in the third one.

Amongst many other pieces of news, the newsletter reports (in a short notice
on page 7 of the 2nd issue) that the MAIT Consortium on Language Technology
met in September 2001, and recommended that the Unicode "font layout" for
some Indic scripts be modified. Representatives of the Ministry of
Communications & Information Technology, who were present at the meeting,
promised to "take up these recommendations with the Unicode Consortium".

_ Marco




[Very OT] "You're with me always..."

2003-01-08 Thread Marco Cimarosti
Sorry for this small OT. Anyone wishing to contribute a translation to the
next I-can-eat-glass-like project?

"Te tengo conmigo siempre...
con tu nombre en mil idiomas.
En cada nota,
en cada arpegio,
en cada aroma."

http://groups.google.com/groups?&threadm=avg1e5%24eql6d%241%40ID-155044.news
.dfncis.de&prev=/groups%3Fhl%3Den%26safe%3Doff%26group%3Deuropa.linguas

Please e-mail the author directly: mailto:[EMAIL PROTECTED]

_ Marco




Status of Unihan Mandarin readings?

2002-12-19 Thread Marco Cimarosti
I have tried to follow the discussion about the errors in field "kMandarin"
of file "Unihan.txt" but, after a while, I lost my way with all those
dictionary references...

Could someone kindly make a short summary of the situation? Here are my
biggest ???'s:

- Are the errors really there?
- Any estimate as to how many entries are affected?
- Is only kMandarin affected, or other fields too?
- Any estimate for when it will be possible to publish a fixed version?
- Any suggestion for interim work-arounds (e.g., an older version of the
file, an alternative source)?

Thanks in advance.

_ Marco




RE: Precomposed Tibetan

2002-12-19 Thread Marco Cimarosti
Thomas Chan wrote:
> [...]  We don't necessarily want to be making
> vendor/legacy/font-based to unicode mapping tables for every potential
> vendor, do we?

No, of course -- unless that is seen as a necessary counter-move to block a
proposal that would crash the architecture of a script's encoding. :-)

_ Marco




RE: Precomposed Tibetan

2002-12-18 Thread Marco Cimarosti
Andrew C. West wrote:
> On Wed, 18 Dec 2002 04:59:08 -0800 (PST), Marco Cimarosti wrote:
> 
> > Do you have the relevant data?  As I said, so far I found 
> little or nothing
> > about "BrdaRten" or about the "Founders System" mentioned 
> by Ken Whistler.
> 
> Don't need anything more than the code charts given in 
> n2558.pdf - it's simply a
> matter of typing the glyphs in as Unicode (luckily my 
> BabelPad will let me do
> that), and then converting into tabular form. It's really no 
> more difficult than
> decomposing U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) into 
> U+0075 (LATIN
> SMALL LETTER U) and U+00A8 (DIAERESIS).

I am afraid you didn't catch my idea: I was talking about conversion
tables/tools for that Tibetan extension to *GB2312*, which was mentioned by
Ken Whistler. It is this encoding which, according to the author of
n2558.pdf, is used for "gigabytes of data".

The Unicode precomposed characters don't yet exist (and never will,
hopefully), so there is no converting them from/to anything else.

Ciao.
Marco




RE: Precomposed Tibetan

2002-12-18 Thread Marco Cimarosti
Andrew C. West wrote:
> If anyone thinks that a mapping table would be
> useful as a weapon in the fight against the Chinese proposal, 
> I would be happy to provide one.

Do you have the relevant data?  As I said, so far I found little or nothing
about "BrdaRten" or about the "Founders System" mentioned by Ken Whistler.

_ Marco




RE: Precomposed Ethiopic (Was: Precomposed Tibetan)

2002-12-18 Thread Marco Cimarosti
John Hudson wrote:
> The Ethiopic script is *not* made up of sub-syllabic units: 
> the syllable is 
> the minimum unit of writing. The same is true to Yi and the Canadian 
> Aboriginal Syllabics. The fact that Ethiopic has recently been input 
> phonetically should not lead to confusion about the inherent 
> nature of the 
> script, which is not generative.

I beg to differ here.

Ethiopic is a typical "syllabic alphabet" as Indic scripts are. Better said:
Ethiopic is the MOST typical of ALL syllabic alphabets. Basically, it is a
consonantal alphabet (historically, the southern Arabic alphabet) with
mandatory vowel marks added.

Some of the vowel marks have graphically merged with the base letter in such
a way that, today, it is hard to take them apart. However, this is also true
for many Indic scripts (e.g., see the complex ligatures formed between Tamil
consonants and matras), and it did not prevent encoding consonants and
vowels as separate abstract characters.

I said above that Ethiopic is the most typical of all syllabic alphabets
because most scholars use the name of the first four Ethiopic *consonants*
(a-bu-gi-da) as their term for "syllabic alphabet".

Canadian syllabics is a strange and unique system; however, it resembles an
abugida more than a pure syllabary: the peculiarity is that vowel "marks"
are represented by the rotation of consonant letters, and the "virama" by a
change in size.

Of the three scripts you mentioned, only (modern) Yi is a genuine syllabary,
in the same sense as Japanese kana or Linear B are.

Notice that I am only commenting on the graphological nature of Ethiopic.
Whether it was more appropriate to encode Ethiopic in the form of
precomposed syllables or in the form of consonants plus vocalic modifiers,
is an engineering choice about which I don't have a definite opinion. Both
approaches have their pros and cons.

_ Marco




RE: Precomposed Tibetan

2002-12-17 Thread Marco Cimarosti
Michael Everson wrote:
> What the encoding of a set of brDa rTen precomposed syllables would 
> do would be to restrict the Tibetans to this set, to which they have 
> been restricted by the proprietary Founder software used in China. 
> These 950 syllables are insufficient to express anything but 
> newspaper and bureaucratic Tibetan.

I totally agree. My point was a different one: if it is true that there is a
large existing corpus in that encoding, a prerequisite to rejecting the
proposal is demonstrating that the path to Unicode is smooth, with no risk
to the data and no unsustainable costs.

_ Marco




RE: Precomposed Tibetan

2002-12-17 Thread Marco Cimarosti
Carl W. Brown wrote:
> Marco,
> 
> I was disappointed that Unicode used precomposed encoding for 
> Ethiopic.  

Was that my fault? I'm not even a member of Unicode!

_ Marco :-)




RE: Precomposed Tibetan

2002-12-17 Thread Marco Cimarosti
Jungshik Shin wrote:
> [...]
> > http://std.dkuug.dk/jtc1/sc2/WG2/docs/n2558.pdf
> [...]
> 
>  Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
> ATSUI, and Graphite support them if there are opentype Tibetan fonts?
> In addition to the principle of character encoding, the best practical
> counterargument would come from a demonstration that Unicode encoding
> model for Tibetan script does work in practice.

Another key point, IMHO, is verifying the following claim contained in the
proposal document:

"Tibetan BrdaRten characters are structure-stable characters widely
used in education, publication, classics documentation including Tibetan
medicine. The electronic data containing BrdaRten characters are
estimated beyond billions. Once the Tibetan BrdaRten characters are encoded
in BMP, many current systems supporting ISO/IEC10646 will enable Tibetan
processing without major modification. Therefore, the international standard
Tibetan BrdaRten characters will speed up the standardization and
digitalization of Tibetan information, keep the consistency of
implementation level of Tibetan and other scripts, develop the Tibetan
culture and make the Tibetan culture resources shared by the world."  [BTW,
billions of what!?]

If the claim proves to be false, well... But if it is true (or even if it is
not but someone insists it is), I think that it is necessary to
*demonstrate* the possibility and convenience of alternative solutions.

I'd propose the following:

1. Find all the available technical details about this BrdaRten
encoding.

2. Come up with a precise machine-readable mapping file between
BrdaRten encoding to *decomposed* Unicode Tibetan, possibly accompanied by a
sample conversion application.
Reasons: (a) to make it easy to migrate BrdaRten legacy data to
Unicode; (b) to easily update existing BrdaRten applications to export
Unicode text; (c) to easily retrofit new Unicode applications to import
BrdaRten text.

3. (The opposite of point 2) come up with a precise machine-readable
mapping file between *decomposed* Unicode Tibetan and BrdaRten encoding,
possibly accompanied by a sample conversion application.
Reasons: (a) to make it easy to recycle precomposed glyphs from
existing BrdaRten fonts into modern "smart fonts"; (b) to easily update
existing BrdaRten applications to import Unicode text; (c) to easily
retrofit new Unicode applications to export BrdaRten text.
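
To make point 2 concrete, here is a minimal sketch of what a mapping-driven
converter could look like (in Python; the single table entry is a
placeholder, since the actual BrdaRten mapping data is precisely what still
needs to be collected):

    # Sketch of a mapping-file-driven converter (point 2): BrdaRten code
    # points to decomposed Unicode Tibetan. The entry below is a
    # placeholder, not real BrdaRten data.
    BRDARTEN_TO_UNICODE = {
        0xF300: "\u0F40\u0FB1",  # hypothetical syllable -> KA + subjoined YA
    }

    def brdarten_to_unicode(codes):
        # Unmapped codes become U+FFFD, so conversion losses stay visible.
        return "".join(BRDARTEN_TO_UNICODE.get(c, "\uFFFD") for c in codes)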

_ Marco




RE: converting devanagari to mangal unicode

2002-12-17 Thread Marco Cimarosti
Bob Hallissy wrote:
> NB: One of the complexities you may run into, and which will limit your
> options, is that your encoding may store text in a different order than
> Unicode requires. If this is the case, TECkit can do the rearrangement for
> you but I'm not sure ICU will easily do that. Certainly the current
> standard for XML-based descriptions of encoding mappings as given in
> Unicode Technical Report 22 (see
> http://www.unicode.org/unicode/reports/tr22/ ) cannot express such
> mappings.

Someone recently pointed out to me that UTR#22 can indeed implement Indic
visual-to-logical mappings, provided that one chooses the whole Indic
"syllable" as a mapping unit: e.g., a single assignment element along the
lines of

   <a b="..." u="0915 094D 0937 093F"/>

mapping all the bytes of a visual-order syllable (left elided here) to its
logical-order code points.
Of course, this requires very big tables, which could be avoided using a
smarter mechanism. Moreover, it only works with well-formed sequences in an
anticipated set of languages, but fails with misspellings or new
orthographies.

_ Marco




RE: converting devanagari to mangal unicode

2002-12-17 Thread Marco Cimarosti
John Hudson wrote:
> At 03:09 PM 12/16/2002, Eric Muller wrote:
> 
> >>In order to convert any Devanagari font to be rendered in 
> the same way,
> >
> >May be Sunil is just asking for a conversion of data, 
> presumably from 
> >ISCII to Unicode.
> 
> Ah, yes, this is possible. I'm so used to people asking the 
> other question 
> that I assumed from the slightly mixed up references in the 
> question that this was what Sunil intended.

OK, this is my interpretation of Sunil's question: He has text data encoded
in a so-called "font encoding" (e.g. "Shusha"), and he needs to convert it
to Unicode.

The Linux Technology Development for Indian Languages
(http://www.cse.iitk.ac.in/users/isciig/) has two ongoing projects for
similar conversions:

- iconverter
(http://www.cse.iitk.ac.in/users/isciig/iconverter/main.html)
- ISSCIIlib
(http://www.cse.iitk.ac.in/users/isciig/isciilib/main.html)

_ Marco




RE: Farsi Keheh +06A9 vs. Arabic Kaf +0643 ??

2002-12-12 Thread Marco Cimarosti
Miikka-Markus Alhonen wrote:
> Lainaus Marco Cimarosti <[EMAIL PROTECTED]>:
> > 
> > These made me wonder about a couple of Unicode disunifications:
> > 
> > - U+0643 (ARABIC LETTER KAF) vs. U+06A9 (ARABIC LETTER 
> KEHEH) vs. U+06AA
> > (ARABIC LETTER SWASH KAF);
> 
> "Keheh" vs. "swash Kaf" seems to be contrastive in Sindhi. See
> ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex/doc/arabdoc.pdf
> page 49's transliteration table, where "k" produces a swash kaf and
> "kh" a keheh.

Correct, I apologize. This is confirmed by other online sources:

http://www.geocities.com/sindhnj/sindhialpha.html (rightmost column, 3rd row
from bottom);
http://www.sindhiinfo.com/learn/lesson11.asp

_ Marco




RE: Farsi Keheh +06A9 vs. Arabic Kaf +0643 ??

2002-12-12 Thread Marco Cimarosti
Houman Pournasseh wrote:
> The difference between the Arabic Kaf (U+0643) and the Persian Kaf
> (U+06A9) is in its final form. The Arabic Kaf has a Hamza and is
> missing the diagonal line above the glyph.

BTW, the "Persian" final form is also common in some Arabic countries.

I am attending a course in Arabic language, and the Moroccan teacher
corrected my "kaf with hamza" (ك) and asked me to always use the "plain
form" (ک) because, as he said, the hamza form is only "seldom" used in
printing. Moreover, the teacher's preferred form for initial/middle form
resembles Unicode's "swash kaf" (ڪ).

This is also confirmed by our textbook, designed for the first grade of
Moroccan elementary schools, where the "plain form" is used throughout, the
"hamza form" being shown once when letter kaf is first introduced.

He also said that the two dots under the final form of ya are almost always
omitted in non-vowelized text. He wants us to retain them, for now, to
distinguish ya from alif maksura (ى), but only because we are still working
with vowelized text.

Moreover, when he showed us the Arabic shape of digits (as all North
Africans, he normally uses European glyphs), "five" had the typical
"reversed heart" shape of U+0665 (۵).

These made me wonder about a couple of Unicode disunifications:

- U+0643 (ك ARABIC LETTER KAF) vs. U+06A9 (ک ARABIC LETTER KEHEH) vs. U+06AA
(ARABIC LETTER SWASH KAF);

- U+064A (ي ARABIC LETTER YEH) vs. U+06CC (ی ARABIC LETTER FARSI YEH);

- U+0660..U+0669 (٠..٩ ARABIC-INDIC DIGIT ZERO..NINE) vs. U+06F0..U+06F9
(۰..۹ EXTENDED ARABIC-INDIC DIGIT ZERO..NINE).

Shouldn't these rather have been considered font variants?

_ Marco




RE: [OT] HAIKU computer talk

2002-12-05 Thread Marco Cimarosti
Peter Constable wrote:
> On 12/05/2002 02:14:21 AM Joe Becker wrote:
> 
> >Poetry in motion: text elements rendered on sheep
> >
> >http://www.ananova.com/news/story/sm_719935.html
> >http://www.freerepublic.com/focus/news/800101/posts/
> 
> Hmmm... I wonder if I could get money to experiment with monkeys and
> typewriters?

Hmmm... Hmmm... I was thinking: having a certain number of white sheep and
black sheep, and a very clever dog to keep them in position, one could come
up with UTF-Herd!

_Marco




RE: Localized names of character ranges

2002-12-04 Thread Marco Cimarosti
Kenneth Whistler wrote:
> Doug, seconding a suggestion by Marco, wrote:
> 
> > I agree
> > that a multilingual Unicode glossary should be assembled 
> (possibly as a
> > volunteer project) and officially endorsed by the Unicode 
> Consortium, so
> > users and vendors will be on common terminological ground.
> 
> [...]
> Once people start maintaining a multilingual glossary
> based on the online glossary (or supplemented from other
> sources), the burden of maintenance will escalate rapidly
> [...]

Actually, Mark Davis's suggestion was to translate the list of UniData
property names and property values (E.g.: "General category", "Decimal digit
value", "Uppercase letter", "Math symbol", block names, script names, etc.).

These labels should be relatively stable, as they are part of the standard
itself.

I mentioned the glossary only to give examples of terms which may be
difficult to translate, but I would *not* suggest translating the
glossaries: too much ado about nothing. I already tried that with Italian:
all I obtained was a list of Italian words which was just as obscure as the
original English list (or even more obscure, in some cases, as the English
technical terms were often more familiar).

Ciao.
Marco




RE: Devanagari

2002-12-03 Thread Marco Cimarosti
Vipul Garg wrote:
> I have downloaded your font chart for Devanagari, which is in 
> the range from 0900 to 097F. I have also installed the Arial 
> Unicode font supplied by Microsoft office XP suite. I found 
> that not all characters are available for Devanagari. For 
> example letters such as Aadha KA, Aadha KHA, Aadha GA etc. 
> are not available. 
>  
> These letters are required in the devanagari words such as 
> KANYA, NANHA, PARMATMA etc.
>  
> If you could provide the above letters then our requirement 
> for formation of Devanagari words would be possible. This 
> requirement is very crucial as we have a large volume project 
> on Devanagari language involving data storage in Oracle database.
>  
> Would appreciate an early reply.

Please, see document "Where is my character":

http://www.unicode.org/unicode/standard/where/

Also have a look at question 17 in the "Indic" FAQ:

http://www.unicode.org/unicode/faq/indic.html#17

All is explained in more detail in Section 9.1 "Devanagari" of the Unicode
manual:

http://www.unicode.org/unicode/uni2book/ch09.pdf

Regards.
M.C.




RE: Localized names of character ranges

2002-12-03 Thread Marco Cimarosti
Mark Davis wrote:
> While not a trivial task (about 400 terms), it is many, many 
> times easier than translating all the significant character
> names. That might someday be worth considering for the
> Common XML Locale Repository
> (http://oss.software.ibm.com/icu/locale/).

The problem is not the number of terms involved (400 strings is not a big
deal: it corresponds to a small localization project), but rather the utter
idiosyncrasy of Unicode-related terminology.

Terms such as "title case", "caseless", "reordering" or "combining" are
nearly impossible to translate satisfactorily in other languages, and even
simple terms such as "character", "letter" or "ideograph", can be tricky, if
they have to remain distinguished from each other.

I tried to translate the Unicode glossary into Italian myself, but I still
have to find satisfactory translations for several entries, although I
turned to experts in disciplines ranging from typography (I still don't
have a valid equivalent for the "case" in "case sensitive", etc.) to
Hebrew studies (I am still at odds with "cantillation marks").

IMHO, if such an effort is really worth doing, it should be organized and
promoted by the Unicode Consortium itself, rather than in an OEM library
like ICU.

I suggest that the 400-odd property and value names be listed in a text file
on the Unicode FTP site (with each English term well commented and
explained) and that translations be collected on a voluntary basis, as was
done for the "What's Unicode?" text. The copyright on this material should
grant free and unrestricted usage to any implementation such as ICU.

_ Marco




RE: Proposal to add Bengali Khanda Ta

2002-11-29 Thread Marco Cimarosti
Andy White wrote:
> Marco wrote
> > 
> > I have a few questions:
> > 
> > - What is the meaning of "satmaa" and "sadaatmaa"?
> 
> 'satmaa' means stepmother. 'sadaatmaa' means 'good soul' / 'virtuous'

Bingo! Well, nearly... My guess was that "satmaa" was the Bengali for
"Wachstube". :-)

German has two different words spelled "Wachstube" which pose similar
problems, when set in Fraktur. In "Wach(-)stube" ("guards room"), "s" and
"t" should form an "st" ligature, while in "Wachs(-)tube" ("wax tube"), "s"
and "t" should remain separate because they are parts of two different
roots. For Fraktur, the proposed solution is to encode the second case as
"Wachstube".

But unfortunately this cannot work for "satmaa" because of the special Indic
behavior of ZWNJ.

> > - Why is /tmaa/ spelled differently in the two words?
> 
> 'satmaa' has the roots of 'sat' = good  & 'Maa' = mother. As 
> 'Sat' is correctly spelt with a khandaTa under the rules of 
> samaas it becomes 'sat'maa'
> sadaatmaa has the roots 'sat' = good & 'aatma' = soul / 
> spirit, and falls under the rules of sandhi and hence becomes 
> sadaatmaa.
> (aatma is spelt with a tma conjunct).
> 
> > - Does ISCII have a way to distinguish the two cases above 
> > and the other possible combinations? I mean:
> > 1. Ta_Ma_Ligature,
> > 2. Khanda_Ta + Ma,
> > 3. Half_Ta + Ma,
> > 4. Ta + Virama + Ma.
> 
> 1. Ta_Ma_Ligature is simply 'ta virama ma'
> 2. Khanda_Ta + Ma, is 'ta virama virama ma' (equivalent to 
> 'ta virama zwnj ma')
> 3. Half_Ta + Ma is 'ta virama inv ma' (equivalent to 'ta 
> virama zwj ma')
> 4. Ta + Virama + Ma should be 'ta virama virama inv ma' but 
> this is not implemented in the iLeap application I am using!

Cases 1, 2 and 3 are fine. For case 4, personally, I agree that Khanda Ta
needs to be unambiguously encoded.

But does this unambiguous encoding of Khanda Ta necessarily have to be a new
code point in the Bengali block? IMHO, it is possible to define an
unambiguous sequence for Khanda Ta also using existing code points, and
without violating their semantics.

My counter-proposal is:

09A4 + 034F + 09CD
(TA + CGJ + VIRAMA)

CGJ, "Combining Grapheme Joiner", is a (relatively new) zero-width character
which has been introduced to cover some functions that could not be carried
on well by ZWJ.

My idea is that a display engine should unconditionally transform the above
sequence into a Khanda Ta glyph, *before* doing any other glyph
transformation.

This "strong" way of encoding Kanda Ta would anyway not exclude the default
"soft" formation of Khanda Ta at the end of a word, whith the simple
sequence:

09A4 + 09CD
(TA + VIRAMA)
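
A minimal sketch of the pre-shaping pass this implies (in Python; the
internal glyph code is hypothetical, and a real implementation would live
inside the display engine or the font engine):

    TA, CGJ, VIRAMA = "\u09A4", "\u034F", "\u09CD"
    KHANDA_TA_GLYPH = "\uE000"  # stand-in for an engine-internal glyph code

    def preshape_khanda_ta(text):
        # Map the unambiguous "strong" sequence first, so that later
        # shaping rules (half forms, conjuncts) never see the TA + VIRAMA
        # pair hidden inside it.
        return text.replace(TA + CGJ + VIRAMA, KHANDA_TA_GLYPH)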

The reasons for proposing such a (relatively) complicated solution as
opposed to the simpler solution of adding a new code point are:

- To keep a certain compatibility with existing display engines. Upon the
sequence <09A4 + 034F + 09CD>, an old display engine would display something
odd; however, the text should stay *readable*.

- To keep a good compatibility with existing non-visual software. All code
which searches or compares text should already know what to do with CGJ:
ignore it.

- To try and keep the architecture of the Bengali block in sync with the
other Indic blocks, because this helps implementers re-use code.

I have summarized my counter-proposal in the attached picture. Comments? Can
it work? Is it possible to implement it in, e.g., OpenType fonts?

_ Marco


<>

RE: Proposal to add Bengali Khanda Ta

2002-11-29 Thread Marco Cimarosti
Andy White wrote:
> Please see and comment on my Proposal to add 'Bengali Letter 
> Khanda Ta' to the Bengali Block (initial version): 
> http://www.exnet.btinternet.co.uk/KhandaWeb/khandaproposal.htm

| [...]
| This example shows that in order to display the correct
| spelling of the word 'satmaa', the Ta_Ma needs to be
| displayed as Khanda Ta+Ma, and in order to display the
| correct form of the word 'sadaatmaa' it needs to be
| displayed as a Ta_Ma_conjunct.
| [...]

I have a few questions:

- What is the meaning of "satmaa" and "sadaatmaa"?

- Why is /tmaa/ spelled differently in the two words?

- Does ISCII have a way to distinguish the two cases above and the other
possible combinations? I mean:
1. Ta_Ma_Ligature,
2. Khanda_Ta + Ma,
3. Half_Ta + Ma,
4. Ta + Virama + Ma.

Dhanyabad!

_ Marco




RE: UTF-Morse

2002-11-22 Thread Marco Cimarosti
Otto Stolz wrote:
> Marco, you shall be called "Marcone", or even (granting
> a Pluralis majestatis): "Marconi" ;-)

Hey! I have a little bit of a belly, but not yet enough to justify calling
me "Marcone". :-)

BTW, your careful analysis of Morse needing four code units made me think
that there could be a "digital Morse", where each code unit takes up two
bits. E.g.:

00: letter gap
01: dot
10: dash
11: word gap

Up to four code units fit in each octet. E.g., the word "MORSE" would
become:

Morse: -- --- .-. ... .
Bits:  10100010 10100001 10010001 01010001
Hex:   0xA2 0xA1 0x91 0x51

Wow! One octet less than ASCII! :-)
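
A quick sketch of this packing in Python (Morse table abridged to the
letters needed; padding the final octet with zero bits is my assumption,
though "MORSE" happens to need none):

    MORSE = {"M": "--", "O": "---", "R": ".-.", "S": "...", "E": "."}
    UNIT = {".": "01", "-": "10"}

    def pack(word):
        # Two bits per dot/dash, a letter gap (00) between letters.
        bits = "00".join("".join(UNIT[u] for u in MORSE[c]) for c in word)
        bits += "0" * (-len(bits) % 8)  # pad to a whole octet
        return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

    # pack("MORSE").hex() == "a2a19151", i.e. 0xA2 0xA1 0x91 0x51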

_ Marco




XTF-Morse (was RE: UTF-Morse)

2002-11-22 Thread Marco Cimarosti
Doug Ewell wrote:
> Yes, it's true.  Marco had sent me his UTF-Morse proposal just
> yesterday, along with a suggestion that I put together an 
> implementation for April Fool's Day.  And darned if I wasn't
> really going to do it.  As a JOKE.
> 
> But Marco, you need to check your invented sequences again.  
> The leading and trailing Morse code units for the
> (non-ASCII) multi-Morse characters conflict with some of the
> single-unit characters.  For example, U+002D -- looks like
> a leading unit, and U+0023 .-.-.. looks like a trailing unit.

--- --- --- --- --- ..--.. ...

Sorry! Not only do I use everybody's bandwidth for April fools in advance: I
also get all the details wrong!

I attempted to simplify the wording while translating into English, and I
messed everything up. So now I have to use more bandwidth to send a
corrected version.

> (It's only a JOKE, guys.  Take a breath.)

BTW I recalled that, some time ago, the aficionados of fictional UTF's on
this list decided to call their creations "XTF's", in order to minimize the
possibility of confusion with real UTF's.

So, everybody reading this message now or in the years to come, please take
notice that XTF-Morse is *not* a UTF: just an aborted April fool! So please
don't knock at the Unicode Consortium's door asking for the latest version
of the specs for sending Unicode in Morse!

_ Marco


==
XTF-Morse [*] - "Bringing Unicode in the telegraph age!"


--
0. Terminology

In this document, the following special terms are used:

- "Morse Dot": a short Morse signal; represented with "." in this
  document.

- "Morse Dash": a long Morse signal; represented with "-" in this
  document.

- "Morse Symbol": a sequence of one or more Dots, constituting a
  Morse character such as a letter or a punctuation mark.

- "Morse Pause": a short pause which separates adjacent
  Morse symbols; represented with " " (a space) in this document.

- "Morse Space": a long pause which separates words; represented
  with "/" in this document.

- "Morse Oct": a special Morse Symbol representing three bits of
  an Unicode code point.


--
1. Encoding characters in the "ASCII printable" range.

Each Unicode character in the range U+0020..U+007E is encoded as a Morse
Space, as a single Morse Symbol, or as a sequence of two Morse
Symbols, as specified in the following table:

Code:  XTF-Morse:  Character name:
-- --- --
U+0020 /   SPACE (Morse Space)
U+0021 -.  EXCLAMATION MARK [1]
U+0022 .-..-.  QUOTATION MARK
U+0023 .-.-..  NUMBER SIGN [1]
U+0024 ..-...  DOLLAR SIGN [1]
U+0025 ..-..-  PERCENT SIGN [1]
U+0026 ..-.-.  AMPERSAND [1]
U+0027 .----.  APOSTROPHE
U+0028 -.--.-  LEFT PARENTHESIS
U+0029 -.---.  RIGHT PARENTHESIS [1]
U+002A -.  ASTERISK [1]
U+002B --  PLUS SIGN [1]
U+002C --..--  COMMA
U+002D -....-  HYPHEN-MINUS
U+002E .-.-.-  FULL STOP
U+002F -..-.   SOLIDUS [1]
U+0030 -----   DIGIT ZERO
U+0031 .----   DIGIT ONE
U+0032 ..---   DIGIT TWO
U+0033 ...--   DIGIT THREE
U+0034 ....-   DIGIT FOUR
U+0035 .....   DIGIT FIVE
U+0036 -....   DIGIT SIX
U+0037 --...   DIGIT SEVEN
U+0038 ---..   DIGIT EIGHT
U+0039 ----.   DIGIT NINE
U+003A ---...  COLON
U+003B ---..-  SEMICOLON [1]
U+003C ---.-.  LESS-THAN SIGN [1]
U+003D ..  EQUALS SIGN [1]
U+003E ---.--  GREATER-THAN SIGN [1]
U+003F ..--..  QUESTION MARK
U+0040 -.-.-.  COMMERCIAL AT [1]
U+0041 ..-- .- LATIN CAPITAL LETTER A [2]
U+0042 ..-- -...   LATIN CAPITAL LETTER B [2]
U+0043 ..-- -.-.   LATIN CAPITAL LETTER C [2]
U+0044 ..-- -..LATIN CAPITAL LETTER D [2]
U+0045 ..-- .  LATIN CAPITAL LETTER E [2]
U+0046 ..-- ..-.   LATIN CAPITAL LETTER F [2]
U+0047 ..-- --.LATIN CAPITAL LETTER G [2]
U+0048 ..-- ....   LATIN CAPITAL LETTER H [2]
U+0049 ..-- .. LATIN CAPITAL LETTER I [2]
U+004A ..-- .---   LATIN CAPITAL LETTER J [2]
U+004B ..-- -.-LATIN CAPITAL LETTER K [2]
U+004C ..-- .-..   LATIN CAPITAL LETTER L [2]
U+004D ..-- -- LATIN CAPITAL LETTER M [2]
U+004E ..-- -. LATIN CAPITAL LETTER N [2]
U+004F ..-- ---LATIN CAPITAL LETTER O [2]
U+0050 ..-- .--.   LATIN CAPITAL LETTER P [2]
U+0051 ..-- --.-   LATIN CAPITAL LETTER Q [2]
U+0052 ..-- .-.LATIN CAPITAL LETTER R [2]
U+0053 ..-- ...LATIN CAPITAL LETTER S [2]
U+0054 ..-- -  LATIN CAPITAL LETTER T [2]
U+0055 ..-- ..-LATIN CAPITAL LETTER U [2]
U+0056 ..-- ...-   LATIN CAPITAL LETTER V [2]
U+0057 ..-- .--LATIN CAPITAL LETTER W [2]
U+0058 ..-- -..-   LATIN CAPITAL LETTER X [2]
U+0059 ..-- -.--   LATIN CAPITAL LETTER Y [2]
U+005A ..-- --..   LATIN CAPITAL LETTER Z [2]
U+005B ..---.  LEFT SQUARE BRACKET [1]
U+005C .-  REVERSE SOLIDUS [1]
U+005D ..  RIGHT SQUARE BRACKET [1]

FW: Re: Errors in the Indic FAQ

2002-11-21 Thread Marco Cimarosti
The following message was sent by mistake to "[EMAIL PROTECTED]".

(Notice to Mr. Mitra: "[EMAIL PROTECTED]" is only an archive copy of
the Unicode List. To join the real list, see
"http://www.unicode.org/unicode/consortium/distlist.html";).

_ Marco



-Original Message-

Message: 15
   Date: Thu, 21 Nov 2002 06:10:16 -
   From: "Anirban Mitra" <[EMAIL PROTECTED]>
Subject: Re: Errors in the Indic FAQ

All these problems of A-Jophola-Aakaar and Ra-Jophola-Aakaar could
have been avoided had the Unicode Consortium agreed to code A-Jophola-
Aakaar as a separate letter corresponding to candra-E in Devanagari,
and Jophola-Aakaar as its matra equivalent. They are used for identical
sounds, and Devanagari Candra-E is always transliterated as Jophola-
Aakaar in Bengali. Although ISCII-91 did not code this letter,
implementers of ISCII, like iLeap of C-DAC, considered A-Jophola-Aakaar
a separate modern Bengali vowel and placed it in the Bengali Inscript
keyboard in the position corresponding to Candra-E of Devanagari.
E-Jophola-Aakaar is merely a typographical alternate form sometimes
used interchangeably.
Regarding Khandata, I would like to point out that in ISCII-compatible
programs (like Apex Language Processor or iLeap) it is considered the
explicit halant form of ta, which is equivalent to ta-hasanta-zwnj in
Unicode. As Unicode claims to be a superset of ISCII, considering
Khandata the halant form of ta would be logical, allowing backward
compatibility with ISCII. Moreover, words like akassmaat ("suddenly"),
in which Khandata is used as the terminal letter corresponding to
halant-ta in Devanagari, show its actual status. It is quite ridiculous
to think that a word ends with a half form (which the present Unicode
recommendations would make us believe). In the rare cases when we need
to show ta-hasanta as an isolated form within a word, we can use the
Ta-Virama-zwj-zwnj combination. (See the graphical illustration at
www.geocities.com/mitra_anirban/khandata.jpg .)
Another problem area in ISCII-Unicode conversion for Bengali (as well
as Oriya) is Ya-nukta. ISCII-91 says that Ya (U+092F) in Devanagari is
equivalent to Ontostho-A (YYA, U+09DF) in Bengali, and that Ya-nukta
(U+095F) in Devanagari is equivalent to Ontostho-Ja (coded as YA,
U+09AF, in Unicode). So, while transliterating Devanagari text to
Bengali through ISCII-91 (which is one of the stated purposes of the
code), the letters get interchanged, causing improper rendering.




UTF-Morse (was RE: Morse coded Unicode (was: Morse code))

2002-11-21 Thread Marco Cimarosti
Carl W. Brown wrote:
> I think that the bigger issue might be how do you extend Morse code to
> incorporate the Unicode character set.
> [...]

Carl, this is unfair!! You spoiled my April 1st joke in mid-November!

Ciao.
Marco :-)



--
UTF-Morse - "Bringing Unicode in the telegraph age!"


1. Unicode characters U+0020..U+007E are encoded according to the
following table:

Code:  UTF-Morse:  Character name:
-- --- --
U+0020 /   SPACE
U+0021 -.  EXCLAMATION MARK [1]
U+0022 .-..-.  QUOTATION MARK
U+0023 .-.-..  NUMBER SIGN [1]
U+0024 ..-...  DOLLAR SIGN [1]
U+0025 ..-..-  PERCENT SIGN [1]
U+0026 ..-.-.  AMPERSAND [1]
U+0027 .----.  APOSTROPHE
U+0028 -.--.-  LEFT PARENTHESIS
U+0029 -.---.  RIGHT PARENTHESIS [1]
U+002A -.  ASTERISK [1]
U+002B --  PLUS SIGN [1]
U+002C --..--  COMMA
U+002D -....-  HYPHEN-MINUS
U+002E .-.-.-  FULL STOP
U+002F -..-.   SOLIDUS [1]
U+0030 -----   DIGIT ZERO
U+0031 .----   DIGIT ONE
U+0032 ..---   DIGIT TWO
U+0033 ...--   DIGIT THREE
U+0034 ....-   DIGIT FOUR
U+0035 .....   DIGIT FIVE
U+0036 -....   DIGIT SIX
U+0037 --...   DIGIT SEVEN
U+0038 ---..   DIGIT EIGHT
U+0039 ----.   DIGIT NINE
U+003A ---...  COLON
U+003B ---..-  SEMICOLON [1]
U+003C ---.-.  LESS-THAN SIGN [1]
U+003D ..  EQUALS SIGN [1]
U+003E ---.--  GREATER-THAN SIGN [1]
U+003F ..--..  QUESTION MARK
U+0040 -.-.-.  COMMERCIAL AT [1]
U+0041 ..-- .- LATIN CAPITAL LETTER A [2]
U+0042 ..-- -...   LATIN CAPITAL LETTER B [2]
U+0043 ..-- -.-.   LATIN CAPITAL LETTER C [2]
U+0044 ..-- -..LATIN CAPITAL LETTER D [2]
U+0045 ..-- .  LATIN CAPITAL LETTER E [2]
U+0046 ..-- ..-.   LATIN CAPITAL LETTER F [2]
U+0047 ..-- --.LATIN CAPITAL LETTER G [2]
U+0048 ..-- ....   LATIN CAPITAL LETTER H [2]
U+0049 ..-- .. LATIN CAPITAL LETTER I [2]
U+004A ..-- .---   LATIN CAPITAL LETTER J [2]
U+004B ..-- -.-LATIN CAPITAL LETTER K [2]
U+004C ..-- .-..   LATIN CAPITAL LETTER L [2]
U+004D ..-- -- LATIN CAPITAL LETTER M [2]
U+004E ..-- -. LATIN CAPITAL LETTER N [2]
U+004F ..-- ---LATIN CAPITAL LETTER O [2]
U+0050 ..-- .--.   LATIN CAPITAL LETTER P [2]
U+0051 ..-- --.-   LATIN CAPITAL LETTER Q [2]
U+0052 ..-- .-.LATIN CAPITAL LETTER R [2]
U+0053 ..-- ...LATIN CAPITAL LETTER S [2]
U+0054 ..-- -  LATIN CAPITAL LETTER T [2]
U+0055 ..-- ..-LATIN CAPITAL LETTER U [2]
U+0056 ..-- ...-   LATIN CAPITAL LETTER V [2]
U+0057 ..-- .--LATIN CAPITAL LETTER W [2]
U+0058 ..-- -..-   LATIN CAPITAL LETTER X [2]
U+0059 ..-- -.--   LATIN CAPITAL LETTER Y [2]
U+005A ..-- --..   LATIN CAPITAL LETTER Z [2]
U+005B ..---.  LEFT SQUARE BRACKET [1]
U+005C .-  REVERSE SOLIDUS [1]
U+005D ..  RIGHT SQUARE BRACKET [1]
U+005E .-...-  CIRCUMFLEX ACCENT [1]
U+005F --  LOW LINE [1]
U+0060 ...---  GRAVE ACCENT [1]
U+0061 .-  LATIN SMALL LETTER A
U+0062 -...LATIN SMALL LETTER B
U+0063 -.-.LATIN SMALL LETTER C
U+0064 -.. LATIN SMALL LETTER D
U+0065 .   LATIN SMALL LETTER E
U+0066 ..-.LATIN SMALL LETTER F
U+0067 --. LATIN SMALL LETTER G
U+0068 ....    LATIN SMALL LETTER H
U+0069 ..  LATIN SMALL LETTER I
U+006A .---LATIN SMALL LETTER J
U+006B -.- LATIN SMALL LETTER K
U+006C .-..LATIN SMALL LETTER L
U+006D --  LATIN SMALL LETTER M
U+006E -.  LATIN SMALL LETTER N
U+006F --- LATIN SMALL LETTER O
U+0070 .--.LATIN SMALL LETTER P
U+0071 --.-LATIN SMALL LETTER Q
U+0072 .-. LATIN SMALL LETTER R
U+0073 ... LATIN SMALL LETTER S
U+0074 -   LATIN SMALL LETTER T
U+0075 ..- LATIN SMALL LETTER U
U+0076 ...-LATIN SMALL LETTER V
U+0077 .-- LATIN SMALL LETTER W
U+0078 -..-LATIN SMALL LETTER X
U+0079 -.--LATIN SMALL LETTER Y
U+007A --..LATIN SMALL LETTER Z
U+007B --.-..  LEFT CURLY BRACKET [1]
U+007C --.--.  VERTICAL LINE [1]
U+007D --.-.-  RIGHT CURLY BRACKET [1]
U+007E --.---  TILDE [1]


2. All other Unicode characters are encoded with one of seven
multi-Morse schemes:

Code range:        Scheme:
-----------------  -------
U+0000..U+0007     1
U+0008..U+001F     2
U+007F..U+01FF     3
U+0200..U+0FFF     4
U+1000..U+7FFF     5
U+8000..U+3FFFF    6
U+40000..U+10FFFF  7
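
(A minimal sketch of the scheme lookup implied by the table, in Python,
assuming the ranges above; scheme n carries 3n bits, one Morse Oct per
3 bits:)

    SCHEMES = [
        (0x0000, 0x0007, 1), (0x0008, 0x001F, 2), (0x007F, 0x01FF, 3),
        (0x0200, 0x0FFF, 4), (0x1000, 0x7FFF, 5), (0x8000, 0x3FFFF, 6),
        (0x40000, 0x10FFFF, 7),
    ]

    def scheme_for(cp):
        for lo, hi, n in SCHEMES:
            if lo <= cp <= hi:
                return n
        raise ValueError("U+0020..U+007E are encoded by the table in section 1")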

Each scheme uses a Morse sequence of the form ".-.yyy", possibly
preceded by one or more Morse sequences of the form "-..yyy":

Scheme Bits (x: 0 or 1): UTF-Morse (y: "." if x is 0, "-" if x is 1):
-- 

1  0xxx  .-.yyy 
2  00xx  -..yyy .-.yyy 
3  000x  -..yyy -..yyy .-.yyy 
4    -..yyy -..yyy -..yyy .-.yyy 
5  00xxx

RE: Morse code

2002-11-19 Thread Marco Cimarosti
Andrew C. West wrote:
> On Tue, 19 Nov 2002 04:41:58 -0800 (PST), Radovan Garabik wrote:
> 
> > Moreover, Morse characters are distinct logical entities, primary
> > representation of them is audible
> 
> Precisely. So for example ..- is pronounced "dot dot dash" 
> (three distinct logical entities) not "u".

It can also be pronounced "dididah" (which, by the way, can be transmitted
as "-.. .. -.. .. -.. .- " (which, by the way, can also be pronounced
"dadidit didit dadidit didit dadidit didah didididit" (which, by the way,
can be transmitted as "-.. .- -.. .. -.. .. . / -.. .. -.. .. . / -.. .- -..
.. -.. .. . / -.. .. -.. .. . / -.. .- -.. .. -.. .. . / -.. .. -.. .- 
/ -.. .. -.. .. -.. .. -.. .. ." (which, by the way, can also be pronounced
"dadidit didah dadidit didit dadidit didit dit dadididadit dadidit didit
dadidit didit dit dadididadit dadidit didah dadidit didit dadidit didit dit
dadididadit dadidit didit dadidit didit dit dadididadit dadidit didah
dadidit didit dadidit didit dit dadididadit dadidit didit dadidit didah
didididit dadididadit dadidit didit dadidit didit dadidit didit dadidit
didit dit" (which, by the way, can be transmitted as "-.. .- -..   
TRANSMISSION INTERRUPTED BY SOURCE 

_ Marco




[OT] Curious about persian page numbers (was RE: ISIRI 6219:2002)

2002-11-19 Thread Marco Cimarosti
Roozbeh Pournader wrote:
> We just got our hand on the published copy of the Iranian National
> Standard
> 
>   ISIRI 6219:2002  Information Technology -- Persian Information 
>   Interchange and Display Mechanism, using Unicode
> 
> It is dated November 2002, and is about viii+33 pages.

I confess that I downloaded the document only because I was curious about
the numbering of the first viii pages...

This is what the page numbers look like:

"الف", "ب", "ت", "ث", "ج"...
("Alif", "B", "T", "Th", "J"...)

I guess that the name of the letter "ا" ("alif") is spelled out in full
to avoid confusion with page "۱" (digit "1").

(Thinking of it, this is probably also the reason why Roman numerals in
page numbers are lowercase: to avoid confusing pages I and II with pages 1
and 11.)

I notice that the letters "م", "ه" and "ع" ("miim", "ha", and "`ayn") also
look similar, respectively, to digits "۲", "۵" and "۴" ("2", "5" and "4").
Are these letters spelled out in full as well?
_ Marco




RE: Errors in the Indic FAQ

2002-11-18 Thread Marco Cimarosti
Andy White wrote:
> A graphical version of this message available here: 
> http://www.exnet.btinternet.co.uk/KhandaWeb/khanda.htm
> 
> It is proposed by the Indic Unicode FAQ that Bengali 
> Khanda_Ta should be encoded as Ta Virama ZWJ ... and that an 
> explicit Ta_Virama can be encoded  as Ta Virama ZWNJ. This 
> information is wrong and must be changed. 

I guess the FAQ is ,
right?

That FAQ is indeed wrong! (And I feel guilty for it: it was inspired by the
preceding FAQ which I submitted about Devanagari and, probably, I was also
asked to double-check it...)

However, IMHO, the only fix needed is deleting this sentence:

"If the sequence U+09A4, U+09CD is not followed by another consonant
letter (such as " ta") it is always displayed as a full ta glyph combined
with the virama glyph."

> First some background facts for the unacquainted.
> Khanda Ta is equivalent to Ta Virama i.e. it is a halant form of Ta.
> Khanda Ta is respected as a separate letter to Ta by Bengalis.
> 
> It is incorrect and nonsensical to place a vowel sign 
> immediately next to a Virama
>
> e.g. the sequence Ta Virama VowelSign.i is wrong. (This 
> sequence implies the rendering, VowelSign.i Ta Virama 
> (VowelSign.i is reordered). This is illogical).
> Therefore, it follows that it is also nonsensical to place a 
> vowel sign immediately after a Khanda Ta (Khanda Ta is 
> equivalent to Ta + Virama.)

This is all true. But where does the FAQ suggest a sequence like ?

> In the Hindi script, you may write the sequence Ka Virama Ta 
> VowelSign.i, and it may be rendered as VowelSign.i followed 
> by a fully legated conjunct.  However if you do not want this 
> fully legated form you may use the sequence Ka Virama ZWJ Ta 
> VowelSign.i and have it rendered as VowelSign.i Half_Ka  Ta
> 
> Now turning to the Bengali example of Ta Virama Ta VowelSign.i
>
> Ta Virama Ta VowelSign.i may be rendered as: VowelSign.i 
> Ta_Ta.fullylegated:
> And going by the FAQ:
> Ta Virama ZWJ Ta VowelSign.i. would be rendered as 
> VowelSign.i._KhandaTa Ta
> But this is clearly wrong, as Kanda Ta has now taken on a 
> vowel sign, which is illegal.

This example would be wrong... But I don't see it in the FAQ.

> What was needed here was a ZWNJ to separate the Ta Virama 
> from the proceeding Ta.
> But according to the FAQ Ta Virama ZWNJ Ta is to be rendered 
> as: Ta_Virama.explicit, Ta (Ta with a visible Virama, Ta).
> Which seems to imply that Ta Virama ZWNJ VowelSign.i would be 
> rendered as: Ta_Virama.explicit,VowelSign.i Ta:

I don't think the FAQ implies this.

In some Indic scripts (e.g., Devanagari), left-side matras reorder around
the whole consonant cluster; in some other scripts (e.g., Tamil, Malayalam),
they reorder around the base consonant only:

Devanagari: Ta Virama ZWNJ Ta MatraI -> MatraI Ta+Virama Ta

Tamil:  Ta Virama (ZWNJ) Ta MatraI -> Ta+Virama MatraI Ta

(Notice that ZWNJ is redundant in Tamil, as the rendering would be identical
without it.)

My assumption is that Bengali, in this respect, behaves like Tamil and
Malayalam.

But this is something which is absolutely not clear from the Unicode Book:
my assumption above is based on discussions on this list, and on
non-Unicode sources such as
.

I think that a FAQ should be provided *by* Unicode about this. Even better,
this should be dealt with in detail in the next edition of the TUS. IMHO,
this is not a typographical detail that can be left to implementers to
settle: it affects the interpretation of text.

> I hope that it is clear from this example that the behaviour 
> of Ta Virama in conjunction with ZWJ & ZWNJ needs to be changed.

Why? The purpose of ZWJ and ZWNJ is one of the few things in Indic Unicode
which is quite clear.

A sequence of consonant+Virama+ZWJ always shows a half form glyph (such as
the Half-Ta in Devanagari or the Khanda Ta in Bengali).

OTOH, consonant+Virama+ZWNJ always shows a visible virama attached to a full
form.

The difference between Devanagari and Bengali is only when *no* ZWJ or ZWNJ
are present at the end of a word: Bengali behaves as if a ZWJ followed the
virama, while Devanagari behaves as if a ZW*N*J followed the virama.
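
A sketch of these rules as a decision function (in Python; "half form"
stands for whichever glyph the script uses, e.g. Devanagari Half-Ta or
Bengali Khanda Ta):

    ZWJ, ZWNJ = "\u200D", "\u200C"

    def virama_rendering(following, script):
        # What consonant+Virama displays as, given the next character.
        if following == ZWJ:
            return "half form"
        if following == ZWNJ:
            return "visible virama"
        if following == "":  # end of word, no joiner
            return "half form" if script == "Bengali" else "visible virama"
        return "conjunct with the following consonant"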

> Further more, ZWJ should be used to form half consonants in 
> Indic scripts, but it can be seen that Khanda_Ta is not a 
> half form as it is regularly used as  the last letter of a 
> word (half forms never are). 

What's wrong in saying that it is a half form?

> The behaviour should be as follows:
> 
> Ta Virama ZWNJ ... should lead to KandaTa (i.e the halant form of Ta)

This would be against the normal Indic meaning of ZWNJ, which is: show the
virama.

> e.g. The Bengali word kutsit shall be encoded as:
>Ka VowelSign.u Ta Virama ZWNJ Sa VowelSign.i Ta
> and rendered as:
>Ka VowelSign.u Ta VowelSign.i Sa Ta.
>(ZWNJ marks the separating point hence pr

RE: The result of the plane 14 tag characters review.

2002-11-18 Thread Marco Cimarosti
Dominikus Scherkl wrote:
> > A good example is the production of multilingual 
> > manuals, which seem to be more and more common these days.
> This is indeed a very good example.

... of something which is not very appropriate for plain text.

> > I agree that in this example, higher-level markup would do
> > all that is necessary.
> But I'd like to read a "README.TXT" with a plain-text editor.
> These files are very common - and if they're not deprecated
> using plane-14-tags would be very nice to have in an multi-language
> readme (where higher-level tagging is not available).

Why shouldn't it be possible to make a plain-text multi-language
"ReadMe.txt" without language tagging?

Language tagging conveys extra information *about* the text, which is not
*part* of the text itself. This extra information can be useful for a number
of purposes (including selecting language-specific typographic details), but
it is by no means *required* to transmit the text itself.

It is "ReadMe.txt", not "LookAtMyGorgeousTypography.txt". :-)

_ Marco




RE: Morse code

2002-11-18 Thread Marco Cimarosti
Otto Stolz wrote:
> Radovan Garabik wrote:
> 
> > Recently I got a crazy idea: why not include Morse code characters
> > in unicode? (Yes, I know it is crazy, but when Braille is 
> already included...)
> 
> I was under the impression that all three Morse code elements 
> are already
> in Unicode:
>U+00B7
>U+2013

·-- ···· ·- -   ·· ···   ·-- ·-· --- -· --·   ·-- ·· - ····   ··-
----- ----- ··--- -·· ··--·· (*)

*: "What's wrong with U002D?"

_ Marco




RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Marco Cimarosti
I wrote:
[...]
> A lighter-weight method is not having language tagging at all 
> in plain text. This is appropriate in two cases:
> 
> 3.a) When you don't language tagging.
[...] ^

Sorry: I meant: "When you don't need...".

_ Marco






RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Marco Cimarosti
Doug Ewell wrote:
> 1.  What extra processing is necessary to interpret Plane 14 tags that
> wouldn't be necessary to interpret any other form of tags?

In order for the question to make sense, we should compare plain text with
plain text and rich text with rich text.

1.a) Take plain text: however lightweight Plane 14 tags may be to process
(or strip), this is still heavier than "zero", which is the amount of
"processing" that would be needed if Plane 14 tags did not exist, or which
is needed if they are ignored.

1.b) Take rich text: the processing cost of plain-text is the sum of the
processing costs of each piece of plain-text resulting from the
interpretation of that rich-text protocol. Any additional cost is irrelevant
to this comparison, because it only depends on the complexity of the higher
protocol, and because it occurs *before* the plain-text fragments are
available for processing. E.g., the extra processing needed to parse XML
syntax (including XML language tagging) is not to be counted as plain-text
processing.

> 2.  What extra processing is necessary to ignore Plane 14 tags that
> wouldn't be necessary to ignore any other Unicode character(s)?

No extra processing would be necessary to ignore Plane 14 tags that wouldn't
be necessary to ignore any other Unicode characters. But I fail to see the
point of this question.

> 3.  Is there any method of tagging, anywhere, that is lighter-weight
> than Plane 14?  (Corollary: Is "lightweight" important?)

A lighter-weight method is not having language tagging at all in plain text.
This is appropriate in two cases:

3.a) When you don't language tagging.

3.b) When language tagging can be provided by a higher level protocol.

My assumption is that plain text always falls in case (3.a), and rich text
always falls in case (3.b). So far, I haven't seen any proof that this
assumption is incorrect.

_ Marco




RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Marco Cimarosti
Kenneth Whistler wrote:
> Ahem...
> 
> The Unicode Technical Committee would like to announce that no
> formal decision has been taken regarding the deprecation of
> Plane 14 language tag characters. The period for public review of
> this issue will be extended until February 14, 2003.

Out of curiosity, how did you leave the press room without passing through
the riots? "Deprecate Plane 14 Now" militants have been fighting the police
all afternoon, but no car with the Unicode flag was seen passing
through.

_ Marco




RE: Is long s a presentation form?

2002-11-11 Thread Marco Cimarosti
Michael Everson wrote:
> I like to think of the long s as similar to the final sigma. Nobody 
> thinks that final sigma should be a presentation form of sigma.

Never say "nobody": I *do* think that Greek final sigma, final Hebrew
letters, and Latin long s should all be presentation forms. I think that
they are encoded as separate characters only because of compatibility with
pre-existing standards such as ISO 8859.

Occasional exceptions to the general distributional rules of these
presentation forms would not have been a valid reason to encode them as
separate characters. Similar exceptions also occur in Indic and Arabic
scripts (e.g., the Arabic abbreviation for "plural" is a "jiim" in initial
form). These cases can be supported in plain text using ZWJ and ZWNJ:

"Wachstube" = German for "guard room";
"Wachs<ZWNJ>tube" = German for "wax tube";
<jiim + ZWJ> = Arabic for "plural".
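
(As Python string literals, the German pair would be:)

    guard_room = "Wachstube"        # the "st" ligature is wanted here
    wax_tube = "Wachs\u200Ctube"    # ZWNJ (U+200C) blocks the "st" ligature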

> Nobody really uses long s in modern Roman typography, and it's a lot 
> more convenient to have this as a separate character for the 
> nonce-uses that it has than to expect font designers round the world 
> to add special shaping tables to all their fonts just for this 
> critter.

Why "all their fonts"? Only a few fonts designed for special purposes need
to have the long/short s distinction.

_ Marco




RE: A .notdef glyph

2002-11-08 Thread Marco Cimarosti
Michael Everson wrote:
> The, ah, tail?

Hem, slightly closer to the legs.

_ Marco




RE: A .notdef glyph

2002-11-08 Thread Marco Cimarosti
Michael Everson wrote:
> John was pulling your leg. Sorry I responded to the matter. 

John Hudson wrote:
> I was indeed pulling his leg, but I also knew that he would 
> actually go off and do it.

William Overington wrote:
> Well, you claim that now!  At the time it appeared as a 
> genuine suggestion.
> Reading the suggestion again now in the light of that claim 
> produces no indication that that was the case at the time.

This kind of communication problem could be resolved scientifically by
defining a Private Use Area character to signify "The following paragraph
is leg-pulling":

U+E666 SYMBOL FOR JUST PULLING YOUR LEG

Attached to the present electronic mail, which I am sending today to this
public mailing list and to the three persons listed in the Cc box above, all
interested persons will find a little work of art which I have produced, and
which shows the glyph which should be used by the fount industry for the
character which I have designed above.

I must add that this glyph has an internationalisation problem. The same
concept which the English express with the idiomatic phrase "to pull
someone's leg", in other languages is expressed by different allegories.
Italians, for instance, pull a different part of the body, which is located
at the top of the back side of the legs.

Marco Cimarosti

8 November 2002



<>

RE: ct, fj and blackletter ligatures

2002-11-07 Thread Marco Cimarosti
Kent Karlsson wrote:
> (Subword boundaries are likely hyphenation
> points, whereas occurrences of ff, fi etc. elsewhere are
> unlikely hyphenation points.)

I am sorry to always contradict you but, in Italian, there is always a
hyphenation point between two identical consonant letters. Nevertheless,
Italian typography traditionally requires the "ff", "ffi" and "ffl"
ligatures.

BTW, this leads me to a horrible thought: would a shy hyphen between the two
f's prevent the formation of the "ff" ligature? If this is the case, fonts
might also need to have <f + SHY + f>, <f + SHY + f + i>, and
<f + SHY + f + l> in their ligature tables.

_ Marco




RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Marco Cimarosti
Lars Kristan wrote:
> > .txtUTF-8   require We want plain text files to
> > have BOM to distinguish
> > from legacy codepage files
> 
> H, what does "plain" mean?! Perhaps files with a BOM 
> should be called "text" files (or .txt files;) as
> opposed to "plain text" files, which in my opinion should
> be just that - _plain_ text. No ASCII plain text file had
> an ASCII signature. I believe "plain text" should be
> something that will be as easy to use (and handle) as
> ASCII plain text files were.

"Plain" per se means nothing, in this context. The term "plain text", in
Unicode jargon, means the opposite of "rich text".

"Rich text" (or "fancy text") is another Unicode jargon term, meaning text
containing *mark-up*, such as HTML, XML, RTF, troff, TeX, proprietary
word-processor formats, etc.

Unicode text not containing mark-up is called "plain text", regardless of
the fact that it might be quite "complicated" by the presence of BOM's, bidi
controls, etc.

_ Marco




RE: Unicodes For Devanagari: Magic The Gathering Card

2002-11-06 Thread Marco Cimarosti
Victor Campbell wrote:
> I'm looking for help with converting the text of a Sanskrit 
> trading card to
> Unicode. I am not connected with the publisher of the card, just a
> programmer who helps support a site for collectors.
> 
> I have set up a test page for experimenting with the 
> Devanagari Unicodes at
> this address:  http://victor.flaminio.com/aa_MySanskrit.html

|| 1. This is what I want: Fungal Shambler

This is NOT what you want!  :-)

You say that you want a vocalic-L (U+090C, decimal 2316), but the glyph I
see in the picture is a CONSONANT LA (U+0932, decimal 2354).

The LA glyph in the Sanskrit 98 font looks different from the LA glyph in
the Unicode charts: in your font, the right side of the letter is rounded,
while on the charts it is a straight line. But it is nevertheless the same
letter, coming in two slightly different typographic variants.

Notice that vocalic L has a "tail" under the letter: *that* is an essential
trait, and the characteristic that distinguishes the consonant from the
vowel.

|| 2. This I get instead of what I want. 
||
|| pha anusvara ga la virama sha ma virama ba virama [ZWJ] vocalic-l ra
virama 
|| फंगल्शम्ब्‍ऌर् 
||

The sequence <..., ba, virama, ZWJ, vocalic-l, ...> is wrong. It should be
<..., ba, virama, la, ...>:

..., U+092C, U+094D, U+0932, ...

Or, in decimal HTML:

... ब्ल ...

Apart from this, the encoding is correct. (The ZWJ is not wrong, just useless.)
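
(For reference, a sketch of the whole corrected sequence as a Python string
literal, assembled from the pieces above:)

    fungal_shambler = (
        "\u092B\u0902\u0917\u0932\u094D"  # pha anusvara ga la virama
        "\u0936\u092E\u094D"              # sha ma virama
        "\u092C\u094D\u0932"              # ba virama la (the corrected part)
        "\u0930\u094D"                    # ra virama
    )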

|| SO HOW DO I GET IT TO LOOK LIKE #1 WITH UNICODE? 
|| (The desired form of sha, and ba joined with the desired form of
vocalic-l.) 

If you still see it incorrectly, it is because your font or operating system
doesn't fully support Indic rendering.

You can upgrade your PC to a different font or operating system but,
unfortunately, there is nothing you can do to ensure that your users will do
the same.



BTW, a couple of side notes about the transliteration:

- There is a special character to transliterate European "f": U+095E (dec.
2398). It looks like PHA with a dot under it.

- Using anusvara for the "n" in "fungal" and a MA for the "m" in "shambler"
seems inconsistent. I'd either use anusvara for both or NGA (U+0919, decimal
2329) for the first one and MA for the second one.

- "Fungal shambler" is two words in English: why did you join them in
Devanagari?




Regards.

_ Marco




RE: In defense of Plane 14 language tags (long)

2002-11-05 Thread Marco Cimarosti
John Cowan wrote:
> Marco Cimarosti scripsit:
> 
> > { As a side note, the idea that a language may use "foreign" 
> words seems
> > terribly naive to me. It is true that, in Italian, we use 
> loanwords such as
> > "hardware", "punk", or "footing", but it would be silly to 
> consider or tag
> > them as "English words". They are genuinely Italian words, [...]
> 
> In English, however, the distinction between borrowings and truly
> foreign words does make sense.  Such a word as 
> Weltanschauung, for example, [...]

It seems that *I* was terribly naive... This distinction does indeed make
sense for Italian too: when we occasionally use English phrases like "last
but not least" or "the American dream", we do approximate the English
pronunciation as well as we can.

> Even in Italian, what about Latin terms embedded in classic poetry?
> Are you going to say that those too are Italian, just with a slightly
> peculiar morphology?

This is a bad example: in poetry and in prose, Latin terms and names are
normally heavily adapted orthographically and morphologically. E.g., the
Italian for "amphorae" is "anfore"; the Italian for "Julius Caesar" is
"Giulio Cesare", etc. Even terms not adapted orthographically, like "Regina
Coeli" are heavily adapted phonetically (/re'dʒina 'tʃeli/).

> Plain-text tags don't nest, however: you need to give a tag explicitly
> naming the outer language when you return to it.

I was talking about nesting plain-text language tags into *rich*-text
language tags.

> > If they are rendered as invisible glyphs, they make the 
> text more difficult
> > to edit and to move the cursor within, because the user 
> will have no way of
> > understanding why the cursor stops twice in apparently 
> random positions.
> > This also exposes the information contained in language tags to be
> > unwillingly corrupted by subsequent editing.
> 
> This argument proves too much: it applies with equal force to the
> invisible bidi controls and the other Unicode controls.  In practice
> these things are not available for plaintext-style editing except in a
> "reveal controls" mode, which could equally well reveal the tags using
> some stylized glyphs.

This doesn't seem a valid argument. Perhaps those other things deserve to
be deprecated as well. Or perhaps they are so important that they are worth
the trouble.

Talking specifically about the bidi controls, there are a few intrinsic
differences from Plane 14 language tags:

- The meaning of bidi controls is clearly defined in the Unicode standard,
and this meaning is mandatory unless a higher level protocol is in action.
The exact interpretation of plane 14 language tags is entirely left to the
application, so they are just a standard mechanism to define higher level
protocols: they have no meaning without a higher level protocol.

- The bidi controls exist to address (rare, but existing) problems which
would affect the basic readability of the text: specifically, they address
cases in which the apparent reading order would be different from the actual
logical order. Plane 14 language tags mainly address an esthetic/political
issue.

_ Marco




RE: Special characters

2002-11-05 Thread Marco Cimarosti
Johan Marais wrote:
> Could someone tell me whether it is possible to produce the following
characters please?
> 
> k with a small line underneath
> K with a small line underneath

ḵ/Ḵ (U+1E35/U+1E34, LATIN SMALL/CAPITAL LETTER K WITH LINE BELOW)

> H with a dot underneath
> h with a dot underneath

ḥ/Ḥ (U+1E25/U+1E24, LATIN SMALL/CAPITAL LETTER H WITH DOT BELOW)

> B with a small line underneath
> b with a small line underneath

ḇ/Ḇ (U+1E07/U+1E06, LATIN SMALL/CAPITAL LETTER B WITH LINE BELOW)

> D with a small line underneath
> d with a small line underneath

ḏ/Ḏ (U+1E0F/U+1E0E, LATIN SMALL/CAPITAL LETTER D WITH LINE BELOW)

> G with a line on top
> g with a line on top

ḡ/Ḡ (U+1E21/U+1E20, LATIN SMALL/CAPITAL LETTER G WITH MACRON)

> E with an upside down ^ on top
> e with an upside down ^ on top

ě/Ě (U+011B/U+011A, LATIN SMALL/CAPITAL LETTER E WITH CARON)

All these can also be obtained with the regular letters *followed* by a
combining diacritic:

x̱ (U+0331, COMBINING MACRON BELOW)
x̣ (U+0323, COMBINING DOT BELOW)
x̄ (U+0304, COMBINING MACRON)
x̌ (U+030C, COMBINING CARON)
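
(Incidentally, a quick way to check each equivalence, using Python's
standard unicodedata module: NFC normalization composes the
base-plus-combining sequence into the precomposed letter. E.g., for k with
line below:)

    import unicodedata

    # "k" + COMBINING MACRON BELOW composes to U+1E35 under NFC:
    assert unicodedata.normalize("NFC", "k\u0331") == "\u1E35"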

> Mirror image of a comma, but not at the bottom - should be higher, like an
'

It depends on what the mark means. If it is part of the spelling, it should
probably be:

ʽ (U+02BD, MODIFIER LETTER REVERSED COMMA)

If it is punctuation (open quote), it should be:

‛ (U+201B, SINGLE HIGH-REVERSED-9 QUOTATION MARK)

_ Marco




RE: In defense of Plane 14 language tags (long)

2002-11-05 Thread Marco Cimarosti
Doug Ewell wrote:
> [...]
> Readers are asked to consider the following arguments individually, so
> that any particular argument that seems untenable or contrary to
> consensus does not affect the validity of other arguments.
> [...]

Here are my three pence *pro* the deprecation:

> 1.  Language tags may be useful for display issues.
> 
> The most commonly suggested use, and the original impetus, 
> for Plane 14 language tags is to suggest to the display
> subsystem that “Chinese-style” or “Japanese-style” glyphs
> are preferred for unified Han characters. [...]

IMHO, there has never been any practical need to consider these glyphic
differences in plain text. It is a non-issue raised to the rank of issue
because of obscure political reasons.

It is false that Japanese is unreadable if displayed with Chinese-style
glyphs, or that Polish is unreadable if displayed with Spanish-styles acute
accents.

It is true that any language looks odd if displayed with an improper font,
and that these esthetic issues must be properly addressed in "rich text" and
in decent typography.

But such a level of graphical correctness does not apply to plain text: if
it did, we would also have to rule out many other typographic
simplifications which are in current use, such as fixed-width fonts for
Western scripts, fixed-height fonts for the Arabic script, horizontal
display of Japanese, etc.

> 2.  Language tags may be useful for non-display issues.
> 
> Although not frequently mentioned, plain-text language tagging could
> also be useful for applications such as speech synthesis,
> spell-checking, and grammar checking. [...]

These kinds of applications cannot rely on the presence of any kind of
language tagging because, in most real-world cases, it will not be present.

{ As a side note, the idea that a language may use "foreign" words seems
terribly naive to me. It is true that, in Italian, we use loanwords such as
"hardware", "punk", or "footing", but it would be silly to consider or tag
them as "English words". They are genuinely Italian words, as demonstrated
by the fact that their pronunciation is very different from the English
(['ɑrdwer(e)], ['pɑŋk(e)] and ['futiŋg(e)], respectively), that their
morphology is different (e.g., plural is invariable), and that their meaning
is slightly different ("hardware" only refers to computers, "punk" only
refers to music and fashion), or even totally different from the English
original ("footing" means "jogging"). }

> 3.  Conflict with HTML/XML tags need not be a problem.
> 
> A common criticism of the Plane 14 language tags is that higher-level
> protocols such as HTML and XML already provide a mechanism 
> for language tagging.  There is a concern that the language specified
> by the “lang” attribute in HTML or “xml:lang” attribute in XML could 
> conflict with the one specified in a Plane 14 language tag, [...]

As I see it, the problem is not merely that the two fashions of tags may
specify different languages. That would not be a real conflict. It is
perfectly legitimate to embed language tags into each other: the rule is
that the inner language tag wins. This general rule can be extended to
accommodate plain text tags: they will always take precedence, as they are
clearly the innermost specification.

The real problem is with *overlapping* and *unpaired* tags. XML parsers have
built-in validation of the tree structure of a document, which ensures that
all tags are properly opened, closed and embedded into each other. E.g.
(using <en> and <fr> as shorthand for spans carrying the language
attributes), overlapping spans like:

<en> ABC <fr> DEF </en> GHI </fr>

would not pass validation, because the English and French spans overlap
irregularly (as do tags <en> and <fr>).

But that built-in validation cannot properly detect a situation like:

<en> ABC \uE0001 \uE0066 \uE0072 DEF </en> GHI \uE007F

where the English span (specified in tag <en>) overlaps with the French span
(specified with plain text tags).

Just suggesting to ignore plain text tags is no solution, because this would
waste part of the information (and the author's effort to provide this
information).

> 6.  Plane 14 tags are easy to filter out, and harmless if not
> interpreted.

If they are not processed correctly or filtered out, they are by no means
harmless.

If they are rendered as visible glyphs (such as [LNG][f][r]) or with
"missing glyph" boxes, they clutter the text, making it less readable --
i.e., they aggravate the very problem that they were supposed to solve.

If they are rendered as invisible glyphs, they make the text more difficult
to edit and to move the cursor within, because the user will have no way of
understanding why the cursor stops twice in apparently random positions.
This also exposes the information contained in language tags to be
unwillingly corrupted by subsequent editing.
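
(The filtering itself is indeed trivial -- a minimal sketch in Python:)

    def strip_plane14_tags(s):
        # Drop every character in the tag block U+E0000..U+E007F.
        return "".join(ch for ch in s if not 0xE0000 <= ord(ch) <= 0xE007F)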

_ Marco




RE: Header Reply-To

2002-11-04 Thread Marco Cimarosti
Stefan Persson wrote:
> > > Why doesn't that page follow the ASCII standard and/or 
> any ASCII-based
> > > standard?
> >
> > What? As far as I can tell, it's 100% ASCII.
> 
> It doesn't follow the ASCII standard as far as quotation marks are
> concerned.

Using ` and ' as quotation marks is a long-standing Internet convention. See
RFC #20, October 1969 (sic!):

http://www.faqs.org/rfcs/rfc20.html

"4.2 Graphic Characters
[...]
2/7 '   Apostrophe (Closing Single Quotation Mark Acute Accent [2])
[...]
6/0 `   Grave Accent [2,3] (Opening Single Quotation Mark)
[...]
2 The use of the symbols in 2/2, 2/7, 2/12, 5/14, 6/0, and 7/14 as
diacritical marks is described in Appendix A, A5.2
[...]"

RFC #20 contains no Appendix A, so I guess that this usage is described in
the ASCII standard itself.

_ Marco




RE: Character identities

2002-10-30 Thread Marco Cimarosti
Kent Karlsson wrote:
> > I insist that you can talk about character-to-character 
> > mappings only when
> > the so-called "backing store" is affected in some way.
> 
> No, why?  It is perfectly permissible to do the equivalent
> of "print(to_upper(mystring))" without changing the backing
> store ("mystring" in the pseudocode); to_upper here would
> return a NEW string without changing the argument.

And that, conceptually, is a character-to-glyph mapping.

To my mind, you are so deep into the OpenType architecture, and so
accustomed to the concept that glyphization is what a font "does", that you
can't see the big picture.

If you look at Unicode from a platform-independent perspective, fonts do not
necessarily "do" anything. In some architectures, fonts are just inert
repositories of glyphs, and the display "intelligence" is somewhere outside
the font.

> > If the backing store
> > is not changed, it is only a character-to-glyph mapping, 
> > however complicate and indirect it may be.
> 
> Yes.  But with several font technologies "the user" can affect
> the mapping in some ways, via "features". [...]

Even in the simplest of technologies, the user can affect the mapping in
some way, e.g. using a different font.

> My claim is that it is a bad idea for fonts (I don't dare
> say "Unicode font" at this point) to do what *amounts to*
> such in-effect character mappings *without explicit request*
> from whoever is "in charge of" the text in some way (author,
> editor, graphic designer, reader who like to make changes to
> the text, ...).  Such changes should NOT be the result of
> JUST changing font.

These are all undue generalizations of the OpenType paradigm. Not all fonts
"do" something (let alone do what you wish them to do); not all font
technologies have "modes" (or rather, *no* font technologies have "modes",
except in theory).

> > To me, a glyph floating atop of letters "a", "o" and "u" is 
> > recognizably a
> > German umlaut if (a) the text is written in German, and (b) 
> > the glyph has
> > one of the following shapes:
> > 
> > 1. Two small "blobs" (e.g. circles, squares, acute accents) 
> > places side by side;
> 
> I'm going to opt staying on the restrictive side here.
> 
> Except for the last one, that is a diaeresis, yes.  That is the
> modern standard way of writing "umlaut" in typeset German. The
> last one is a double acute, which is normally not used for this
> in German, and it is stretching things a bit too far to consider
> it a glyph variant of diaeresis.

I think what is really stretching things is not seeing that the "umlaut" of
most Fraktur fonts looks like a double acute: a shape which is consistent
with the usual shape of the dots on "i" and "j".

BTW, strangely, you don't seem to be worried by the fact that "i" and "í"
also look the same... What if I use Fraktur for Spanish?

[...]
> > >If (and only if!) the author/editor of the text asks for an
> > > overscript e should the font produce one. It is not up to
> > > the font maker to make such substitutions without request,
> > 
> > Yes. But a font which displays U+0308 with a glyph resembling 
> > the typical
> > glyph for U+0364 is not "producing" anything; it is not 
> "substituting"
> > anything with anything else: it is just faithfully 
> > reproducing the text,
> > according to the content decided by the author *and* 
> according to the
> > typographical style decided by the font designer.
> 
> This is not a typographic decision, it is a spelling decision,
> and not up to the font designer, I'd say.  It is a typographic
> decision whether the diaeresis "digs into" the glyph below, or if
> an e-above looks like a capital e inside.  But spelling changes,
> whether transient or permanent, should be the "author's" call.

It is a cat biting its tail (*). If you consider it a "glyph variation", it
is just a typographic decision; if you consider it a "character change", it
becomes an orthographic issue.

But considering a "character change" the fact that a certain code point is
displayed with a certain glyph is, IMHO, totally out of the letter and
spirit of the Unicode character-glyph model.

(*: Am I exporting an Italian idiom or is this used in English too? Anyway,
it means "a chicken-egg issue")

_ Marco




RE: RE: Character identities

2002-10-30 Thread Marco Cimarosti
I said:
> Ah! I never realized that the Sütterlin zig-zag-shaped "e" 
> was the missing with the "¨" glyph!
^

Sorry: "... the missing LINK with ...".

_ Marco




RE: RE: Character identities

2002-10-30 Thread Marco Cimarosti
Doug Ewell wrote:
> Actually, the Sütterlin umlaut-mark is a small italicized
> "e," which is very similar to an "n."  What it really
> ends up looking like, from a distance, is a double acute.

Ah! I never realized that the Sütterlin zig-zag-shaped "e" was the missing
with the "¨" glyph!

Thanks! After all, this discussion has not been completely useless. :-)

_ Marco




RE: Character identities

2002-10-30 Thread Marco Cimarosti
Alain LaBonté wrote:
> [Alain]  However I agree with Kent. Let's say a text 
> identified as German quotes a French word with an
> U DIAERESIS *in the German text* (a word like
> "capharnaüm").

A Fraktur font designed solely for German should not be used for typesetting
French words. (And, BTW, that is probably why German Fraktur books used
roman type for foreign words).

In general, you cannot expect a good result when using a font designed for
one language to typeset another: see, in the attached image, what your
"capharnaüm" looks like in a font designed for Chinese. Nice typography, eh?
That "ü" is so weird because it is designed to be used in conjunction with
the fullwidth letters at U+FF41..U+FF5A, which is perhaps the right choice
for Chinese, but not for French.

_ Marco



RE: New Charakter Proposal

2002-10-30 Thread Marco Cimarosti
Dominikus Scherkl wrote:
> I would like to have a "source failure indicator symbol" (SFIS)
> charakter in the unicode, which a charset-convertion unit may
> insert into a text (Suggeested position: U+FFF8).
> 
> [...]
> 
> Of course a converter can still use U+FFFD if it has no
> idea which character is intended or if unicode doesn't contain
> the character.

I remember reading on this list about a proposal to allocate 256 code points
to represent the bytes of a non-Unicode character set which could not be
converted to Unicode.

What happened to that proposal? Was it ever formalized? If yes, was it
refused?

> The whole "charakter identities"-discussion gave me another
> reason to introduce such a SFIS-charakter:
> A font-renderer may show the SFIS before a charakter which
> is replaced by another one [...]

Sorry for repeating myself, but my opinion is that a renderer is *never*
allowed to change one character to another. IMHO, all that discussion was
about the shape of glyphs, not about changing characters.

> I'd like to hear if my suggestion is completely weird or
> if anybody else think it might be useful.

One problem can be the nature of the code point which follows the "SFIS".

Imagine that a stream, encoded in a certain character set, contains the byte
0xBF and that this byte is undefined in that character set. Mapping the
stream to Unicode, you convert 0xBF into a sequence of "SFIS" and U+00BF.
Clearly, that U+00BF would just be a placeholder for the unknown byte, not
an "INVERTED QUESTION MARK". 

The problem is that interpreting U+00BF as anything different from an
"INVERTED QUESTION MARK" violates Unicode Conformance Requirement C7: "A
process shall interpret a coded character representation according to the
character semantics established by this standard, if that process does
interpret that coded character representation."

Another problem, more practical, is that if the unrecognized byte is in the
ranges 0x00..0x1F or 0x7F..0x9F, this would generate the code point of a
Unicode control character, and this could have undesired effects. E.g.,
U+0000 is often a string terminator; U+001B could trigger unexpected escape
sequences, etc.
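
To make both problems concrete, here is a hypothetical sketch of such a
converter (the names SFIS and UNDEFINED are mine, and U+FFF8 is used only
because it is the code point suggested in the proposal):

    #include <stddef.h>

    #define SFIS      0xFFF8  /* hypothetical "source failure" code point */
    #define UNDEFINED 0xFFFF  /* sentinel for unmapped bytes in the table */

    /* Convert a byte stream to UTF-16 code units using a 256-entry
       byte-to-BMP table; an undefined byte becomes SFIS followed by
       U+00xx.  Returns the number of code units written to `out`
       (at most 2 * len). */
    size_t decode_with_sfis(const unsigned char *in, size_t len,
                            const unsigned short table[256],
                            unsigned short *out)
    {
        size_t i, n = 0;

        for (i = 0; i < len; ++i) {
            if (table[in[i]] != UNDEFINED) {
                out[n++] = table[in[i]];
            } else {
                out[n++] = SFIS;
                out[n++] = in[i]; /* U+00xx placeholder: a C0/C1 control
                                     code whenever the byte is in
                                     0x00..0x1F or 0x7F..0x9F -- exactly
                                     the hazard described above */
            }
        }
        return n;
    }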

_ Marco




RE: Character identities

2002-10-30 Thread Marco Cimarosti
Keld Jørn Simonsen wrote:
> On Tue, Oct 29, 2002 at 09:07:16PM +0100, Marco Cimarosti wrote:
> > Kent Karlsson wrote:
> > > Marco, 
> > 
> > Keld, please allow me to begin with the end of your post:
> 
> I really have not contributed much to this thread, I think you mean
> "Kent".

Oh no! Again! Apologies to both of you! I am seriously starting to worry
about my dyslexia...

_ Marco






RE: Character identities

2002-10-29 Thread Marco Cimarosti
Kent Karlsson wrote:
> Marco, 

Keld, please allow me to begin with the end of your post:

>Marco, please calm down and reread every sentence of my
> previous message.  You seem to have misread quite a few things,
> but it is better you reread calmly before I try to clear
> up any remaining misunderstandings.

I have been absolutely calm, and I apologize if I gave a different
impression. I may occasionally heat up when discussing things like ethics,
politics, religion, racism, war, etc., but definitely not when discussing
the details of the Unicode character-glyph model.

I wish to recall that we are just discussing a glyph variation for a
diacritic character: a variation that I consider acceptable and you consider
undesirable. Please let's not make this bigger than it could reasonably be.

>Standard orthography, and orthography that someone may
> choose to use on a sign, or in handwriting, are often not
> the same.
> 
>And I did say that current font technologies (e.g. OT)
> does not actually do character to character mappings,
> but the net effect is *as if* they did (if, and I hope
> only if, certain "features" are invoked, like "smallcaps").
> It would be more honest to do them as character-to-character
> mappings though, either inside (which OT does not support)
> or outside of the font.  Capital A, even at x-height, is not
> a glyph variant of small a (even though, centuries ago, that
> was the case, but then I and J were the same, and U and V,
> et and &, ad and @, ...). But displaying U as V (in effect
> doing a character replacement on a copy of the input) would
> be ok in a non-default mode (using the "hist" feature, say).

I insist that you can talk about character-to-character mappings only when
the so-called "backing store" is affected in some way. If the backing store
is not changed, it is only a character-to-glyph mapping, however complicated
and indirect it may be.

Whether these mappings take place inside or outside a font is irrelevant as
long, again, as the backing store is not changed.

> My point here is that that replacement (effectively) should
> not be done by default in a Unicode font (see Doug's explanation
> for what a Unicode font is, if you don't like mine).

I totally agree with Doug's careful definition, and I am glad that you agree
as well.

Doug indicates two key points that a font must respect to be suitable for
Unicode:

« [...] calling a font a "Unicode font" implies two things:
1. It must be based on Unicode code points. [...]
2. The glyphs must reflect the "essential characteristics" of the Unicode
character to which they are mapped. [...] »

If we agree that the only requirement for a glyph representing a certain
Unicode character is to respect the "essential characteristics" which make
it recognizable, then all our discussion is simply about determining which
"essential characteristics" a particular character is supposed to have.

To me, a glyph floating atop of letters "a", "o" and "u" is recognizably a
German umlaut if (a) the text is written in German, and (b) the glyph has
one of the following shapes:

1. Two small "blobs" (e.g. circles, squares, acute accents) places side by
side;
2. A straight horizontal line;
3. A wavy horizontal line;
4. A small lowercase "e", or something recalling it.

I don't argue this out of caprice or provocation, but because these
particular shapes are commonly attested in one context or another: be it
modern typography, traditional typography, handwriting, fancy graphics, etc.

You seem to argue that only case 1 is acceptable, and you would probably
also add some constraints on the shape of the "blobs" (e.g., I think I
understood that you find a double acute shape unacceptable).

As I see it, the only reason why you say this is that the other shapes are
similar or identical to the typical shapes of other Unicode characters. As I
said, I don't find that this is a valid reason, unless the font we are
talking about is to be used in contexts (e.g., linguistics, or languages
other than German) in which the distinction is meaningful.

> > [...] I never heard that U+0364
> > (COMBINING LATIN SMALL LETTER E) is part of the spelling of 
> > modern German or Swedish.
> 
>True (that is not part of modern standard orthography),
> but I don't see how that could imply some kind of support
> for your (rather surprising and extreme) position.

(Frankly, I find your position surprising and extreme -- perhaps we're only
choosing bad examples.)

What I meant is that if (a) U+0364 is not supposed to appear in modern
German, and (b) the font we are considering is designed to be used for
modern German only, then (c) the possibility of confusing U+0364 with U+0308
is a non-issue.

>If (and only if!) the author/editor of the text asks for an
> overscript e should the font produce one. It is not up to
> the font maker to make such substitutions without request,

Yes. But a font which displays U+0308 with a glyph resembling the typical
glyph for U+0364 is not "producing" anything; it is not "substituting"
anything with anything else: it is just faithfully reproducing the text,
according to the content decided by the author *and* according to the
typographical style decided by the font designer.

RE: Character identities

2002-10-29 Thread Marco Cimarosti
Kent Karlsson wrote:
> > The claim was that dieresis and overscript e are the same 
> in *modern*
> > *standard* German. Or, better stated, that overscript e is
> > just a glyph
> > variant of dieresis, in *modern* *standard* German typeset 
> in Fraktur.
> 
> Well, we strongly disagree about that then.  Marc and I 
> clearly see them as different.  More about this below.

We could simply agree to disagree, were it not for the fact that we both
claim that each other's view violates the principles of Unicode.

I have tried to show that glyphic variation is part of the principles of
Unicode, as per TUS 3.0. You might wish to point us to where the current
Unicode Standard supports your view, or contradicts mine.

> > However, IMHO, the presence U+0364 (COMBINING LATIN SMALL
> > LETTER E) in a
> > modern German or Swedish text is just a plain spelling error,
> > and even the
> > naivest spellchecker should flag it as such.
> 
> So what? Naïve spell checkers flag all kinds of correctly spelled
> words!

Yes but, IMHO, in this case they would be right: I never heard that U+0364
(COMBINING LATIN SMALL LETTER E) is part of the spelling of modern German or
Swedish.

> Not quite.  Please note that some characters are defined to have
> very specific glyphs, e.g. the estimated sign, there is no shape
> variability except for size.

A small set of *symbols* like the estimated sign and some dingbats is an
exception to the rule that Unicode encodes characters but not glyphs.

> Others are "glyphically allocated/
> unified", like the diacritics, and some glyphic variability is
> expected. But a diaeresis is two dots (of some shape, and it would
> be a margin case to have them elongated), never a tilde, macron
> or overscript e.

Would you care to go to Germany and have a look at shop signs? The umlaut is
more often a straight line than not. But this doesn't make it a "macron":
there is no macron in German.

> Those are other characters, not just a glyph variation.

So I was wrong: German orthography uses macrons! Can you please explain the
German pronunciation of "ā", "ō" and "ū"?

> Other characters have more glyphic variability
> (informally) associated with them, like A, but some of them
> have compatibility variants that have a somewhat more restricted
> glyphic variability, like the Math Fraktur A in plane 1.

More *symbol* characters which escape the general rule. 

> Some scripts have by tradition some very "strong" ligatures;
> "strong" in the sense that may be hard to recognise the ligated
> pieces in the result glyph.  That does not mean that you can
> legitimately use an M glyph for One Thousand C D, just because
> they "mean" the same.

Perhaps: it could have been a poor example. But the converse is much more
important: you cannot use a character in place of another one that "means" a
different thing just because you want a different "look".

> Nor does that mean that diacritics can be
> substituted for each other, asking for a diaeresis and get a tilde.

Substituting diacritics for each other is what *you* seem to suggest!

> Yes, it is common practice with many to use a tilde instead of
> a diaresis in handwriting, but it is still character substitution,
> not a glyphic variant (since that is the way diacritics are
> allocated in Unicode).

So, German orthography uses tildes too! Can you please explain the German
pronunciation of "ã", "õ" and "ũ"?

> > What Unicode really mandates is that the encoding should 
> not change to
> > obtain a certain graphic effect.
> 
> You can do any character mappings you like before you apply any
> font, or make it into graphics...

There can be no character-to-character mapping inside a font or a display
engine! Applications are allowed to do character-to-character mappings only
when they want to *change* the text in some way (e.g., a case conversion, a
transliteration, etc.), not when they want to display it.

Displaying Unicode only implies character-to-glyph mappings. Internally,
there can be some glyph-to-glyph mapping, but never a character-to-character
mapping. Even character-to-character mappings done on a temporary copy of
the text are, conceptually, a step in the character-to-glyph mapping.

This fundamental error runs throughout your whole post, and makes it
impossible to go into the details without my keeping on saying: you can't do
any character-to-character mappings during display; you can't do any
character-to-character mappings during display; you can't do any...

> I was trying to be general (not fancy) and not just talk about
> Opentype.  But yes, I meant (at least) the case where no
> "features" (or similar) are invoked.

Who says that there are any "features" to be invoked? There is no such
requirement in Unicode!

> What I was aiming at excluding were "features" that implicitly
> involve character mappings, [...]

You see? "You can't do any character-to-character mappings during display."
For simplicity, I will simply cut off all passages where you a

RE: Character identities

2002-10-28 Thread Marco Cimarosti
Kent Karlsson wrote:
> > > For this reason it is quite impermissible to render the
> > > combining letter small e as a diaeresis
> >
> > So far so good. There would be no reason for doing such a thing.
> ...
> > > or, for that matter, the diaeresis as a combining
> > > letter small e (however, you see the latter version
> > > sometimes, very infrequently, in advertisement).
> >
> > This is the case I though we were discussing, and it is a
> > very different case.
> 
> No, the claim was that diaresis and overscript e are the same,

The claim was that dieresis and overscript e are the same in *modern*
*standard* German. Or, better stated, that overscript e is just a glyph
variant of dieresis, in *modern* *standard* German typeset in Fraktur.

Sorry if I haven't stated this clearly enough.

> so the reversed case Marc is talking about is not different at all.

It is. In the first case, we are talking about a glyph variant in *modern*
*standard* German; in the second case, we are talking about two different
diacritics in some *other* context. (Ancient German? Ancient Swedish?)

> > Standing Keld's opinion and Marc's wholehearted support, it
> 
> Please don't confuse me with Keld!

Oooops! My apologies!

> > follows that
> > those infrequent advertisements should be encoded using U+0364...
> >
> > But U+0364 (COMBINING LATIN SMALL LETTER E) belongs to a
> > small collection of
> > "Medieval superscript letter diactrics", which is supposed 
> to "appear
> > primarily in medieval Germanic manuscripts", or to reproduce
> > "some usage as late as the 19th century in some languages".
> 
> Yes, but you should not read too much into the explanation,
> which, while correct, does not limit the existence of their
> glyphs to fonts used only by germanic professors...
> Some of them (overscript e in particular) should be(come)
> quite commonly occurring in any Fraktur Unicode font.

"Commonly" sounds funny near "Fraktur"...

> > Using such a character to encode 21st century advertisements
> > is doomed to cause problems:
> >
> > 1) The glyph for U+0364 is more likely found in the font
> > collection of the
> > Faculty of Germanic Studies that on the PC of people wishing
> > to read the
> > advertisement for "Ye Olde Küster Pub". So, most people will
> > be unable to
> > view the advertisement correctly.
> >
> > 2) The designer of the advertisement will be unable to use
> > his spell-checker and hyphenator on the advertisement's text.
> 
> Advertisements should invariably be final spell-checked and
> hyphenated by humans!  Automated spell checkers and hyphenators
> for German (as well as Scandinavian languages) have (so far)
> not been good enough even for running text that you want to
> publish...

This has no connection with this discussion.

However, IMHO, the presence of U+0364 (COMBINING LATIN SMALL LETTER E) in a
modern German or Swedish text is just a plain spelling error, and even the
naivest spellchecker should flag it as such.

> > 3) User's will be unable to find the Küster Pub by searching
> > "Küster" in a
> > search engine.
> 
> Depends on the search engine, and if it uses a correct collation
> table (for the language) or not...
>
> > What will actually happen is that everybody will see an empty
> > square, so
> > they'll think that the web designer is an idiot, apart the
> > professors at the
> > Faculty of Germanic Studies, who'll think that the designer
> > is an idiot
> > because she doesn't know the difference between U+0308 and
> > U+0364 in ancient German.
> 
> Most modern use of Fraktur seem to use diaeresis or double
> acute for this. 

U+0308 (COMBINING DIAERESIS) should be the only "umlaut" to be found in
modern German text. What that diacritic *looks* like (two dots, an "e", a
double acute, a macron, Mickey Mouse's ears) is a choice of the font
designer.

> (But the web designer could use a dynamically
> downloaded font fragment, if there is worry that all glyphs
> might not be supported by the fonts used by the vast majority
> of the target audience.)

This too has no connection with this discussion, and is OT. Unicode is
concerned with how text is *encoded*; the details of fonts and display
technology are out of its scope.

What Unicode really mandates is that the encoding should not change to
obtain a certain graphic effect.

> > The real error (IMHO) is the idea that font designers should
> > stick to the
> > *sample* glyphs printed on the Unicode book, because this 
> would force
> 
> Well, the diacritics are allocated/unified on glyphic grounds.
> While a diaeresis may look different from font to font, it is
> basically two "dots" (of some shape in line with the design of the
> font), never an "e" shape.  At least not in the *default mode* of a
> *Unicode font*.
>
> And overscript small e will also vary with the font,
> looking like a shrunken ordinary e glyph of (ideally) the same font.
> But never like two dots (in the default mode of a Unicode font).

You haven't yet defined your meaning of "Unicode font".

RE: Character identities

2002-10-25 Thread Marco Cimarosti
Kent Karlsson wrote:
> >... Like it or not, superscript e *is* the
> > same diacritic
> > that later become "¨", so there is absolutely no violation of
> > the Unicode
> > standard. Of course, this only applies German.
>
> Font makers, please do not meddle with the authors intent
> (as reflected in the text of the document!).  Just as it
> is inappropriate for font makers to use an ø glyph for ö
> (they are "the same", just slightly different derivations
> from "o^e"), it is just as inappropriate for font makers to
> use a "o^e" glyph for ö (by default in a Unicode font). Though
> in some sense the "same" they are still different enough for
> authors to care, and it is up to the document author/editor
> to decide, not the font maker.

It is certainly up to the author of the document to decide.

But, as I explained at more length in my reply to Marc, there are two
different approaches to deciding this:
1. When this decision is a matter of *content* (as may be the case when
writing about linguistics, to differentiate spellings with "o^e" from
spellings with ö), it is more appropriate to make the difference at the
*encoding* level, by using the appropriate code point.

2. When this decision is only a matter of *presentation*, it is more
appropriate to make the difference by using a font which uses the desired
glyph for the normal "¨".

> If the "umlaut" to "overscript e" transformation is put under
> this feature for some fonts, I see no major reason to complain...
> (As others have noted, it does not really work for the long s,
> unless the language is labelled 'en'...)

And, of course, in an ideal world, option 2 would be done by switching a
font feature, rather than switching to an ad-hoc font. This makes it
possible for font designers to provide a single font which covers both
needs. But this is just optimization, not compliance!

_ Marco




RE: Character identities

2002-10-25 Thread Marco Cimarosti
Marc Wilhelm Küster wrote:
> At 14:04 25.10.2002 +0200, Kent Karlsson wrote:
> >Font makers, please do not meddle with the authors intent
> >(as reflected in the text of the document!).  Just as it
> >is inappropriate for font makers to use an ø glyph for ö
> >(they are "the same", just slightly different derivations
> >from "o^e"), it is just as inappropriate for font makers to
> >use a "o^e" glyph for ö (by default in a Unicode font). Though
> >in some sense the "same" they are still different enough for
> >authors to care, and it is up to the document author/editor
> >to decide, not the font maker.
> 
> My wholehearted support!
> 
> [...]
> 
> For this reason it is quite impermissible to render the 
> combining letter small e as a diaeresis

So far so good. There would be no reason for doing such a thing.

If the author of a scholarly work used U+0364 (COMBINING LATIN SMALL LETTER
E), this character should be displayed as either a letter "e" superscript to
the base letter, or as an empty square (for fonts not caring about that
character).

> or, for that matter, the diaeresis as a combining 
> letter small e (however, you see the latter version
> sometimes, very infrequently, in advertisement).

This is the case I thought we were discussing, and it is a very different
case.

Given Keld's opinion and Marc's wholehearted support, it follows that
those infrequent advertisements should be encoded using U+0364...

But U+0364 (COMBINING LATIN SMALL LETTER E) belongs to a small collection of
"Medieval superscript letter diacritics", which is supposed to "appear
primarily in medieval Germanic manuscripts", or to reproduce "some usage as
late as the 19th century in some languages".

Using such a character to encode 21st century advertisements is doomed to
cause problems:

1) The glyph for U+0364 is more likely found in the font collection of the
Faculty of Germanic Studies than on the PC of people wishing to read the
advertisement for "Ye Olde Küster Pub". So, most people will be unable to
view the advertisement correctly.

2) The designer of the advertisement will be unable to use his spell-checker
and hyphenator on the advertisement's text.

3) Users will be unable to find the Küster Pub by searching for "Küster" in
a search engine.

What will actually happen is that everybody will see an empty square, so
they'll think that the web designer is an idiot -- apart from the professors
at the Faculty of Germanic Studies, who'll think that the designer is an
idiot because she doesn't know the difference between U+0308 and U+0364 in
ancient German.

The real error (IMHO) is the idea that font designers should stick to the
*sample* glyphs printed in the Unicode book, because this would force
graphic designers to change the *encoding* of their text in order to get the
desired result.

Another big error (IMHO, once again) is the idea that two different Unicode
characters should always look different. The difference must be preserved
when it is useful -- e.g., U+0308 should not look like U+0364 in a font
designed for publishing books on the history of German!

What should really happen, IMHO, is that modern German should be encoded as
modern German. A U+0308 (COMBINING DIAERESIS) should remain a U+0308,
regardless of whether the corresponding glyph *looks* like U+0364 (COMBINING
LATIN SMALL LETTER E) in one font, looks like U+0304 (COMBINING MACRON) in
another font, looks like two five-pointed stars side by side in a third
font, or looks like Mickey Mouse's ears in ...

_ Marco




RE: need open source tools to convert indic font encoding into ISCII or Unicode

2002-10-25 Thread Marco Cimarosti
Frank Tang wrote:

> I am looking for open source tool (C / C++ / Perl or Java) to convert 
> between (UTF-8/UTF-16 or ISCII) and differnt  Indict font encoding. 
> Please let me know if you know anything available.
> 
> Language:
>  C,
>  [...]
> 
> Convert from A to / from B where
> A mean
>  UTF-8
>  UTF-16, or
>  ISCII
> 
> B mean
>  font encoding of Nadunia font
>  font encoding of Shusha font
>  font encoding of DV-Suresh font
>  font encoding of  DV-Yogesh font
>  font encoding of Mangal font  (that is just OpenType, is it ?)
> 
> Also conversion between
>  UTF-16 and ISCII

Check out these ongoing projects:

- ISCIIlib (http://www.cse.iitk.ac.in/users/isciig/isciilib/main.html)
- iconverter (http://www.cse.iitk.ac.in/users/isciig/iconverter/main.html)

Both hosted at the Linux Technology Development for Indian Languages
(http://www.cse.iitk.ac.in/users/isciig/).

_ Marco




RE: Character identities

2002-10-25 Thread Marco Cimarosti
Peter Constable wrote:
> >> then *any* font having a unicode cmap is a Unicode font.
> >
> >No, not if the glyps (for the "supported characters") are
> >inappropriate for the characters given.
> 
> Kent is quite right here. There are a *lot* of fonts out 
> there with Unicode
> cmaps that do not at all conform to the Unicode standard  ---
> custom-encoded (some call them "hacked") fonts, usually abusing the
> characters that make up Windows cp1252.

IMHO, you are confusing two very different things here:

1) Assigning arbitrary glyphs to some Unicode characters. E.g., assigning
the "$" character to long S; the ASCII letters to Greek letters; the whole
Latin-1 range to Devanagari characters, etc.

2) Choosing strange or unorthodox glyph variants for some Unicode
characters.

The "hacked fonts" you mention are case (1); what is being discussed in this
thread is case (2). Like it or not, superscript e *is* the same diacritic
that later become "¨", so there is absolutely no violation of the Unicode
standard. Of course, this only applies German.

The fact that umlaut and dieresis have been unified in Unicode makes such a
variant glyph only applicable to a font targeted at German. You could not
use that font to, e.g., typeset English or French, because the "¨" in
"coöperation" or "naïve" is a dieresis, not an umlaut sign.

There are other cases out there of Unicode fonts suitable for Chinese but
not Japanese, Italian but not Polish,  Arabic but not Urdu, etc. Why should
a Unicode font suitable for German but not for English be any worse?

_ Marco




RE: Character identities

2002-10-24 Thread Marco Cimarosti
Kent Karlsson wrote:
> And it is easy for Joe User to make a simple (visual...)
> substitution cipher by just swiching to a font with the
> glyphs for letters (etc.) permuted.  Sure!  I think it
> would be a bad idea to call it a "Unicode font" though...
> (That it technically may have a "unicode cmap" is beside
> my point.)

The only meaning that I can attach to the expression "Unicode font" is a
pan-Unicode font: a font which covers all the scripts in Unicode.

If this is what you mean, then displaying "ä" as an "a^e" is clearly not a
good idea. But neither would choosing Fraktur glyphs be a good idea! How can
you have Fraktur IPA!? Fraktur Pinyin!? Fraktur Devanagari!? Fraktur
Arabic!? In general, for a pan-Unicode font, it would be a good idea to have
no noticeable difference from the glyphs used in the Unicode book.

But if by "Unicode font" you just mean a font which is compliant with the
Unicode standard, but only supports one or more of the scripts, then *any*
font having a unicode cmap is a Unicode font. And also many fonts *not*
having a Unicode cmap are, provided that something inside or outside the
font knows how to pick up the right glyphs.

In this sense, what is or is not appropriate depends on the font's style and
targeted usages and languages: there are fonts which don't have dots over
"i" and "j"; fonts where U+0059 and U+03A5 look different; fonts where
U+0061, U+0251, U+03B1 and U+FF41 look identical; fonts where capital and
small letters look identical...

Why can't there be a Fraktur font where "ä" and "a^e" look identical, if
this is appropriate for that typographical style and for the usages and
languages intended for the font?

Ciao.
Marco




RE: Sorting on number of strokes for Traditional Chinese

2002-10-16 Thread Marco Cimarosti

John H. Jenkins wrote:
> >> I wonder Unicode provide us a way to do sorting on number of
> >> strokes for Traditional Chinese characters.
> 
> The Unihan database has total stroke count for many (but not all) 
> characters.  It may provide an adequate first-order set of data for a 
> pure stroke-based ordering in TC.

The next step is knowing *which* strokes make up each character, in order
to properly sort characters having the same stroke count.

Is there any online source for such data? Even for smaller sets than Unicode
CJK.
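
In the meantime, the first-order part is easy enough. A toy sketch (the
helper names are mine, and the total stroke counts, which would come from
the Unihan database, are inlined for just four characters):

    #include <stdlib.h>

    /* Toy excerpt of Unihan total stroke counts. */
    static int total_strokes(unsigned int cp)
    {
        switch (cp) {
            case 0x4E00: return 1;   /* U+4E00, 1 stroke   */
            case 0x4EBA: return 2;   /* U+4EBA, 2 strokes  */
            case 0x5C71: return 3;   /* U+5C71, 3 strokes  */
            case 0x99AC: return 10;  /* U+99AC, 10 strokes */
            default:     return 0;   /* unknown            */
        }
    }

    /* Compare by stroke count; ties fall back to code point order, since
       proper second-order sorting needs the per-stroke data asked for
       above. */
    static int cmp_strokes(const void *a, const void *b)
    {
        unsigned int x = *(const unsigned int *)a;
        unsigned int y = *(const unsigned int *)b;
        int d = total_strokes(x) - total_strokes(y);

        return d ? d : (x < y ? -1 : x > y);
    }

    void stroke_sort(unsigned int *chars, size_t n)
    {
        qsort(chars, n, sizeof *chars, cmp_strokes);
    }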

_ Marco




RE: is this a symbol of anything? CJK?

2002-10-11 Thread Marco Cimarosti

John Delacour wrote:
> At 3:48 pm -0600 10/10/02, John H. Jenkins wrote:
> 
> >I think it's a variant turtle ideograph.  :-)
> >
> >(Nothing bad, so far as I know.)
> 
> Hmm.  Even without looking at the character it sounds very risky to 
> me and is likely to be extremely offensive. Turtles eggs etc.?

I hope it's not "horse eggs".

_ Marco




RE: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Marco Cimarosti

Radovan Garabik wrote:
> Google is your friend :-)
> "i18n" is first mentioned in USENET on 30 nov 1989,

Cute, I didn't imagine Google's archives went that far back!

BTW, the first mention of Unicode on Usenet predates it by eight days:

Subject: Re: ASCII for national characters
Newsgroups: comp.std.internat
From: Donn Terry
Date: 1989-11-22 10:43:42 PST
(http://groups.google.com/groups?selm=932%40hpfcdc.HP.COM)
| [...]
| UNICODE:  this isn't a standard but is proposed.  Unifies the Han
|  character sets in the same way as the Latin ones (but with
|  obviously a much bigger payback because of the size).  Fixed
|  length 16 bits.  This fixes the length in characters vs.  length
|  in bytes issue.  (The issue of length in display space is
|  inherently harder because characters do vary in width in natural
|  usage in many phonetic alphabets, as well as in the ideographic
|  ones.  See Arabic and Hindi where the constant-width usage is
|  considered "pretty awful", albeit readable.  (Even in English,
|  good typesetting is not constant width.))
| [...]

The same message also says something about a competing standard:

| [...]
| ISO10646: 32-bit everything code.  Treats the various Han character sets
|  as distinct character sets for each national usage, but unifies the
|  Latin characters into a single set.  Variable length coding possible
|  to reduce space.  Can degenerate to (something close to) 8859.
| [...]

_ Marco




FW: Indic language fonts released under GPL by Akruti

2002-10-10 Thread Marco Cimarosti

For everybody's info.

The fonts are designed for "hack encoding", not for Unicode. But the glyphs
look nice, and they are free and GPL-licensed!

Hopefully, some good soul will add all the OpenType stuff to them, sooner
or later.

_ Marco

-Original Message-
Date: Tue, 08 Oct 2002 16:34:51 -
From: "Baiju M" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Indic language fonts released under GPL by Akruti

Hi all,
Cyberscape multimedia limited has released Indic language 
TrueType
Fonts (TTF) under GNU General Public License on 2nd October. The
fonts can be downloaded from http://www.akruti.com/freedom/

--
Baiju M




RE: ISO 8859-11 (Thai) cross-mapping table

2002-10-08 Thread Marco Cimarosti

John Aurelio Cowan wrote:
> Marco Cimarosti scripsit:
> > Talking about the format of mapping tables, I always 
> > wondered why not using ranges. In the case of ISO
> > 8859-11, the table would become as compact as
> > three lines: 
> 
> Well, that wins for 8859-1 and 8859-11 and ISCII-88, where Unicode
> copied existing layouts precisely.  But it wouldn't help other 8859-x
> much if at all,

All 8859 tables would be more succinct.

Non-Latin sections use contiguous ranges of letters in alphabetical order
or, at any rate, in the same order used by Unicode; this is also true for
most other non-ISO charsets.

Latin sections are a worse case, but they still benefit slightly, because
characters shared with Latin-1 stay in the same positions.

> and it requires binary search rather than direct
> array access, which would be a terrible lossage in CJK, where the
> real costs are.

I agree. In the case of CJK it simply doesn't pay.

_ Marco




RE: ISO 8859-11 (Thai) cross-mapping table

2002-10-08 Thread Marco Cimarosti

Kenneth Whistler wrote:
> Elliotte Harold asked:
> 
> > The Unicode data files at 
> > http://www.unicode.org/Public/MAPPINGS/ISO8859/ do not 
> include a mapping 
> > for ISO-8859-11, Thai. Is there any particular reason for this? 
> 
> Just that nobody got around to submitting and posting one.
> 
> Since there was a lot of discussion about this over the weekend,
> I took it upon myself to create and post one in the same format
> as the other ISO8859 tables.

Thanks!

Talking about the format of mapping tables, I have always wondered why
ranges are not used. In the case of ISO 8859-11, the table would become as
compact as three lines:

0x00..0xA0  0x0000..0x00A0  # NULL..NO-BREAK SPACE
0xA1..0xDA  0x0E01..0x0E3A  # THAI CHARACTER KO KAI..THAI CHARACTER PHINTHU
0xDF..0xFB  0x0E3F..0x0E5B  # THAI CURRENCY SYMBOL BAHT..THAI CHARACTER KHOMUT

This is probably what most APIs already do at the binary level, where a
bidirectional ISO 8859-11 table can be as compact as 12 bytes:

struct MapSbcs
{
    unsigned char  SbcsChar;  /* first byte of the range               */
    unsigned short BmpChar;   /* corresponding first BMP code point    */
    unsigned char  Count;     /* offset of the last byte from SbcsChar */
};
struct MapSbcs Map8859_11 [] =
{
    { 0x00, 0x0000, 0xA0 },
    { 0xA1, 0x0E01, 0x39 },
    { 0xDF, 0x0E3F, 0x1C }
};
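
A lookup over such a range table is then straightforward. A sketch (the
function name is mine; the linear scan is fine for three entries, while a
big table would want a binary search instead):

    /* Map one ISO 8859-11 byte to its BMP code point; returns U+FFFD
       (REPLACEMENT CHARACTER) for the undefined bytes 0xDB..0xDE and
       0xFC..0xFF. */
    unsigned short Sbcs2Bmp(const struct MapSbcs *map, int entries,
                            unsigned char c)
    {
        int i;

        for (i = 0; i < entries; ++i)
            if (c >= map[i].SbcsChar && c <= map[i].SbcsChar + map[i].Count)
                return map[i].BmpChar + (c - map[i].SbcsChar);
        return 0xFFFD;
    }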

_ Marco




RE: [ANN] World Address Project starts and relies on Unicode heavily

2002-10-07 Thread Marco Cimarosti

Carl W. Brown wrote:
> Marco,
> Things are a bit more complicated.  The address should be in 
> the format & language of the recipient but the country
> should be in the language and positioned according to the
> sending country.

Er... Have I denied this?

> Unicode is not a complete solution.  Yao mentioned Chinese 
> addresses.  These might be in Traditional or Simplified
> font depending on destination.

Unicode has both Traditional and Simplified characters, and so do many
Unicode fonts. And the destination doesn't normally change in an address
(except in science fiction novels featuring wheel-mounted nomad cities).

_ Marco




RE: [ANN] World Address Project starts and relies on Unicode heavily

2002-10-07 Thread Marco Cimarosti

> Dear all,
> 
> World Address Project promotes an idea of utilizing Unicode on online
> shopping websites for solving the international shipping 
> address problem.
> This will greatly benefit both customers and online businesses.
> 
> Please take a look at http://www.bytecool.com/wap/ and feel 
> free to send
> questions or comments.
> 
> Best Regards,
> Yao Ziyuan

A welcome initiative! I especially hope that your FAQ, when it is ready,
will contain useful suggestions.

I am really quite sick of those forms that, after I have specified that my
country is Italy, force me to fill in my "state"! I usually have to select
"Michigan", which has the same acronym ("MI") as the province of Milan. I
hope I'll never move to the province of Florence, as there is no state in
the US whose acronym is "FI"...

Also, I hope they'll stop refusing forms where I haven't filled the "middle
initial" field.

Ciao.
Marco "X" Cimarosti




RE: Sporadic Unicode revisited

2002-10-02 Thread Marco Cimarosti

Martin Kochanski wrote:
> To come back to the old thread about typing arbitrary Unicode 
> characters in situations where it's not worth installing a 
> special keyboard, I thought that people might be interested 
> in the hexadecimal Alt+ numeric-keypad solution that we're 
> implementing. As implementations of ISO 14755 go, it seems 
> reasonably simple; but comments would be welcome. You can 
> find a description of the input method at 
> http://www.cardbox.net/client/unicode.htm, near the bottom of 
> the page. Remember that this is a fallback method and not 
> intended for long runs of text!
> 
> If there are any inaccuracies or obscurities in that page, it 
> would also be good to hear about those. 
> 
> (Well, not good exactly, but useful).

Nice! Of course, the drawback is that one must have a Unicode chart at hand,
but this is clearly much better than having to convert codes to decimal.

I wonder whether anyone considered implementing an IME based on mnemonic
codes, such as those described in RFC 1345
(http://www.faqs.org/rfcs/rfc1345.html)?
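
Such an IME would essentially be a lookup from RFC 1345's two-character
mnemonics to code points. A toy sketch (the three mnemonics below are real
RFC 1345 entries, but the table is obviously just an excerpt):

    #include <string.h>

    /* Map an RFC 1345 two-character mnemonic to a code point;
       returns 0 when the mnemonic is not in the table. */
    unsigned int rfc1345_lookup(const char *mn)
    {
        static const struct { const char *mn; unsigned int cp; } table[] =
        {
            { "a:", 0x00E4 },  /* LATIN SMALL LETTER A WITH DIAERESIS */
            { "o:", 0x00F6 },  /* LATIN SMALL LETTER O WITH DIAERESIS */
            { "SE", 0x00A7 }   /* SECTION SIGN */
        };
        size_t i;

        for (i = 0; i < sizeof table / sizeof table[0]; ++i)
            if (strncmp(mn, table[i].mn, 2) == 0)
                return table[i].cp;
        return 0;
    }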

_ Marco




Omega + upsilon ligature?

2002-10-02 Thread Marco Cimarosti

I am trying to identify a Greek glyph found in an ancient Latin text. I have
not seen what it looks like, but it has been described to me as an "8" with
the top circle opened.

The sign was in a word looking like "8ρων" ("8rôn") and which, according to
the text, corresponds to Latin "urina". If I understand correctly, the text
also says that this sign is a diphthong which in Doric was substituted by a
plain "ω" (omega): "Nam olem a Graecis per <8> diphthongum scribebatur, quae
Dorice in ω solet commutari".

Therefore, I tentatively identified the word as "ωυρων" ("ôurôn"), and the
unknown glyph as an "ωυ" ligature ("ôu": omega + upsilon).

Does anyone know whether such a ligature actually existed in old typography?
And was it anything like an open "8"?

Thanks in advance for any info.

_ Marco




<    1   2   3   4   5   6   7   8   9   >