Re: Qamats Qatan (was Response to Everson Phoenician and why June 7?)

2004-05-19 Thread John Hudson
Jony Rosenne wrote:

> > *Except by Jony, who is always encouraging us to use markup
> > to make distinctions.
>
> I don't recall saying anything like this in this Phoenician discussion.

Acknowledged. My point was not about that discussion in particular, but about the generic 
question of to what degree plain-text is a requirement, regardless of what one wants to do 
within it. Your frequent refrain that distinctions of shape, for what you consider to be 
the same character (and note that I am not agreeing or disagreeing with any particular 
judgement), should be handled in 'mark-up' presupposes something other than plain-text in 
terms of displaying that distinction. You frequently remind us that there are distinctions 
that are useful to some people, desirable in some circumstances, but which do not 
constitute a *requirement* in plain-text. Fair enough. For this same reason, I don't 
automatically accept the argument, made by Michael earlier today, that 'There is a 
requirement for distinction for X in plain-text'.

On what basis do we decide that X is necessary in plain-text while Y should be done with 
mark-up or some other 'higher level protocol'?

John Hudson


RE: Qamats Qatan (was Response to Everson Phoenician and why June 7?)

2004-05-19 Thread Jony Rosenne


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of John Hudson
> Sent: Thursday, May 20, 2004 1:08 AM
> To: Michael Everson
> Cc: [EMAIL PROTECTED]
> Subject: Re: Response to Everson Phoenician and why June 7?
> 
> 

...

> 
> In discussions of whether to encode individual 
> characters/glyphs -- and now, it seems, 
> scripts/styles --, much seems to be made of whether there is 
> a requirement to make a 
> distinction in plain-text, while the question of whether 
> there is a requirement to use 
> plain-text in the first place gets asked less often.*
> 
> *Except by Jony, who is always encouraging us to use markup 
> to make distinctions.
> 

I don't recall saying anything like this in this Phoenician discussion.

I only say so when I believe it's true and relevant to Hebrew. It's all very
nice to desire different shapes for different usages of the same character,
but one must also think about the multitude who do not care or know or
desire the distinction.

> 
> John Hudson
> 
> 
> 





Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread John Hudson
Ernest Cline wrote:

> I would be very surprised if there were such a cybercafe.  One
> that had both a Hebrew-Phoenician and a Hebrew-Hebrew font
> with the Hebrew-Phoenician as the default would be much easier
> to believe as a possibility.  Still, it is a valid point.  I think that if
> Phoenician were to be unified with Hebrew, it would probably
> behoove Unicode to establish variation sequences for Phoenician.
>
> Even with a separate Phoenician script, it might be a good idea
> to provide variation sequences that could be used to identify
> different script styles such as Paleo-Hebrew and Punic
> in the plain text.

This is not a practical use of variation sequences if, by this, you mean use of variation 
selectors. What are you going to do, add a variation selector after every single base 
character in the text? Are you expecting fonts to support the tiny stylistic variations 
between Phoenician, Moabite, Palaeo-Hebrew, etc. -- variations that are not even cleanly 
defined by language usage -- with such sequences?

Some people seem keen on variation selectors in the same way that others are keen on PUA: 
as a catch-all solution to non-existent problems.

John Hudson


Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread Ernest Cline



> [Original Message]
> From: John Jenkins <[EMAIL PROTECTED]>
>
> On May 19, 2004, at 5:07 PM, John Hudson wrote:
>
> > Michael, can you briefly outline the points regarding this 
> > 'requirement'? The only one that has been repeatedly referred to in 
> > this too-long discussion is the Tetragrammaton usage; I'm not sure 
> > whether that constitutes a requirement for plain-text or not. What are 
> > the other points?
> >
>
> You go down to your local cybercafe to read your email from your 
> grandmother telling you all about your nephew's bar-mitzvah.  
> Unfortunately, your local cybercafe has no modern Hebrew (or Yiddish) 
> installed, but they *do* have a Phoenician one.  You cannot, as a 
> result, even tell what language your grandmother is writing you in, let 
> alone what it means.

I would be very surprised if there were such a cybercafe.  One
that had both a Hebrew-Phoenician and a Hebrew-Hebrew font
with the Hebrew-Phoenician as the default would be much easier
to believe as a possibility.  Still, it is a valid point.  I think that if
Phoenician were to be unified with Hebrew, it would probably
behoove Unicode to establish variation sequences for Phoenician.

Even with a separate Phoenician script, it might be a good idea
to provide variation sequences that could be used to identify
different script styles such as Paleo-Hebrew and Punic
in the plain text.





Re: ISO 15924 draft fixes

2004-05-19 Thread Michael Everson
At 03:28 +0200 2004-05-20, Philippe Verdy wrote:

> It was in the previous list (see the online HTML table 2).

What does that refer to?

> Who decides for the addition of scripts in ISO-15924?

The ISO 15924 RA-JAC.

> I thought there was a separate technical committee
> and that you were just the bookkeeper of the
> decisions made by this subcommittee.

With regard to Coptic, and the need to sort out 
the initial difficulties we are having, it seems 
prudent that I do what is necessary to correct 
faults. It is unlikely that the RA-JAC will 
object to this.

> It can't be Unicode's UTC alone, as there are
> already codes for bibliographic references that
> are not (and never will be) encoded separately
> in Unicode, so I suppose that there are librarian
> or publisher members with whom you have to
> discuss, independently of the work of Unicode,
> which should only be the registrar for these
> codes. Maybe there's still no formal procedure,
> and for now the codes are maintainable without
> lots of administration.

Read the standard.

> Do you want a script that generates HTML tables from the reference text file?

No. We will handle that in due course.

> One final note: there's still a missing closing parenthesis in a French name <<
> latin (variante brisée >> for the Fraktur script.

I think that has been corrected by now.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: ISO 15924 draft fixes

2004-05-19 Thread Philippe Verdy
From: "Michael Everson" <[EMAIL PROTECTED]>
> >- Where is this line?:
> > Syloti Nagri;Sylo;316;sylotî nâgrî;;2004-09-01
>
> A new script? Oh, it's in the old file and not in
> the new one? It, Coptic, and Phags-pa need to be
> in the list (they are all under ballot).

It was in the previous list (see the online HTML table 2).
Who decides for the addition of scripts in ISO-15924? I thought there was a
separate technical committee and that you were just the bookkeeper of the
decisions made by this subcommittee. It can't be Unicode's UTC alone, as there
are already codes for bibliographic references that are not (and never will be)
encoded separately in Unicode, so I suppose that there are librarian or
publisher members with whom you have to discuss, independently of the work of
Unicode, which should only be the registrar for these codes. Maybe there's
still no formal procedure, and for now the codes are maintainable without lots
of administration.

Do you want a script that generates HTML tables from the reference text file?
I'm not an expert in Perl, but my knowledge of PHP or "awk" is enough to create
it.
Or maybe a simple JavaScript could generate the presentation in browsers.
I suggest you use a spreadsheet for now to allow sorting or moving columns.
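
For what it's worth, the conversion being offered here is small. What follows is only a rough sketch in C, assuming the reference file holds one semicolon-separated record per line, like the Syloti Nagri line quoted above; the function name record_to_html_row is made up for illustration and is not part of any ISO 15924 tooling. Empty fields are kept as empty cells, which is why the line is split by hand rather than with strtok().

```c
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to out (keeping it NUL-terminated); 0 on overflow. */
static int append(char *out, size_t cap, size_t *len, const char *s, size_t n)
{
    if (*len + n >= cap) return 0;
    memcpy(out + *len, s, n);
    *len += n;
    out[*len] = '\0';
    return 1;
}

/* Append one character, escaping the three HTML-significant ones. */
static int append_escaped(char *out, size_t cap, size_t *len, char c)
{
    switch (c) {
    case '&': return append(out, cap, len, "&amp;", 5);
    case '<': return append(out, cap, len, "&lt;", 4);
    case '>': return append(out, cap, len, "&gt;", 4);
    default:  return append(out, cap, len, &c, 1);
    }
}

/*
 * Hypothetical helper: turn one semicolon-separated record into an HTML
 * table row. Reading stops at the end of the string or at a line break.
 * Returns 1 on success, 0 if the output buffer is too small.
 */
int record_to_html_row(const char *line, char *out, size_t cap)
{
    size_t len = 0;

    if (cap == 0) return 0;
    out[0] = '\0';
    if (!append(out, cap, &len, "<tr><td>", 8)) return 0;
    for (; *line != '\0' && *line != '\n' && *line != '\r'; ++line) {
        int ok = (*line == ';')
            ? append(out, cap, &len, "</td><td>", 9)   /* field boundary */
            : append_escaped(out, cap, &len, *line);   /* field content */
        if (!ok) return 0;
    }
    return append(out, cap, &len, "</td></tr>", 10);
}
```

Bytes above 0x7F are copied through untouched, so the accented French names survive as long as the page declares the same encoding as the text file.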

One final note: there's still a missing closing parenthesis in a French name <<
latin (variante brisée >> for the Fraktur script.




Re: problems in Public Review 33

2004-05-19 Thread Ernest Cline
From: Philippe Verdy 

> Are these permanently assigned non-characters
> encodable in any UTF or in CESU-8?

I would say they are.  While they are not available
for transmission of data, they are perfectly legal
for internal use.  Indeed, such internal use is
the raison d'etre of the block of noncharacters
at FDD0..FDEF.  An implementation may wish to
either allow or disallow the transformation of
noncharacters depending upon how it uses
those codepoints.
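
An implementation that wants to make that choice first has to recognise the code points in question. A minimal sketch in C, assuming the usual set of 66 noncharacters (FDD0..FDEF plus the last two code points of each of the 17 planes); the name is_noncharacter is illustrative only:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if cp is one of the 66 noncharacters:
 * U+FDD0..U+FDEF, or U+xFFFE / U+xFFFF in any plane 0..16.
 */
bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;              /* outside the code space */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
    return (cp & 0xFFFE) == 0xFFFE;               /* low 16 bits are FFFE or FFFF */
}
```

Whether a converter then rejects, passes through, or replaces such values is exactly the policy decision described above; the predicate itself takes no position.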





Re: problems in Public Review 33 UTF Conversion Code Update

2004-05-19 Thread Kenneth Whistler
/|/|ike (or |\|\ike) responded to Philippe:

> > However I feel it's not legal (or really not recommended) to encode non-
> > character codepoints xFFFE-xFFFF where x is any plane number. So the rules
> > need to be a bit more detailed to exclude them.
> 
>   Why do we need special rules to not encode characters that don't
> exist?  

Please, everybody, before we start another pointless thread,
examine the actual definition of UTF-8 and the rationale
for an encoding scheme.

UTF-8 must be able to represent every Unicode scalar value --
and that *includes* noncharacter code points.

D28 Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.

D29 A Unicode encoding form assigns each Unicode scalar value to a
unique code unit sequence.

Before you all start shooting from the hip about UTF-8 on the
list, please read (and understand) the normative definitions of
these things in the standard.

--Ken

P.S. Whoever (and whatever) is starting to prepend "[BULK]" to
thread topics, would you cease and desist? ;-)
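
To make the point of D28 and D29 concrete, here is a minimal sketch of an encoder in C. It is not the ConvertUTF.c code under review, and the name utf8_encode is made up for illustration. The only values it refuses are those that are not Unicode scalar values (surrogates and anything above 10FFFF); noncharacters such as U+FFFE go through like any other scalar value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode scalar value as UTF-8.
 * Returns the number of bytes written (1..4), or 0 if cp is not a
 * scalar value (a surrogate code point, or greater than 0x10FFFF).
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With this sketch U+FFFE comes out as EF BF BE and U+10FFFF as F4 8F BF BF, while U+D800 is rejected.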




Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread John Jenkins
On May 19, 2004, at 5:07 PM, John Hudson wrote:

> Michael, can you briefly outline the points regarding this
> 'requirement'? The only one that has been repeatedly referred to in
> this too-long discussion is the Tetragrammaton usage; I'm not sure
> whether that constitutes a requirement for plain-text or not. What are
> the other points?

You go down to your local cybercafe to read your email from your 
grandmother telling you all about your nephew's bar-mitzvah.  
Unfortunately, your local cybercafe has no modern Hebrew (or Yiddish) 
installed, but they *do* have a Phoenician one.  You cannot, as a 
result, even tell what language your grandmother is writing you in, let 
alone what it means.

Of course, this criterion is difficult to apply to two varieties of 
writing separated by thousands of years -- and it might behoove the UTC 
to discuss the problems involved -- but if we accept minimum legibility 
as a factor in deciding when to unify/separate, I think it's a valid 
one.


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



RE: [BULK] - Re: problems in Public Review 33 UTF Conversion Code Update

2004-05-19 Thread Mike Ayers






From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Philippe Verdy
Sent: Wednesday, May 19, 2004 4:21 PM


> However I feel it's not legal (or really not recommended) to encode non-
> character codepoints xFFFE-xFFFF where x is any plane number. So the rules
> need to be a bit more detailed to exclude them.


    Why do we need special rules to not encode characters that don't exist?  



/|/|ike





Is there a better term than metascript for what I am thinking of?

2004-05-19 Thread Ernest Cline
It's not an actual attested English word, but the term "metascript"
comes reasonably close to a concept I would like to express
in a proposal I am preparing.  A "metascript", as I am defining it,
is a script such as Latin, Cyrillic or Arabic, that has been extended
from a common core in a wide variety of ways to serve the needs
of a wide variety of languages.  A resulting aspect of metascripts
is that they contain far more characters than are needed for any
single use of the script.  I find the concept useful in explaining why
I have made certain decisions in my proposal, but would prefer
to use a standard term for the concept if there is one.





Re: ISO 15924 draft fixes

2004-05-19 Thread Michael Everson
At 01:26 +0200 2004-05-20, Philippe Verdy wrote:

> I note also that the list of changes (the HTML file in your archive) does not
> include the change of orthography in English names for consonants with dots
> below (such as Malayalam). As this ISO-15924 standard should make the English
> and French names unambiguous, their orthography is important.

I understand that there are many problems with the online files; I 
made a comparison only with the plain-text files, and Malayalam was 
not spelled differently in that file, so I judged it irrelevant to 
the task of correcting the basic database.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: ISO 15924 draft fixes

2004-05-19 Thread Michael Everson
At 01:08 +0200 2004-05-20, Philippe Verdy wrote:

> I see some differences
> - For Georgian, your new file contains only:
> Georgian (Mkhedruli);Geor;240;géorgien (mkhédrouli);Georgian;2004-05-18
> But the previous version also contained in one of the online tables:
> Georgian (Asomtavruli);Geoa;242;géorgien (assomtavrouli);Georgian;2004-01-05

That's correct. Asomtavruli has been deleted for now.

> - Where is this line?:
> Syloti Nagri;Sylo;316;sylotî nâgrî;;2004-09-01

A new script? Oh, it's in the old file and not in 
the new one? It, Coptic, and Phags-pa need to be 
in the list (they are all under ballot).

> Limbu has been adjusted to a more appropriate numeric code within South-Asian
> scripts (401 to 336).

Error corrected.

> I also think that the removal of duplicate rows for English or French name
> aliases was a good decision (after all the aliases are already listed between
> parentheses).

No, it would allow a huge number of aliases. 
People can search the online files with command-F 
or control-F.

> I also think that splitting the line for the start and end codes
> of private scripts was a good idea.

It wasn't mine. I forget whose it was, but it 
makes the tables print more nicely.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: ISO 15924 draft fixes

2004-05-19 Thread Philippe Verdy
I note also that the list of changes (the HTML file in your archive) does not
include the change of orthography in English names for consonants with dots
below (such as Malayalam). As this ISO-15924 standard should make the English
and French names unambiguous, their orthography is important.

- Original Message - 
From: "Michael Everson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 10:40 PM
Subject: ISO 15924 draft fixes


> The Registrar wishes to thank everyone who has taken an interest in
> the ISO 15924 data pages, and regrets the imperfections which are
> contained there. I am not sure how we will manage the generation of
> the pages, but it is clear that the base should be the plain-text
> document.
>
> I have made changes to the plain-text document and placed it, a draft
> Changes page, and the original plain-text document available at
> http://www.unicode.org/iso15924/iso15924-fixes.zip




Re: problems in Public Review 33 UTF Conversion Code Update

2004-05-19 Thread Philippe Verdy



From: Frank Yung-Fong Tang wrote:
> It should be:
> Legal UTF-8 sequences are:
>
> 1st     2nd     3rd     4th     Codepoints
> 00-7F                             0000-  007F
> C2-DF   80-BF                     0080-  07FF
> E0      A0-BF   80-BF             0800-  0FFF
> E1-EC   80-BF   80-BF             1000-  CFFF
> ED      80-9F   80-BF             D000-  D7FF
> EE-EF   80-BF   80-BF             E000-  FFFF
> F0      90-BF   80-BF   80-BF    10000- 3FFFF
> F1-F3   80-BF   80-BF   80-BF    40000- FFFFF
> F4      80-8F   80-BF   80-BF   100000-10FFFF
 
However I feel it's not legal (or really not recommended) to encode
non-character codepoints xFFFE-xFFFF where x is any plane number. So the
rules need to be a bit more detailed to exclude them.
 
Are these permanently assigned non-characters 
encodable in any UTF or in CESU-8?
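
For reference, the table quoted above translates fairly directly into a checker. The following is a rough sketch in C, not the isLegalUTF8 routine from ConvertUTF.c; the names utf8_sequence_length and utf8_is_well_formed are made up for illustration. The table constrains only the first two bytes specially, so noncharacters such as U+FFFE (EF BF BE) pass, while surrogates (ED A0..BF xx) and overlong forms do not.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if b lies in the inclusive range [lo, hi]. */
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/*
 * Length (1..4) of the well-formed UTF-8 sequence starting at s,
 * or 0 if the first bytes of s (at most n) do not form one.
 */
size_t utf8_sequence_length(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* default range for the 2nd byte */
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;
    else if (in_range(s[0], 0xC2, 0xDF)) len = 2;
    else if (in_range(s[0], 0xE0, 0xEF)) len = 3;
    else if (in_range(s[0], 0xF0, 0xF4)) len = 4;
    else return 0;                        /* 80..C1 and F5..FF never lead */

    if (s[0] == 0xE0) lo = 0xA0;          /* rules out overlong 3-byte forms */
    else if (s[0] == 0xED) hi = 0x9F;     /* rules out D800..DFFF */
    else if (s[0] == 0xF0) lo = 0x90;     /* rules out overlong 4-byte forms */
    else if (s[0] == 0xF4) hi = 0x8F;     /* rules out values above 10FFFF */

    if (n < len) return 0;                /* truncated sequence */
    if (!in_range(s[1], lo, hi)) return 0;
    for (i = 2; i < len; ++i)
        if (!in_range(s[i], 0x80, 0xBF)) return 0;
    return len;
}

/* True if the whole buffer is a concatenation of well-formed sequences. */
bool utf8_is_well_formed(const unsigned char *s, size_t n)
{
    while (n > 0) {
        size_t len = utf8_sequence_length(s, n);
        if (len == 0) return false;
        s += len;
        n -= len;
    }
    return true;
}
```

Excluding noncharacters would need a separate check on the decoded value; the byte-range table alone does not express it.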
 


Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread John Hudson
Michael Everson wrote:

> > There are already encodings
> > suitable for all varieties of Northwest Semitic
> > scripts.  One can legitimately argue, as some have,
> > that there are still some problems with the Hebrew
> > and Syriac encodings, but not that we need anything
> > more for the other NW Semitic languages other than
> > some nice FONTS!
>
> Which would not address the plain-text requirement to distinguish the
> scripts qua scripts.

Michael, can you briefly outline the points regarding this 'requirement'? The only one 
that has been repeatedly referred to in this too-long discussion is the Tetragrammaton 
usage; I'm not sure whether that constitutes a requirement for plain-text or not. What are 
the other points?

In discussions of whether to encode individual characters/glyphs -- and now, it seems, 
scripts/styles --, much seems to be made of whether there is a requirement to make a 
distinction in plain-text, while the question of whether there is a requirement to use 
plain-text in the first place gets asked less often.*

*Except by Jony, who is always encouraging us to use markup to make distinctions.

John Hudson



Re: ISO 15924 draft fixes

2004-05-19 Thread Philippe Verdy
I see some differences

- For Georgian, your new file contains only:
Georgian (Mkhedruli);Geor;240;géorgien (mkhédrouli);Georgian;2004-05-18
But the previous version also contained in one of the online tables:
Georgian (Asomtavruli);Geoa;242;géorgien (assomtavrouli);Georgian;2004-01-05

- Where is this line?:
Syloti Nagri;Sylo;316;sylotî nâgrî;;2004-09-01

Limbu has been adjusted to a more appropriate numeric code within South-Asian
scripts (401 to 336).

I also think that the removal of duplicate rows for English or French name
aliases was a good decision (after all the aliases are already listed between
parentheses). I also think that splitting the line for the start and end codes
of private scripts was a good idea.

- Original Message - 
From: "Michael Everson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 10:40 PM
Subject: ISO 15924 draft fixes


> The Registrar wishes to thank everyone who has taken an interest in
> the ISO 15924 data pages, and regrets the imperfections which are
> contained there. I am not sure how we will manage the generation of
> the pages, but it is clear that the base should be the plain-text
> document.
>
> I have made changes to the plain-text document and placed it, a draft
> Changes page, and the original plain-text document available at
> http://www.unicode.org/iso15924/iso15924-fixes.zip
>
> I would appreciate it if interested persons could look this over and
> inform me if they find any further discrepancies between the two
> which are worth troubling about. Then we will proceed to generate the
> other files.
>
> I deleted some duplicate lines: Ethiopic was on two lines, under
> Ethiopic and under Ge'ez. It seemed inappropriate to burden the
> tables with such duplication.
>
> I added Coptic unilaterally.
> -- 
> Michael Everson * * Everson Typography *  * http://www.evertype.com
>




RE: [BULK] - Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread Mike Ayers






> Yer ol' pal,
>  Youtie


    The real question here is "what took you so long"?



/|/|ike





Response to Everson Phoenician and why June 7?

2004-05-19 Thread E. Keown
   Elaine Keown
   Tucson

Hi,

I include below the response of 
Prof. Stephen A. Kaufman, one of the world's most
famous Aramaists, to the Everson Phoenician proposal:

Dr. Stephen A. Kaufman wrote (on the ANE list
recently):

> Anyone who thinks there has to be a separate 
> encoding for Phoenician either does not understand 
> Unicode or (and probably "and") does not understand
> what a glyph is.  There are already encodings 
> suitable for all varieties of Northwest Semitic 
> scripts.  One can legitimately argue, as some have,
> that there are still some problems with the Hebrew 
> and Syriac encodings, but not that we need anything
> more for the other NW Semitic languages other than
>some nice FONTS!
>
>Steve Kaufman 

Why did Debbie suggest June 7 as the latest date for
responses?  

Elaine







Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread Michael \(michka\) Kaplan
I would respectfully suggest that Dr. Stephen A. Kaufman will need to come up
with a more convincing or (and probably and) professional argument than this
one if he wants it to be taken seriously by people who have a very good
understanding of both Unicode and glyphs, and who further have a serious set
of requirements that suggest that Dr. Kaufman's needs may be the same as the
needs of others who would like the script to be encoded.

I doubt neither Dr. Kaufman's expertise nor reputation, but it is clear that
the actual stated requirements have not been discussed, nor has any specific
problem inherent in the encoding been stated by him. He should consider that,
if convincing arguments sit on one side and his brief posting on the other,
his words are unlikely to sway the committee.


MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies


- Original Message - 
From: "E. Keown" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; "Deborah W. Anderson" <[EMAIL PROTECTED]>
Cc: "John Cowan" <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 1:54 PM
Subject: Response to Everson Phoenician and why June 7?


>Elaine Keown
>Tucson
>
> Hi,
>
> I include below the response of
> Prof. Stephen A. Kaufman, one of the world's most
> famous Aramaists, to the Everson Phoenician proposal:
>
> Dr. Stephen A. Kaufman wrote (on the ANE list
> recently):
>
> > Anyone who thinks there has to be a separate
> > encoding for Phoenician either does not understand
> > Unicode or (and probably "and") does not understand
> > what a glyph is.  There are already encodings
> > suitable for all varieties of Northwest Semitic
> > scripts.  One can legitimately argue, as some have,
> > that there are still some problems with the Hebrew
> > and Syriac encodings, but not that we need anything
> > more for the other NW Semitic languages other than
> >some nice FONTS!
> >
> >Steve Kaufman
>
> Why did Debbie suggest June 7 as a the latest date for
> responses?
>
> Elaine
>
>
>
>
> __
> Do you Yahoo!?
> SBC Yahoo! - Internet access at a great low price.
> http://promo.yahoo.com/sbc/
>
>




RE: Response to Everson Phoenician and why June 7?

2004-05-19 Thread Mike Ayers






> > Anyone who thinks there has to be a separate 
> > encoding for Phoenician either does not understand 
> > Unicode or (and probably "and") does not understand
> > what a glyph is.


    Was this meant to be a joke?



/|/|ike





Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread Youtie Effaight
Golly gee, all this Phoenecianan talk just makes me wanna sing & dance!
Yee-Haw!
Oh Lord let me flog yet another dead horse
I ain't got a life so I love it of course
Just hand me a whip and I will be so glad
So lord let me flog yet another dead horse!

Yer ol' pal,
Youtie




Re: Response to Everson Phoenician and why June 7?

2004-05-19 Thread Michael Everson
At 13:54 -0700 2004-05-19, E. Keown wrote:

> I include below the response of Prof. Stephen A. Kaufman, one of the
> world's most famous Aramaists, to the Everson Phoenician proposal:

I had seen his contribution already.

> > Anyone who thinks there has to be a separate
> > encoding for Phoenician either does not understand
> > Unicode or (and probably "and") does not understand
> > what a glyph is.

I am not in the least bit chastened or chagrined by this.

> > There are already encodings
> > suitable for all varieties of Northwest Semitic
> > scripts.  One can legitimately argue, as some have,
> > that there are still some problems with the Hebrew
> > and Syriac encodings, but not that we need anything
> > more for the other NW Semitic languages other than
> > some nice FONTS!

Which would not address the plain-text requirement to distinguish the 
scripts qua scripts.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Response to Everson Phoenician and why June 7?

2004-05-19 Thread Rick McGowan
Elaine asked:

> Why did Debbie suggest June 7 as a the latest date for
> responses?

Probably because that is the deadline for documents to be submitted for  
consideration at the upcoming UTC meeting. The issue will be discussed  
there, so anyone who wants to get their input into that meeting should do  
it soon.

Rick



ISO 15924 draft fixes

2004-05-19 Thread Michael Everson
The Registrar wishes to thank everyone who has taken an interest in 
the ISO 15924 data pages, and regrets the imperfections which are 
contained there. I am not sure how we will manage the generation of 
the pages, but it is clear that the base should be the plain-text 
document.

I have made changes to the plain-text document and placed it, a draft 
Changes page, and the original plain-text document available at 
http://www.unicode.org/iso15924/iso15924-fixes.zip

I would appreciate it if interested persons could look this over and 
inform me if they find any further discrepancies between the two 
which are worth troubling about. Then we will proceed to generate the 
other files.

I deleted some duplicate lines: Ethiopic was on two lines, under 
Ethiopic and under Ge'ez. It seemed inappropriate to burden the 
tables with such duplication.

I added Coptic unilaterally.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Vertical BIDI

2004-05-19 Thread Timothy Partridge
Philippe Verdy recently said:

> From: <[EMAIL PROTECTED]>

> > What's uncertain is whether a lr or a rl progression is favored, given the
> > paucity of evidence.  Michael favors lr progression.  There is no question
> > that the text is read BTT.

> This creates an interesting problem: Put in the same sentence Han (Chinese) and
> Mongolian words in a vertical layout (I don't think this is unlikely, as
> Mongolian is also spoken in China, and there's also a Chinese community in
> Mongolia). So Chinese ideographs will be laid out vertically from top to bottom
> (but not rotated, except for a few characters like ideographic punctuation marks
> or symbols), and Mongolian will be laid out from bottom to top in their normal
> stack orientation. Such a text is clearly bidirectional, so we would need BiDi
> processing to order glyphs correctly.

John's comment refers to Ogham. Mongolian goes top to bottom.

> Now try including some Latin words in this text (also not unlikely: there are
> lots of trademarks and people names that will need to be written with their
> normal Latin characters). If the text is presented vertically, there's a
> legitimate question of whether Latin should be rotated (but it will keep the Han
> flow direction.)

Latin and Cyrillic are rotated 90 degrees clockwise when mixed with
Mongolian in vertical lines. Presumably Arabic would be rotated 90 degrees
anti-clockwise. (The ancestor of Mongolian was, which is why the vertical
lines go left to right.) One amusing aspect is that punctuation like ? and !
stays vertical at the end of Mongolian sentences, but is rotated at the end
of Latin and Cyrillic ones.

Mongolian is somewhat unusual in that nowadays when it is written in
horizontal lines, it is rotated a further 90 degrees so it goes left to
right and is upside down compared to the ancestral script.

   Tim

-- 
Tim Partridge. Any opinions expressed are mine only and not those of my employer




Re: ISO 15924 codes for ConScript

2004-05-19 Thread Anto'nio Martins-Tuva'lkin
On 2004.05.19, 06:23, Doug Ewell <[EMAIL PROTECTED]> wrote:

> For those who like ISO 15924 script codes and LOVE the Unicode
> Private Use Area -- you know who you are -- check out my list of
> proposed ISO 15924 private-use codes for the ConScript Unicode
> Registry:
>
> http://users.adelphia.net/~dewell/conscript-15924.html

Great, but wouldn't "Qaas" (918; Seussian Latin Extensions) rather be
classified as Latn?

--.
António MARTINS-Tuválkin |  ()|
<[EMAIL PROTECTED]>||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




problems in Public Review 33 UTF Conversion Code Update

2004-05-19 Thread Frank Yung-Fong Tang




Looking at
http://www.unicode.org/review/


  

  33  UTF Conversion Code Update  2004.06.08

  The C language source code example for UTF conversions (ConvertUTF.c) has been
  updated to version 1.2 and is being released for public review and
  comment. This update includes fixes for several minor bugs. The code
  can be found at the above link.

  


and look at the code
under http://www.unicode.org/Public/BETA/CVTUTF-1-2/

In http://www.unicode.org/Public/BETA/CVTUTF-1-2/ConvertUTF.c

/* * Index into the table below with the first byte of a UTF-8 sequence to * get the number of trailing bytes that are supposed to follow it. */static const char trailingBytesForUTF8[256] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5};although there are code prevent 5-6 bytes UTF-8 sequence. The array above mislead people to think there are 5 and 6 bytes UTF-8. Also, F5-F7 should not map to 3. C0 and C1 !
 should not map to 1It should be change to static const char trailingBytesForUTF8[256] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,0,0,0,0,0,0,0,0,0,0,0};/* * Once the bits are split out into bytes of UTF-8, this is a mask OR-ed * into the first byte, depending on how many bytes follow.  There are * as many entries in this table as there are UTF-8 sequence types. * (I.e., one byte sequence, two byte... six byte sequence.!
 ) */static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0

xE0, 0xF0, 0xF8, 0xFC };
This comment is also
misleading "six byte sequence" and "0xF8, 0xFC"

    /* Figure out how many bytes the result will require */
    if (ch < (UTF32)0x80) {             bytesToWrite = 1;
    } else if (ch < (UTF32)0x800) {     bytesToWrite = 2;
    } else if (ch < (UTF32)0x10000) {   bytesToWrite = 3;
    } else if (ch < (UTF32)0x200000) {  bytesToWrite = 4;

Shouldn't the last line be

    } else if (ch < (UTF32)0x110000) {  bytesToWrite = 4;

? Where does the 0x200000 come from?

    switch (extraBytesToRead) {
        case 5: ch += *source++; ch <<= 6;
        case 4: ch += *source++; ch <<= 6;

This code also misleads people into thinking there are 5- and 6-byte UTF-8
sequences.

Also, the following routine

    static Boolean isLegalUTF8(const UTF8 *source, int length) {
        UTF8 a;
        const UTF8 *srcptr = source+length;
        switch (length) {
        default: return false;
            /* Everything else falls through when "true"... */
        case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
        case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
        case 2: if ((a = (*--srcptr)) > 0xBF) return false;
            switch (*source) {
                /* no fall-through in this inner switch */
                case 0xE0: if (a < 0xA0) return false; break;
                case 0xF0: if (a < 0x90) return false; break;
                case 0xF4: if (a > 0x8F) return false; break;
                default:  if (a < 0x80) return false;
            }
        case 1: if (*source >= 0x80 && *source < 0xC2) return false;
            if (*source > 0xF4) return false;
        }
        return true;
    }

does NOT match Table 3.1B as defined in Unicode 3.2; see
http://www.unicode.org/reports/tr28/#3_1_conformance
or Table 3-6, Well-Formed UTF-8 Byte Sequences, on page 78 of Unicode 4.0.
In particular, the function treats the following range as legal when it
should NOT:

    U+D800..U+DFFF    ED A0-BF 80-BF

Also, in http://www.unicode.org/Public/BETA/CVTUTF-1-2/harness.c
the following comment is misleading:

    /* ---------------------------------------------------------------
        test01 - Spot check a few legal & illegal UTF-8 values only.
        This is not an exhaustive test, just a brief one that was
        used to develop the "isLegalUTF8" routine.

        Legal UTF-8 sequences are:

        1st     2nd     3rd     4th     Codepoints

        00-7F                             0000-  007F
        C2-DF   80-BF                     0080-  07FF
        E0      A0-BF   80-BF             0800-  0FFF
        E1-EF   80-BF   80-BF             1000-  FFFF
        F0      90-BF   80-BF   80-BF    10000- 3FFFF
        F1-F3   80-BF   80-BF   80-BF    40000- FFFFF
        F4      80-8F   80-BF   80-BF   100000-10FFFF
       --------------------------------------------------------------- */

It should be
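
For reference, the well-formedness rules of Table 3-6 cited in this message, including the ED 80-9F restriction that excludes U+D800..U+DFFF, might be checked along the following lines. This is a minimal sketch and not the ConvertUTF.c routine: the name utf8_is_well_formed and its signature are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ConvertUTF.c.
 * Returns 1 if the len bytes at s form exactly one well-formed UTF-8
 * sequence according to Table 3-6 of Unicode 4.0, and 0 otherwise. */
int utf8_is_well_formed(const unsigned char *s, size_t len)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */
    size_t need, i;

    if (len == 0) return 0;

    if (s[0] <= 0x7F) {
        need = 1;
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0) lo = 0xA0;        /* rejects overlong forms      */
        else if (s[0] == 0xED) hi = 0x9F;   /* rejects U+D800..U+DFFF      */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0) lo = 0x90;        /* rejects overlong forms      */
        else if (s[0] == 0xF4) hi = 0x8F;   /* rejects values > U+10FFFF   */
    } else {
        return 0;                           /* 80-BF, C0, C1, F5-FF        */
    }

    if (len != need) return 0;
    if (need >= 2 && (s[1] < lo || s[1] > hi)) return 0;
    for (i = 2; i < need; i++) {
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;  /* plain continuation */
    }
    return 1;
}
```

With this check, ED 9F BF (U+D7FF) is accepted while ED A0 80 (the encoding of U+D800) is rejected, which is the distinction the quoted isLegalUTF8 does not make.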

RSS newsfeed for Alan Wood's Unicode Resources

2004-05-19 Thread Alan Wood
Until now, it has not been easy to find new entries for fonts and programs
in my collection of Unicode resources, so I have implemented a newsfeed:

http://www.alanwood.net/news/unicode.rss

More information about the feed can be found at:

http://www.alanwood.net/news/index.html

I hope you will find it useful.

Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)



Re: Vertical BIDI

2004-05-19 Thread John Cowan
Andrew C. West scripsit:

> The only thing that is certain is that Ogham must be rendered BTT in
> vertical contexts. For Ogham text in isolation this is fairly easy to
> accomplish by simple rotation, and one could expect "writing-mode
> : bt-rl" or "writing-mode : bt-lr" to accomplish this in a CSS
> stylesheet. Whether the columns should run LTR or RTL across the page
> is another question, although LTR would be simplest to implement as
> it would only involve rotating a whole block of horizontal LTR Ogham
> text 90 degrees anticlockwise. At any rate, vertical presentation is
> a matter for a higher protocol, and not a Unicode matter.

I think it's clear by now that bt-lr is the Right Thing.  (A great pity
that the Irish monks didn't record horizontal Ogham RTL!  If you are
standing in front of an Ogham-inscribed archway, the curve of the text
does pass from your right side to your left side (and the same for a
standing stone if you in imagination flatten out the sides), and the
monks must have had *some* familiarity with Hebrew or Arabic.)

> However, Ogham text embedded in Mongolian may be a different matter. If
> a plain text editor renders everything horizontally, as most do, then
> both Mongolian and Ogham should be rendered LTR thus  mongolian>, but if you then select vertical presentation (assuming
> your text editor has this option) Mongolian should be rendered TTB and
> Ogham BTT thus .  I still have no idea as
> to how this should be achieved. My "hack" of using a custom rotated
> Ogham font and RLO/PDF codes would achieve the desired result for
> vertical presentation, but would make the Ogham text RTL for horizontal
> presentation, which is apparently unacceptable. But what alternatives
> are there ?

To introduce a concept of bidi override into stylesheet languages.
You need something like this anyway to handle the case of lr Latin
with embedded Han, where the Latin reads BTT and the Han reads TTB.

Fundamentally, vertical scripts like Han and Mongolian and Ogham have
an essential vertical directionality and a preferred horizontal one
(but they can sometimes tolerate the other direction: RTL Han is not
unknown).  Horizontal scripts have an essential horizontal directionality
and may or may not have a preferred vertical one.

-- 
Long-short-short, long-short-short / Dactyls in dimeter,
Verse form with choriambs / (Masculine rhyme):  [EMAIL PROTECTED]
One sentence (two stanzas) / Hexasyllabically   http://www.reutershealth.com
Challenges poets who / Don't have the time. --robison who's at texas dot net



Re: Vertical BIDI

2004-05-19 Thread Philippe Verdy
From: "John Cowan" <[EMAIL PROTECTED]>
> The difficulty arises when Ogham is mixed with vertical Han or with
> Mongolian, since once the basic directionality becomes vertical, the
> tendency to read the Ogham BTT will become automatic.  This is analogous
> to the problem that fantasai has pointed out with Latin script written
> in lr progression when Han gets mixed in: the normal reading direction
> of lr-Latin is BTT, but any Han included will automatically be read TTB,
> corrupting it.

"Corrupting" is probably a bad term here. Latin vertical text is _often_ written
by rotating it 90 degrees counterclockwise (same rotation direction for angled
presentation at 45 degrees, commonly found in the header row of tables with many
narrow columns), so that it reads bottom to top. But the clockwise rotation is
also possible (commonly found in the footer row of tables with many narrow
columns).

For Latin, the rotation of the baseline is a matter of style. In Han or Kana
texts, Latin can run in either direction, but with a different baseline
orientation in each case.

Less often (?), the baseline of Latin glyphs is not rotated but the glyphs are
put one below the other, as in crosswords (this happens mostly in
uppercase-only style, as the layout is horrible with lowercase letters). This
presentation would be consistent with traditional vertical Han presentation
(where glyphs keep their horizontal baseline, without being rotated); it
may be ideal for small inclusions of Latin in Han texts. However, it is
ill-suited to the cursive handwritten form, where the writer would probably
turn his paper 90 degrees counterclockwise to write it.

Latin is quite permissive for the rotation of its glyphs, because the baseline
orientation is very easy to figure out without ambiguities for readers. This is
not true for Ogham where you need to know the language to see in which direction
the characters must be read and interpreted.




Re: Vertical BIDI

2004-05-19 Thread Andrew C. West
Michael Everson wrote:
>
> Come on, people. Read the standard, please. It's on page 338. 

Michael is absolutely right to rebuke me for not reading the Standard. Of course
I have read the Ogham block intro before, and no doubt that is where I got the
notion of rendering Ogham BTT from, but I had forgotten that Ogham's BTT
directionality is explicitly mentioned there. If only I had reread the block
intro before joining this thread I wouldn't have ended up rambling down a dead
end in my recent postings.

But now that I'm back on the marked path the way forward is still as unclear as
ever.

The only thing that is certain is that Ogham must be rendered BTT in vertical
contexts. For Ogham text in isolation this is fairly easy to accomplish by
simple rotation, and one could expect "writing-mode : bt-rl" or "writing-mode :
bt-lr" to accomplish this in a CSS stylesheet. Whether the columns should run
LTR or RTL across the page is another question, although LTR would be simplest
to implement as it would only involve rotating a whole block of horizontal LTR
Ogham text 90 degrees anticlockwise. At any rate, vertical presentation is a
matter for a higher protocol, and not a Unicode matter.

However, Ogham text embedded in Mongolian may be a different matter. If a plain
text editor renders everything horizontally, as most do, then both Mongolian and
Ogham should be rendered LTR thus , but if you then
select vertical presentation (assuming your text editor has this option)
Mongolian should be rendered TTB and Ogham BTT thus .
I still have no idea as to how this should be achieved. My "hack" of using a
custom rotated Ogham font and RLO/PDF codes would achieve the desired result for
vertical presentation, but would make the Ogham text RTL for horizontal
presentation, which is apparently unacceptable. But what alternatives are there ?

Andrew



Re: Vertical BIDI

2004-05-19 Thread John Cowan
Philippe Verdy scripsit:

> > In fact no; both Mongolian (or Manchu, which is unified with it in
> > Unicode) and Chinese are written TTB.
> 
> Then, why did you say that:
> 
> > What's uncertain is whether a lr or a rl progression is favored,
> > given the paucity of evidence.  Michael favors lr progression.
> > There is no question that the text is read BTT.

That statement refers to Ogham, not Mongolian!

Ogham carved on stone is read up one side of the stone, then (if
necessary) across the top of the stone, then (if necessary) down the
other side of the stone.  Now maybe it's just a mistake to assimilate
this scheme to any kind of two-dimensional layout, since all known
instances of Ogham on manuscript are ordinary horizontal L2R, like Latin
(with which it is most often mixed).

The difficulty arises when Ogham is mixed with vertical Han or with
Mongolian, since once the basic directionality becomes vertical, the
tendency to read the Ogham BTT will become automatic.  This is analogous
to the problem that fantasai has pointed out with Latin script written
in lr progression when Han gets mixed in: the normal reading direction
of lr-Latin is BTT, but any Han included will automatically be read TTB,
corrupting it.

*sigh*

One of my favorite lines in the Unicode Standard reads:  "There simply
is no traditional Japanese method of typesetting Devanagari."

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
There are books that are at once excellent and boring.  Those that at
once leap to the mind are Thoreau's Walden, Emerson's Essays, George
Eliot's Adam Bede, and Landor's Dialogues.  --Somerset Maugham