Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-11 Thread QSJN 4 UKR
The prime is used for soft-sign transliteration to avoid ambiguity: the apostrophe
is reserved for the apostrophe itself, a common sign in Ukrainian and Belarusian.


Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-11 Thread Konstantin Ritt
In Ukrainian, for example, both “ь” and “`” are used.
“ь” indicates a softer pronunciation of the preceding consonant ( тіньовий ),
whilst “`” is used to split them, as if the next letter were the first letter
of a word, even when the following vowel would otherwise sound soft ( пом`якшення --
the last “я” sounds softer than the former one ).

Regards,
Konstantin

2016-02-11 18:05 GMT+04:00 QSJN 4 UKR :

> I can show an example of the use of both prime (as soft sign) and apostrophe
> (hemisoft) in Cyrillic-based phonetic transcription (Orthoepic
> Dictionary of Ukrainian, http://padaread.com/?book=84816=6
> http://padaread.com/?book=84816=7)
>


Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-11 Thread QSJN 4 UKR
I can show an example of the use of both prime (as soft sign) and apostrophe
(hemisoft) in Cyrillic-based phonetic transcription (Orthoepic
Dictionary of Ukrainian, http://padaread.com/?book=84816=6
http://padaread.com/?book=84816=7)


Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-11 Thread Asmus Freytag (t)

  
  
On 2/11/2016 6:05 AM, QSJN 4 UKR wrote:

> I can show an example of the use of both prime (as soft sign) and apostrophe
> (hemisoft) in Cyrillic-based phonetic transcription (Orthoepic
> Dictionary of Ukrainian, http://padaread.com/?book=84816=6
> http://padaread.com/?book=84816=7)

Can you give the number of the entry on that page? I've found the prime, but I
do not see an apostrophe. What I see is a combining apostrophe (similar to the
way CARON is rendered as a raised comma when following "d").

A./

  



RE: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-09 Thread Martin Heijdra
And so it is, also in the library world, both before and after Unicode: for
miagkii znak the prime is prescribed. The prime is also prescribed for certain
uses in the standard transliteration of Tibetan and Hebrew/Arabic/Persian/Pushto:



See, e.g., the relevant tables on https://www.loc.gov/catdir/cpso/roman.html:

Tibetan: When two full forms of letters are stacked, as in Sanskritized
Tibetan, there is no need to indicate the stacking. However, in the two cases
noted here a modifier letter prime should be inserted between the two
consonants for the purpose of disambiguation.

ཏྶ་   tʹsa
ཙ་    tsa
ནྱ་   nʹya
ཉ་    nya



Hebrew: A single prime ( ʹ ) is placed between two letters representing two 
distinct consonantal sounds when the combination might otherwise be read as a 
digraph.

hisʹhid



Persian: When the affix and the word with which it is connected grammatically 
are written
separately in Persian, the two are separated in romanization by a single prime
( ʹ ).

khānahʹhā





Martin Heijdra



-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Michael Everson
Sent: Tuesday, February 09, 2016 8:43 AM
To: Unicode Discussion
Subject: Re: transliteration of mjagkij znak (Cyrillic soft sign)



On 9 Feb 2016, at 05:31, Asmus Freytag (t) <asmus-...@ix.netcom.com> wrote:



> Without scouring the book I don't know whether there's another place in it
> where something is unquestionably the prime. In that case we could figure out
> whether its appearance is simply the way that font renders it. Alternatively,
> if fonts commonly make the double prime look different from two single primes,
> that might help lay any doubts to rest; but so far, what I see is a
> spacing acute.



Well, Asmus, it isn’t one. We linguists have been taught it’s the prime. 
https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics





Michael Everson * http://www.evertype.com/






Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-09 Thread Michael Everson
On 9 Feb 2016, at 05:31, Asmus Freytag (t)  wrote:

> Without scouring the book I don't know whether there's another place in it
> where something is unquestionably the prime. In that case we could figure out
> whether its appearance is simply the way that font renders it. Alternatively,
> if fonts commonly make the double prime look different from two single primes,
> that might help lay any doubts to rest; but so far, what I see is a
> spacing acute.

Well, Asmus, it isn’t one. We linguists have been taught it’s the prime. 
https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics


Michael Everson * http://www.evertype.com/




Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-08 Thread Asmus Freytag (t)

  
  
On 2/8/2016 5:47 PM, Michael Everson
  wrote:


  It’s what I was taught as the scientific romanization for Russian and Slavic in general. 

Michael Everson * http://www.evertype.com/





Source?
  
  A./

  



Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-08 Thread Asmus Freytag (t)

  
  
On 2/8/2016 6:39 PM, Charlie Ruland
  wrote:


  
  Am 09.02.2016 schrieb Asmus Freytag (t):
  

On 2/8/2016 5:47 PM, Michael
  Everson wrote:


  It’s what I was taught as the scientific romanization for Russian and Slavic in general. 

Michael Everson * http://www.evertype.com/





Source?
  
  A./
 
  
  Look at tables 27.1 (p. 348) and 27.2 (p. 351) of Paul Cubberley’s
  The Slavic Alphabets (= Peter T. Daniels and William Bright
  (eds.): The World’s Writing Systems, pp. 346–355).
  Obviously the soft sign <ь> is transliterated as a prime
  <ʹ>, and the hard sign <ъ> as a double prime
  <ʺ>. Also note that <ћ> [gʲ] is romanized as <ǵ>,
  which can hardly be considered an apostrophe.
  

I looked.

The <ǵ> looks like a g-acute. However, the "ink" for that
acute matches the ink for the prime used for <ь>, which is
otherwise at the wrong angle compared to the double prime. (It does
not look like one half of the double prime; the slight difference in
weight would be more typical of single/double symbols.)

Without scouring the book I don't know whether there's another place
in it where something is unquestionably the prime. In that case we
could figure out whether its appearance is simply the way that font
renders it. Alternatively, if fonts commonly make the double prime
look different from two single primes, that might help lay any doubts
to rest; but so far, what I see is a spacing acute.

A./
  



Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-08 Thread Michael Everson
It’s what I was taught as the scientific romanization for Russian and Slavic in 
general. 

Michael Everson * http://www.evertype.com/




Re: transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-08 Thread Charlie Ruland

Am 09.02.2016 schrieb Asmus Freytag (t):

On 2/8/2016 5:47 PM, Michael Everson wrote:

It’s what I was taught as the scientific romanization for Russian and Slavic in 
general.

Michael Everson *http://www.evertype.com/




Source?

A./


Look at tables 27.1 (p. 348) and 27.2 (p. 351) of Paul Cubberley’s /The 
Slavic Alphabets/ (= Peter T. Daniels and William Bright (eds.): /The 
World’s Writing Systems/, pp. 346–355). Obviously the soft sign <ь> is 
transliterated as a prime <ʹ>, and the hard sign <ъ> as a double prime 
<ʺ>. Also note that <ћ> [gʲ] is romanized as <ǵ>, which can hardly be 
considered an apostrophe.


Charlie
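The mapping Cubberley's tables describe can be sketched as a small substitution table. This is only an illustrative fragment under the convention discussed above (soft sign to U+02B9, hard sign to U+02BA); the `TRANSLIT` dictionary and `translit` function are hypothetical names, and the real scientific-transliteration table covers the whole alphabet:

```python
# Sketch of scientific transliteration for a few Russian letters.
# Soft sign -> U+02B9 MODIFIER LETTER PRIME,
# hard sign -> U+02BA MODIFIER LETTER DOUBLE PRIME.
TRANSLIT = {
    'ь': '\u02B9',  # soft sign (mjagkij znak)
    'ъ': '\u02BA',  # hard sign (tvjordyj znak)
    'м': 'm', 'я': 'ja', 'г': 'g', 'к': 'k', 'и': 'i', 'й': 'j',
    'з': 'z', 'н': 'n', 'а': 'a', 'д': 'd', 'е': 'e',
}

def translit(text: str) -> str:
    """Transliterate, passing through anything not in the table."""
    return ''.join(TRANSLIT.get(ch, ch) for ch in text.lower())

print(translit('мягкий знак'))  # mjagkij znak
print(translit('день'))         # denʹ (ends with U+02B9, not an apostrophe)
```

The point of using the modifier letter rather than an apostrophe or U+2032 PRIME is that the result remains a sequence of letters, which matters for sorting and word segmentation.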


transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-08 Thread Otto Stolz

Hello,

I am wondering how U+02B9 MODIFIER LETTER PRIME made
its way into the Unicode repertoire, and how it
acquired its comment “transliteration of mjagkij znak
(Cyrillic soft sign: palatalization)”.

ISO/R 9:1954 through ISO/R 9:1986 map the mjagkij znak
“ь” to the apostrophe, and so does DIN 1460:1982. The latter
clearly depicts the apostrophe that later became U+02BC,
while I am not sure whether also ISO/R 9 does so or rather
depicts a glyph like U+0027. (All of these standards
predate Unicode, so they just depict glyphs.)

ISO/R 9:1995 maps the mjagkij znak “ь” to the prime,
particularly to the modifier letter U+02B9, in accordance
with the comment in the Unicode charts.

Unicode archeologists, can you shed some light on the
history of both U+02B9 and the mjagkij znak?

And linguists, can you tell me how the mjagkij znak is
transliterated normally, as an apostrophe or as a prime?

Thanks for any comments,
  Otto
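The look-alike characters at the heart of this question can be told apart by their code points, names, and general categories; a quick check with Python's `unicodedata` module:

```python
import unicodedata

# ASCII apostrophe, the two modifier letters, and the prime symbol.
for ch in '\u0027\u02B9\u02BC\u2032':
    print(f'U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}')

# U+0027  Po  APOSTROPHE
# U+02B9  Lm  MODIFIER LETTER PRIME
# U+02BC  Lm  MODIFIER LETTER APOSTROPHE
# U+2032  Po  PRIME
```

Note that the two modifier letters are category Lm (letters), which is why they, rather than the punctuation characters, are the natural choice in transliteration systems.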



Precomposed Cyrillic letters

2015-07-09 Thread Doug Ewell
 From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf,
Addition of two letters from Montenegrin language, CYRILLIC script:

 9. Can any of the proposed characters be encoded using a composed
 character sequence of either existing characters or other proposed
 characters?
 No

Saying it doesn't make it so:

 Annex 1: Character shapes (related to section B, item 4b)
 Cyrillic small letter SJ
 с́

0441 0301

 Cyrillic capital letter SJ
 С́

0421 0301

 Cyrillic small letter ZJ
 з́

0437 0301

 Cyrillic capital letter ZJ
 З́

0417 0301

Quite a few fonts don't display these well (and quite a few do), but of
course that's a font problem, not an encoding problem.

Cf. http://www.unicode.org/faq/char_combmark.html#11
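The point about composed sequences can be verified directly: the Montenegrin letters have no precomposed code points, so NFC normalization leaves them as base letter plus combining acute. A minimal Python check:

```python
import unicodedata

# Montenegrin SJ/ZJ small letters as combining sequences (per the
# proposal's Annex 1: 0441 0301 and 0437 0301).
sj_small = '\u0441\u0301'  # CYRILLIC SMALL LETTER ES + COMBINING ACUTE ACCENT
zj_small = '\u0437\u0301'  # CYRILLIC SMALL LETTER ZE + COMBINING ACUTE ACCENT

# No precomposed character exists, so NFC leaves each sequence at two
# code points (contrast 'e' + U+0301, which composes to U+00E9).
assert len(unicodedata.normalize('NFC', sj_small)) == 2
assert len(unicodedata.normalize('NFC', zj_small)) == 2
assert len(unicodedata.normalize('NFC', 'e\u0301')) == 1

print([f'U+{ord(c):04X}' for c in sj_small])  # ['U+0441', 'U+0301']
```

Whether a font renders the sequence well is, as noted, a font problem; the encoding answer is simply the combining sequence.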


--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Precomposed Cyrillic letters

2015-07-09 Thread Markus Scherer
On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell d...@ewellic.org wrote:

  From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf,
 Addition of two letters from Montenegrin language, CYRILLIC script:

  9. Can any of the proposed characters be encoded using a composed
  character sequence of either existing characters or other proposed
  characters?
  No

 Saying it doesn't make it so:


Right, although I doubt that the proposers monitor this mailing list...

In case an interested party is listening: If sr-ME needs different locale
data than sr, then one could contribute such data to CLDR
http://cldr.unicode.org/.
See the current state:
http://unicode.org/cldr/trac/browser/trunk/common/main/sr_Cyrl_ME.xml

markus


Re: Precomposed Cyrillic letters

2015-07-09 Thread Richard Wordingham
On Thu, 9 Jul 2015 09:37:21 -0700
Markus Scherer markus@gmail.com wrote:

 On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell d...@ewellic.org wrote:
 
   From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf,
  Addition of two letters from Montenegrin language, CYRILLIC
  script:
 
   9. Can any of the proposed characters be encoded using a composed
   character sequence of either existing characters or other proposed
   characters?
   No
 
  Saying it doesn't make it so:

Is there a requirement to answer those questions truthfully?

 Right, although I doubt that the proposers monitor this mailing
 list...
 
 In case an interested party is listening: If sr-ME needs different
 locale data than sr, then one could contribute such data to CLDR
 http://cldr.unicode.org/.
 See the current state:
 http://unicode.org/cldr/trac/browser/trunk/common/main/sr_Cyrl_ME.xml

Presumably http://cldr.unicode.org/index/survey-tool/accounts is the
most relevant page for someone with credibility.  However, as
Montenegro has an army and a navy, you have the wrong locale.  It's
still waiting for a language code.  See the language family panels
at https://en.wikipedia.org/wiki/Eastern_Herzegovinian_dialect and
https://en.wikipedia.org/wiki/Montenegrin_language for the extreme
Balkanisation.

But in short, yes we need the extra Cyrillic letters с́ and з́  and
Latin letters ś and ź for the exemplar characters in sr_Cyrl_ME and
sr_Latn_ME (or should that be sr_ME?).  I can't work out the status of
Montenegrin Latin {sj} and {zj}.

Richard.




Re: Precomposed Cyrillic letters

2015-07-09 Thread Doug Ewell
Richard Wordingham richard dot wordingham at ntlworld dot com wrote:

 Presumably http://cldr.unicode.org/index/survey-tool/accounts is the
 most relevant page for someone with credibility. However, as
 Montenegro has an army and a navy, you have the wrong locale. It's
 still waiting for a language code. See the language family panels
 at https://en.wikipedia.org/wiki/Eastern_Herzegovinian_dialect and
 https://en.wikipedia.org/wiki/Montenegrin_language for the extreme
 Balkanisation.

Montenegro could have all the military power in the world, but that
doesn't make Montenegrin a distinct language. It's a dialect of
Serbian.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Old Cyrillic Yest

2014-02-26 Thread QSJN 4 UKR
2012/11/12 QSJN 4 UKR qsjn4ukr at gmail dot com wrote:

  Old Cyrillic letter YEST (Є) has two variants: broad (also called
  Yakornoye Yest) and narrow. They are saved in modern Ukrainian script
  (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD
  YEST and the modern, rectangle form of U+0415/0453 IE for the NARROW
  YEST. Unicode Standard has a remark to use U+0404 for the Old Cyrillic
  YEST, but it is unclear, how to distinguish the BROAD YEST and the
  NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and
  U+0415/0435 for the modern rectangle IE, some old-style fonts use only
  the old YEST but with codepoint U+0415/0435 and do not use U+0404/0454
  at all, some use U+0404/0454 for the BROAD YEST and U+0415/0435 for
  the NARROW YEST...

2012/11/23 Doug Ewell d...@ewellic.org
How many truly different letters, old and new, are we talking about? On 
November 12 you wrote, UKRAINIAN IE and BROAD YEST is the same letter in 
fact. It would not make sense to assign a new BROAD YEST letter if it is 
really the same as UKRAINIAN IE, and if existing texts already use UKRAINIAN 
IE to represent it.


Full picture (Meaning - Glyph - Codepoint):

Old Church Slavonic:
- Narrow Yest (regular form): a very narrow halfmoon; encoded as 0404/0454
  (ambiguous) or 0415/0435 (the wrong glyph will probably be rendered); there
  are no dedicated codepoints.
- Broad Yest (special form: initial, plural disambiguator): a broad halfmoon,
  identical to Ukrainian Ie or perhaps somewhat larger (breaking the
  baseline); 0404/0454 indeed.

Modern imitation of Church Slavonic, really old texts, or texts where Broad
and Narrow Yest are hard to distinguish:
- Ambiguous Yest: identical to Ukrainian Ie, or like Narrow Yest in an
  old-style font; 0404/0454 for sure.

Modern languages:
- Ie: rectangular capital / closed rounded small (identical to Latin);
  0415/0435.
- Ukrainian Ie: identical to the ambiguous Yest; 0404/0454.

So there are two steps. First, required: separate codepoints for the Narrow
Yest. It is simply impossible to work with Church Slavonic texts without
them, because the wrong glyph is rendered almost always (we cannot rely on
language detection, since such texts routinely mix old text with a modern
translation), or else there is no way to show the Broad Yest at all.
Second, optional: separate codepoints for the Broad Yest. That is only
necessary if one part of a text contains the ambiguous Yest (coded as now,
0404/0454, without changes!) while another part contains the Broad Yest and
the author can and wants to show this distinction.

Am I the only person in the world who thinks that Unicode is poorly
adapted for Church Slavonic?

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Old Cyrillic Yest

2013-01-29 Thread QSJN 4 UKR
2013/1/29 QSJN 4 UKR qsjn4...@gmail.com

 I found something terrible. Sorry, I did not make a photo. That is a
 modern book with [http://litopys.org.ua/smotrgram/sm11.htm]-this text of
 Meletius Smotrytsky Grammar, but a reprint, not a faximile like I refer to.
  Here are the rules about using BROAD YEST and NARROW YEST. Modern
 publisher used GREEK EPSILON and UKRAINIAN IE to show NARROW and BROAD
 YEST. Hah! Try to guess what is what. The funniest is that there are
 examples: тѣм творцєм - тым творцєм (it has to be  тѣм творцεм
 (singular) - тым творцєм (plural) ).

:)  :)  :)
Or vice versa: plural - singular. I didn't get it!


Re: Old Cyrillic Yest

2013-01-29 Thread QSJN 4 UKR
I found something terrible. Sorry, I did not make a photo. It is a modern
book containing [http://litopys.org.ua/smotrgram/sm11.htm] this text of Meletius
Smotrytsky's Grammar, but a reprint, not a facsimile like the one I referred to.
Here are the rules about using BROAD YEST and NARROW YEST. The modern publisher
used GREEK EPSILON and UKRAINIAN IE to show NARROW and BROAD YEST. Hah! Try
to guess which is which. The funniest part is that there are examples: тѣм
творцєм - тым творцєм (it should be тѣм творцεм (singular) - тым творцєм
(plural)).


Re: Old Cyrillic Yest

2012-11-29 Thread Michael Everson
On 29 Nov 2012, at 08:57, QSJN 4 UKR qsjn4...@gmail.com wrote:

 Yes, maybe, probably. The truly different glyph is the NARROW YEST. The truly 
 special character name belongs to the BROAD YEST, YAKORNOYE YEST, while the 
 narrow one, as well as the modern UKRAINIAN є, is just IE or YEST. Well, I 
 don't know; would you please read the Wikipedia article or something: 
 http://ru.wikipedia.org/wiki/Якорное_Е (N.B. There is only one source 
 reference in the Wiki article. Dark night!)

There are ways of making a case for disunification. Qsjn 4 Ukr has not made 
them.

Michael Everson * http://www.evertype.com/





Old Cyrillic Yest

2012-11-12 Thread QSJN 4 UKR
Old Cyrillic letter YEST (Є) has two variants: broad (also called
Yakornoye Yest) and narrow. They are preserved in the modern Ukrainian script
(only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD
YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW
YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic
YEST, but it is unclear how to distinguish the BROAD YEST and the
NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and
U+0415/0435 for the modern rectangular IE, some old-style fonts use only
the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454
at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for
the NARROW YEST... Please regulate it!
The Unicode Standard has some codepoints for other broad Cyrillic letters:
U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (a misnomer; it is a
broad o). Adding new codepoints for the BROAD YEST does not solve the
problem: as I said, UKRAINIAN IE and BROAD YEST are in fact the same letter.
Adding new codepoints for the NARROW YEST is a bad idea too:
existing texts use U+0404/0454 for NARROW YEST more often than for
BROAD YEST (simply because the broad form is rare). So we need as many as 4
new codepoints in the U+A6xx block for CYRILLIC CAPITAL and SMALL LETTER
BROAD and NARROW YEST. That way we shall be able to use both
discernible letters of the Old Cyrillic, and we shall not mix them
with the modern Ukrainian letters nor with each other.
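For reference, the four code points under discussion can be listed with their standard names; a quick Python check:

```python
import unicodedata

# The Ukrainian Ie and plain Ie pairs that the Yest discussion revolves around.
for cp in (0x0404, 0x0454, 0x0415, 0x0435):
    print(f'U+{cp:04X}  {unicodedata.name(chr(cp))}')

# U+0404  CYRILLIC CAPITAL LETTER UKRAINIAN IE
# U+0454  CYRILLIC SMALL LETTER UKRAINIAN IE
# U+0415  CYRILLIC CAPITAL LETTER IE
# U+0435  CYRILLIC SMALL LETTER IE
```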




Re: Old Cyrillic Yest

2012-11-12 Thread Leo Broukhis
Telling font designers how to do their job (even if it's within
Unicode's purview which I doubt) by adding new codepoints is a novel
idea to say the least.

Leo

On Mon, Nov 12, 2012 at 3:32 AM, QSJN 4 UKR qsjn4...@gmail.com wrote:
 Old Cyrillic letter YEST (Є) has two variants: broad (also called
 Yakornoye Yest) and narrow. They are saved in modern Ukrainian script
 (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD
 YEST and the modern, rectangle form of U+0415/0453 IE for the NARROW
 YEST. Unicode Standard has a remark to use U+0404 for the Old Cyrillic
 YEST, but it is unclear, how to distinguish the BROAD YEST and the
 NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and
 U+0415/0435 for the modern rectangle IE, some old-style fonts use only
 the old YEST but with codepoint U+0415/0435 and do not use U+0404/0454
 at all, some use U+0404/0454 for the BROAD YEST and U+0415/0435 for
 the NARROW YEST... Please regulate it!
 Unicode Standard has some codepoins for other broad Cyrillic letters:
 U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (misnomer, it is
 broad o). Adding new codepoints for the BROAD YEST does not solve the
 problem: as i said, UKRAINIAN IE and BROAD YEST is the same letter in
 fact. Adding new codepoints for the NARROW YEST is bad idea too,
 existing texts use U+0404/0454 for NARROW YEST more often than for
 BROAD YEST (just since broad form is rare:). So we need as many as 4
 new codepoints in U+A6xx block for CYRILLIC CAPITAL and SMALL LETTER
 BROAD and NARROW YEST. That way we shall be able to use both
 discernible letters of the Old Cyrillic, and we shall not mix them
 with the modern Ukrainian letters nor each other.






Re: Old Cyrillic Yest

2012-11-12 Thread Doug Ewell
QSJN 4 UKR qsjn4ukr at gmail dot com wrote:

 Old Cyrillic letter YEST (Є) has two variants: broad (also called
 Yakornoye Yest) and narrow. They are saved in modern Ukrainian script
 (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD
 YEST and the modern, rectangle form of U+0415/0453 IE for the NARROW
 YEST. Unicode Standard has a remark to use U+0404 for the Old Cyrillic
 YEST, but it is unclear, how to distinguish the BROAD YEST and the
 NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and
 U+0415/0435 for the modern rectangle IE, some old-style fonts use only
 the old YEST but with codepoint U+0415/0435 and do not use U+0404/0454
 at all, some use U+0404/0454 for the BROAD YEST and U+0415/0435 for
 the NARROW YEST... Please regulate it!

The Unicode Consortium does not regulate this aspect of fonts, nor
should it, except to say that glyphs have to represent the true abstract
character, and not display, say, a B-like glyph at the code point for
the letter A.

If you are saying that Chapter 7.4 of TUS needs a description of these
two abstract characters, that seems fair, but that is as far as the
regulating goes.

 Unicode Standard has some codepoins for other broad Cyrillic letters:
 U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (misnomer, it is
 broad o). Adding new codepoints for the BROAD YEST does not solve the
 problem: as i said, UKRAINIAN IE and BROAD YEST is the same letter in
 fact. Adding new codepoints for the NARROW YEST is bad idea too,
 existing texts use U+0404/0454 for NARROW YEST more often than for
 BROAD YEST (just since broad form is rare:). So we need as many as 4
 new codepoints in U+A6xx block for CYRILLIC CAPITAL and SMALL LETTER
 BROAD and NARROW YEST. That way we shall be able to use both
 discernible letters of the Old Cyrillic, and we shall not mix them
 with the modern Ukrainian letters nor each other.

This would create duplicate encodings for existing text, a Bad Thing. If
this is genuinely a problem, the improved explanation in Chapter 7.4
(above) would be a better solution.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell ­





Re: [indic] Indic Transliteration Standards in Cyrillic Greek

2012-11-11 Thread N. Ganesan
On Sat, Nov 10, 2012 at 12:49 PM, Vinodh Rajan vinodh.vin...@gmail.com
wrote:

 Hi,

 These are several standards for transliterating Indic script to Roman
 characters such as IAST, ISO 15919 etc.

 I would like to know if any similar standards exist for expressing the
 Indic set in Greek  Cyrillic with special diacritics.

 If they do exist, any pointers to their Unicode representations.

 Thanks

 V

 --
 http://www.virtualvinodh.com

Vinodh,

These resources will help:

http://transliteration.eki.ee/pdf/Russian.pdf

http://en.wikipedia.org/wiki/Scientific_transliteration_of_Cyrillic

http://learningrussian.net/pronunciation/transliteration.php

N. Ganesan


Re: [indic] Re: Indic Transliteration Standards in Cyrillic Greek

2012-11-11 Thread N. Ganesan
On Sat, Nov 10, 2012 at 3:02 PM, John Hudson j...@tiro.ca wrote:

 I'm sorry, I misread the original question. I'm not aware of particular
 Cyrillic or Greek transcription systems for Indic scripts or languages. My
 suspicion is that Russian systems exist, given the historic interests of
 Russian linguistic studies. I'm doubtful if Greek systems exist, but would
 be happy to be proven wrong.

 JH


My guess is Vinodh wants to add the capability of transliterating from Indic into Cyrillic scripts:

One way would be to transliterate to Latin letters first and then convert to Cyrillic:
http://transliteration.eki.ee/pdf/Russian.pdf

But for some letters, say in Tamil, there won't be equivalents in Cyrillic.

N. Ganesan


Indic Transliteration Standards in Cyrillic Greek

2012-11-10 Thread Vinodh Rajan
Hi,

These are several standards for transliterating Indic script to Roman
characters such as IAST, ISO 15919 etc.

I would like to know if any similar standards exist for expressing the
Indic set in Greek  Cyrillic with special diacritics.

If they do exist, any pointers to their Unicode representations.

Thanks

V

-- 
http://www.virtualvinodh.com


Re: Indic Transliteration Standards in Cyrillic Greek

2012-11-10 Thread Philippe Verdy
At least there should exist conventions in all languages for
transliterating, in their own script, an IPA representation (used as a
central phonetic transcription, where the source language would be
noted using its subset of IPA, representing its initial phonology
rather than one particular phonetic realization). These phonological
IPA representations should then find a good approximation in the
target (script/language) pair, in order to produce consistent
phonological transcriptions that are read correctly in the target
language.

Pure transliterations are most often unreadable, or read very
incorrectly (even if the target language has good support for
representing the most frequent realizations of a phonological phoneme
of the source language).

This scheme could also help transcriptions from one language to
another that share the same script (e.g. English "cheese" transcribed
in French as "tchise", ignoring the representation of long vowels that
are not heard in target French, or "tchiise", but not "tchīse", as the
macron is not read distinctly in French). You may argue that you don't
need this because we already have IPA, but IPA is unreadable by most
people, and there is still the need for more conventional symbols
(and IPA is completely unreadable for readers of scripts other than
Latin, Greek or Cyrillic).

The application would be to transliterate personal names or toponyms in
postal addresses, contact lists, or administrative forms to be used in
foreign countries where people can't decipher other scripts (such as
Arabic or sinograms), or in airports for travelling, or to prevent
people from simply inventing their own rendering of their name in
another script, in such a way that the chosen name is not registered
and verifiable anywhere (unless these people have officially registered
their alternate "usage names" in their own country, but very few
countries permit registration of such usage names by individual
people).

For those countries that allow registration of people's names in
scripts other than the national one, most will only allow the Latin
script (and frequently a very restricted subset of it), but not Arabic,
Greek, Cyrillic, or Japanese kana. To support this process, those
countries use their own national standard transliterators to the Latin
script (i.e. "romanizations"), simply because it is the most widely
known and used internationally (and in all computer applications); they
have no support for registering additional usage names in other
scripts, or usage names dependent on the target language (so the single
supported romanization will also be read incorrectly in many target
languages, or could even be offensive in some, and travellers may want
to use another usage name in those countries).

2012/11/10 Vinodh Rajan vinodh.vin...@gmail.com:
 Hi,

 These are several standards for transliterating Indic script to Roman
 characters such as IAST, ISO 15919 etc.

 I would like to know if any similar standards exist for expressing the Indic
 set in Greek  Cyrillic with special diacritics.

 If they do exist, any pointers to their Unicode representations.

 Thanks

 V

 --
 http://www.virtualvinodh.com




Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)

2012-03-06 Thread Philippe Verdy
We've got the example of the ISO 9 standard itself.

Le 5 mars 2012 22:46, Michael Everson ever...@evertype.com a écrit :
 On 5 Mar 2012, at 20:13, Benjamin M Scarborough wrote:

 There is a clear precedent here that the unifications of N2463 are not 
 necessarily the final fate of any of these characters. If the О Е letter for 
 Selkup should be disunified from U+0152/U+0153, then a proposal needs to be 
 submitted calling for the addition of the two letters to the UCS.

 Have you got examples, Ben?

 Michael Everson * http://www.evertype.com/




Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)

2012-03-05 Thread Denis Jacquerye
On Tue, Feb 28, 2012 at 4:00 AM, Philippe Verdy verd...@wanadoo.fr wrote:
 I am looking for the code points or assignment status of the Cyrillic
 letter OE/oe (ligatured), as used in Selkup (exactly similar to the
 Latin pair).

 This character pair has been part of the registration nr. 223 (in
 1998) by ISO of the (8-bit) extended Cyrillic character set for
 non-Slavic languages for bibliographic information interchange :

 http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf

 According to this document, this character set had also been
 standardized as ISO 10756:1996. Note that it contains many other
 characters for which it did not document any mapping to the UCS in the
 then emerging ISO 10646 standard.

 It has even been part of proposals to the UTC and ISO the same year
 for inclusion in the UCS, along with other characters (at that time,
 Michael Everson wrote a proposal placing them at U+04EC, U+04ED, but
 since then, those slots have been used for other characters; that block
 is now full).

 It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard.

 Still, there's no Cyrillic character I can find in the encoded UCS in
 other Cyrillic extended blocks that are not full (for example,  the
 CYRILLIC SUPPLEMENT block at U+0500-052F).

 Where are those characters ? And what about the remaining characters
 found in the Registration nr. 223 and ISO 10756:1996 ? And their
 status in the ISO 9 standard itself ?

 Thanks.

 -- Philippe.


According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the
Cyrillic Selkup OE is mapped to Latin OE:
CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE
CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE
Several other of those missing Cyrillic characters are simply mapped
to Latin ones or sort of decomposed.

-
Denis Moyogo Jacquerye




Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)

2012-03-05 Thread Benjamin M Scarborough
On Mon, Mar 5, 2012 at 19:35, Denis Jacquerye wrote:
 According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the
 Cyrillic Selkup OE is mapped to Latin OE:
 CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE
 CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE
 Several other of those missing Cyrillic characters are simply mapped
 to Latin ones or sort of decomposed. 

N2463 also maps twelve characters from ISO 10574 that have been disunified 
since 2002, namely:
04/06 CYRILLIC SMALL LETTER KURDISH QA is now U+051B CYRILLIC SMALL LETTER QA
04/09 CYRILLIC SMALL LETTER EL WITH MIDDLE HOOK is now U+0521 CYRILLIC SMALL 
LETTER EL WITH MIDDLE HOOK
04/10 CYRILLIC SMALL LETTER MORDVIN EL KA is now U+0515 CYRILLIC SMALL LETTER 
LHA
04/14 CYRILLIC SMALL LETTER EN WITH MIDDLE HOOK is now U+0523 CYRILLIC SMALL 
LETTER EN WITH MIDDLE HOOK
05/06 CYRILLIC CAPITAL LETTER KURDISH QA is now U+051A CYRILLIC CAPITAL LETTER 
QA
05/09 CYRILLIC CAPITAL LETTER EL WITH MIDDLE HOOK is now U+0520 CYRILLIC 
CAPITAL LETTER EL WITH MIDDLE HOOK
05/10 CYRILLIC CAPITAL LETTER MORDVIN EL KA is now U+0514 CYRILLIC CAPITAL 
LETTER LHA
05/14 CYRILLIC CAPITAL LETTER EN WITH MIDDLE HOOK is now U+0522 CYRILLIC 
CAPITAL LETTER EN WITH MIDDLE HOOK
06/03 CYRILLIC SMALL LETTER ER KA is now U+0517 CYRILLIC SMALL LETTER RHA
06/08 CYRILLIC SMALL LETTER KURDISH WE is now U+051D CYRILLIC SMALL LETTER WE
07/03 CYRILLIC CAPITAL LETTER ER KA is now U+0516 CYRILLIC CAPITAL LETTER RHA
07/08 CYRILLIC CAPITAL LETTER KURDISH WE is now U+051C CYRILLIC CAPITAL LETTER 
WE

There is a clear precedent here that the unifications of N2463 are not 
necessarily the final fate of any of these characters. If the О Е letter for 
Selkup should be disunified from U+0152/U+0153, then a proposal needs to be 
submitted calling for the addition of the two letters to the UCS.

It is worth noting that N2463 also decomposes four characters using U+0335, a 
practice which hasn't been used for decompositions since Unicode 1.1.

I also don't understand the mapping of 04/05 CYRILLIC SMALL LETTER CHECHEN KA 
and 05/05 CYRILLIC CAPITAL LETTER CHECHEN KA into U+043A CYRILLIC SMALL LETTER 
KA + U+030A COMBINING RING ABOVE and U+041A CYRILLIC CAPITAL LETTER KA + 
U+030A COMBINING RING ABOVE, respectively. Is the character shown in ISO 10574 
just a glyph variant of this combining sequence?

—Ben Scarborough
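The combining sequence in question can be inspected with Python's standard library; a small sketch (saying nothing about the ISO 10574 glyph itself) showing that the sequence has no precomposed form, so normalization leaves it as two codepoints:

```python
import unicodedata

# CYRILLIC SMALL LETTER KA followed by COMBINING RING ABOVE
seq = "\u043a\u030a"
print(unicodedata.name(seq[0]))       # CYRILLIC SMALL LETTER KA
print(unicodedata.name(seq[1]))       # COMBINING RING ABOVE
print(unicodedata.combining(seq[1]))  # 230: a non-spacing mark rendered above
# There is no precomposed "ka with ring above", so NFC cannot compose it:
print(len(unicodedata.normalize("NFC", seq)))  # still 2 codepoints
```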




Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)

2012-03-05 Thread Philippe Verdy
Le 5 mars 2012 19:35, Denis Jacquerye moy...@gmail.com a écrit :
 On Tue, Feb 28, 2012 at 4:00 AM, Philippe Verdy verd...@wanadoo.fr wrote:
 I am looking for the codes or assignment status of the Cyrillic
 letter OE/oe (ligatured) as used in Selkup (exactly similar to the
 Latin pair).

 This character pair has been part of the registration nr. 223 (in
 1998) by ISO of the (8-bit) extended Cyrillic character set for
 non-Slavic languages for bibliographic information interchange :

 http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf

 According to this document, this character set had also been
 standardized as ISO 10756:1996. Note that it contains many other
 characters for which it did not document any mapping to the UCS in the
 then emerging ISO 10646 standard.

 It has even been part of proposals at the UTC and ISO the same year
 for including in the UCS, along with other characters (at that time,
 Michael Everson wrote a proposal, placing them in U+04EC, U+04ED, but
 since then, the slots have been used for other characters (that block
 is now full).

 It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard.

 Still, there's no Cyrillic character I can find in the encoded UCS in
 other Cyrillic extended blocks that are not full (for example,  the
 CYRILLIC SUPPLEMENT block at U+0500-052F).

 Where are those characters? And what about the remaining characters
 found in the Registration nr. 223 and ISO 10756:1996? And their
 status in the ISO 9 standard itself?

 Thanks.

 -- Philippe.


 According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the
 Cyrillic Selkup OE is mapped to Latin OE:
 CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE
 CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE
 Several other of those missing Cyrillic characters are simply mapped
 to Latin ones or sort of decomposed.

Apparently this document is obsolete. Some of the proposed mappings to
Latin have been encoded as plain Cyrillic letters such as:

CYRILLIC SMALL LETTER KURDISH QA

(not the initially proposed mapping to LATIN SMALL LETTER Q)

This document was still a draft, and not a decision.

The document specifically says: "The issue with these letters is
whether they should be deunified from Latin, and encoded in the
Cyrillic block."
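For data that followed the draft unification (Kurdish QA and WE typed as Latin Q and W), migration to the later disunified Cyrillic letters is a simple character remapping. A sketch under the assumption that the text is known to be Kurdish Cyrillic throughout; the table name is made up:

```python
# Hypothetical migration table: Latin Q/W stand-ins -> disunified
# Cyrillic QA/WE (U+051A..U+051D). Only safe when the whole text is
# known to be Kurdish Cyrillic; blindly remapping mixed text would
# corrupt genuine Latin content.
DEUNIFY = str.maketrans({
    "Q": "\u051a",  # CYRILLIC CAPITAL LETTER QA
    "q": "\u051b",  # CYRILLIC SMALL LETTER QA
    "W": "\u051c",  # CYRILLIC CAPITAL LETTER WE
    "w": "\u051d",  # CYRILLIC SMALL LETTER WE
})
print("Qw".translate(DEUNIFY))  # two Cyrillic letters, visually similar
```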




Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)

2012-03-05 Thread Michael Everson
On 5 Mar 2012, at 20:13, Benjamin M Scarborough wrote:

 There is a clear precedent here that the unifications of N2463 are not 
 necessarily the final fate of any of these characters. If the О Е letter for 
 Selkup should be disunified from U+0152/U+0153, then a proposal needs to be 
 submitted calling for the addition of the two letters to the UCS.

Have you got examples, Ben? 

Michael Everson * http://www.evertype.com/





CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)

2012-02-27 Thread Philippe Verdy
I am looking for the codes or assignment status of the Cyrillic
letter OE/oe (ligatured) as used in Selkup (exactly similar to the
Latin pair).

This character pair has been part of the registration nr. 223 (in
1998) by ISO of the (8-bit) extended Cyrillic character set for
non-Slavic languages for bibliographic information interchange :

http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf

According to this document, this character set had also been
standardized as ISO 10756:1996. Note that it contains many other
characters for which it did not document any mapping to the UCS in the
then emerging ISO 10646 standard.

It has even been part of proposals at the UTC and ISO the same year
for including in the UCS, along with other characters (at that time,
Michael Everson wrote a proposal, placing them in U+04EC, U+04ED, but
since then, the slots have been used for other characters (that block
is now full).

It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard.

Still, I can find no such Cyrillic character encoded in the UCS, even
in the Cyrillic extended blocks that are not full (for example, the
CYRILLIC SUPPLEMENT block at U+0500-052F).

Where are those characters? And what about the remaining characters
found in the Registration nr. 223 and ISO 10756:1996? And their
status in the ISO 9 standard itself?

Thanks.

-- Philippe.



Re: Are Latin and Cyrillic essentially the same script?

2010-11-23 Thread Michael Everson
On 22 Nov 2010, at 18:55, Asmus Freytag wrote:

 That seems to be true for IPA as well - because already, if you use the font 
 binding for IPA, your a's and g's will not come out right, which means you 
 don't even have to worry about betas and chis.


Not so. There is already a convention (going back to the late 19th or early 
20th century) about handling this. 

In an ordinary Times-like font, a slopes and loses its hat when italicized. 
In an ordinary Times-like font, ɑ is replaced by an italic Greek α (alpha). 

Michael Everson * http://www.evertype.com/





Re: Are Latin and Cyrillic essentially the same script?

2010-11-22 Thread Michael Everson
On 19 Nov 2010, at 07:15, Peter Constable wrote:

 And while IPA is primarily based on Latin script, not all of its characters 
 are Latin characters: bilabial and interdental fricative phonemes are 
 represented using Greek letters beta and theta.

IPA beta and chi behave very differently from their Greek antecedents and 
should not remain unified. The case for theta is messier because theta is so 
very messy.

Michael Everson * http://www.evertype.com/





Re: Are Latin and Cyrillic essentially the same script?

2010-11-22 Thread Michael Everson
On 19 Nov 2010, at 17:09, Peter Constable wrote:

 And historic texts aren’t as likely or unlikely to require specialized fonts?

Twenty years of historic text in Tatar isn't irrelevant. 


 It's also a notational system that requires specific training in its use, 
 
 And working with historic texts doesn’t require specific training?

Not in terms of Jaŋalif. The training you need there is just to learn to read 
the language in another alphabet. IPA is more complex than that, especially if 
you go for close transcription.

 While several orthographies have been based on IPA, my understanding is 
 that some of them saw the encoding of additional characters to make them 
 work as orthographies.
 
 Again, I don’t see how that impacts this particular case.

This particular case is analogous to the borrowing of Q and W into Cyrillic 
from Latin. 

By the way I understand that there are many people who would like to revert to 
the Latin orthography for these Turkic languages. At present Russian law 
forbids this, but it is not the case that one may expect that this orthography 
will always remain historic. 

 It boils down to this: just as there aren’t technical or usability reasons 
 that make it problematic to represent IPA text using two Greek characters in 
 an otherwise-Latin system,

Yes there are. Sorting multilingual text including Greek and IPA 
transcriptions, for one. The glyph shape for IPA beta is practically unknown in 
Greek. Latin capital Chi is not the same as Greek capital chi. 

 so also there are no technical or usability reasons I’m aware of why it is 
 problematic to represent this historic Janalif orthography using two Cyrillic 
 characters.

They are the same technical and usability reasons which led to the 
disunification of Cyrillic Ԛ and Ԝ from Latin Q and W.

Michael Everson * http://www.evertype.com/





Re: Are Latin and Cyrillic essentially the same script?

2010-11-22 Thread Asmus Freytag

On 11/22/2010 4:15 AM, Michael Everson wrote:

It boils down to this: just as there aren’t technical or usability reasons that 
make it problematic to represent IPA text using two Greek characters in an 
otherwise-Latin system,

Yes there are. Sorting multilingual text including Greek and IPA 
transcriptions, for one. The glyph shape for IPA beta is practically unknown in 
Greek. Latin capital Chi is not the same as Greek capital chi.


  so also there are no technical or usability reasons I’m aware of why it is 
problematic to represent this historic Janalif orthography using two Cyrillic 
characters.

They are the same technical and usability reasons which led to the 
disunification of Cyrillic Ԛ and Ԝ from Latin Q and W.


The sorting problem I think I understand.

Because scripts are kept together in sorting, when you have a mixed-script 
list, you normally override just the sorting for the script to 
which the (sort-)language belongs. A mixed French-Russian list would use 
French ordering for the Latin characters, but the Russian words would 
all appear together (and be sorted according to some generic sort order 
for Cyrillic characters - except that for a bilingual list, sorting the 
Cyrillic according to Russian rules might also make sense).


Same for a French-Greek list. The Greek characters will be together and 
sorted either by a generic Greek (script) sort, or a specific Greek 
(language) sort. When you sort a mixed list of IPA and Greek, the beta 
and chi will now sort with the Latin characters, in whatever sort order 
applies for IPA. That means the order of all Greek words in the list 
will get messed up. It will neither be a generic Greek (script) sort, 
nor a specific Greek (language) sort, because you can't tailor the same 
characters two different ways in the same sort.
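The grouping behaviour described above can be imitated in a few lines; a rough sketch (Python standard library; a real implementation would use UCA/ICU collation and the Script property rather than guessing a script from character names):

```python
import unicodedata

def script_of(ch):
    # Crude script guess from the first word of the character name
    # ("LATIN", "GREEK", "CYRILLIC", ...); illustration only.
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

SCRIPT_ORDER = {"LATIN": 0, "GREEK": 1, "CYRILLIC": 2}  # arbitrary demo order

def mixed_sort_key(word):
    # Group by the script of the first letter, then by codepoint order.
    return (SCRIPT_ORDER.get(script_of(word[0]), 99), word)

words = ["слово", "zèbre", "λέξη", "mot"]
print(sorted(words, key=mixed_sort_key))  # Latin, then Greek, then Cyrillic

# The IPA problem in miniature: beta carries the GREEK script guess, so
# an IPA transcription containing it is pulled toward the Greek group.
print(script_of("\u03b2"))  # GREEK
```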


That's the problem I understand is behind the issue with the Kurdish Q 
and W, and with the character pair proposed for disunification for Janalif.


Perhaps, it seems, there are some technical problems that would make the 
support for such mixed-script orthographies not as seamless as for 
regular orthographies after all.


In that case, a decision would boil down to whether these technical 
issues are significant enough (given the usage).


In other words, it becomes a cost-benefit analysis. Duplication of 
characters (except where their glyphs have acquired a different 
appearance in the other context) always has a cost in added 
confusability. Users can select the wrong character accidentally, 
spoofers can do so intentionally to try to cause harm. But Unicode was 
never just a list of distinct glyphs, so duplication between Latin and 
Greek, or Latin and Cyrillic is already widespread, especially among the 
capitals.


Unlike what Michael claims for IPA, the Janalif characters don't seem to 
have a very different appearance, so there would not be any technical or 
usability issue there. Minor glyph variations can be handled by standard 
technologies, like OpenType, as long as the overall appearance remains 
legible should language binding of a text have gotten lost.


That seems to be true for IPA as well - because already, if you use the 
font binding for IPA, your a's and g's will not come out right, which 
means you don't even have to worry about betas and chis.


IPA being a notation, I would not be surprised to learn that mixed lists 
with both IPA and other terms are a rare thing. But for Janalif it would 
seem that mixed Janalif/Cyrillic lists would be rather common, relative 
to the size of the corpus, even if it's a dead (or currently out of use) 
orthography.


I'd like to see this addressed a bit more in detail by those who support 
the decision to keep the borrowed characters unified.


A./


Re: Are Latin and Cyrillic essentially the same script?

2010-11-19 Thread Asmus Freytag

On 11/18/2010 11:15 PM, Peter Constable wrote:

If you'd like a precedent, here's one:


Yes, I think discussion of precedents is important - it leads to the 
formulation of encoding principles that can then (hopefully) result in 
more consistency in future encoding efforts.


Let me add the caveat that I fully understand that character encoding 
doesn't work by applying cook-book style recipes, and that principles 
are better phrased as criteria for weighing a decision rather than as 
formulaic rules.


With these caveats, then:

  IPA is a widely-used system of transcription based primarily on the Latin 
script. In comparison to the Janalif orthography in question, there is far more 
existing data. Also, whereas that Janalif orthography is no longer in active 
use--hence there are not new texts to be represented (there are at best only 
new citations of existing texts), IPA is a writing system in active use with 
new texts being created daily; thus, the body of digitized data for IPA is 
growing much faster than is data in the Janalif orthography. And while IPA is 
primarily based on Latin script, not all of its characters are Latin 
characters: bilabial and interdental fricative phonemes are represented using 
Greek letters beta and theta.


IPA has other characteristics in both its usage and its encoding that 
you need to consider to make the comparison valid.


First, IPA requires specialized fonts because it relies on glyphic 
distinctions that fonts not designed for IPA use will not guarantee. 
(Latin a with and without hook, g with hook vs. two stories are just two 
examples). It's also a notational system that requires specific training 
in its use, and it is caseless - in distinction to ordinary Latin script.


While several orthographies have been based on IPA, my understanding is 
that some of them saw the encoding of additional characters to make them 
work as orthographies.


Finally, IPA, like other phonetic notations, uses distinctions between 
letter forms on the character level that would almost always be 
relegated to styling in ordinary text.


Because of these special aspects of IPA, I would class it in its own 
category of writing systems which makes it less useful as a precedent 
against which to evaluate general Latin-based orthographies.



Given a precedent of a widely-used Latin writing system for which it is 
considered adequate to have characters of central importance represented using 
letters from a different script, Greek, it would seem reasonable if someone 
made the case that it's adequate to represent an historic Latin orthography 
using Cyrillic soft sign.


I think the question can and should be asked, what is adequate for a 
historic orthography. (I don't know anything about the particulars of 
Janalif, beyond what I read here, so for now, I accept your 
categorization of it as if it were fact).


The precedent for historic orthographies is a bit uneven in Unicode. 
Some scripts have extensive collection of characters (even duplicates or 
near duplicates) to cover historic usage. Other historic orthographies 
cannot be fully represented without markup. And some are now better 
supported than at the beginning because the encoding has plugged certain 
gaps.


A helpful precedent in this case would be that of another minority or 
historic orthography, or historic minority orthography for which the use 
of Greek or Cyrillic characters with Latin was deemed acceptable. I 
don't think Janalif is totally unique (although the others may not be 
dead). I'm thinking of the Latin OU that was encoded based on a Greek 
ligature, and the perennial question of the Kurdish Q and W (Latin 
borrowings into Cyrillic - I believe these are now 051A and 051C). 
Again, these may be for living orthographies.


   Against this backdrop, it would help if WG2 (and UTC) could point
   to agreed upon criteria that spell out what circumstances should
   favor, and what circumstances should disfavor, formal encoding of
   borrowed characters, in the LGC script family or in the general case.


That's the main point I'm trying to make here. I think it is not enough 
to somehow arrive at a decision for one orthography, but it is necessary 
for the encoding committees to grab hold of the reasoning behind that 
decision and work out how to apply consistent reasoning like that in 
future cases.


This may still feel a little bit unsatisfactory for those whose proposal 
is thus becoming the test-case to settle a body of encoding principles, 
but to that I say, there's been ample precedent for doing it that way in 
Unicode and 10646.


So let me ask these questions:

   A. What are the encoding principles that follow from the disposition
   of the Janalif proposal?

   B. What precedents are these based on resp. what precedents are
   consciously established by this decision?


A./




RE: Are Latin and Cyrillic essentially the same script?

2010-11-19 Thread Peter Constable
From: Asmus Freytag [mailto:asm...@ix.netcom.com] 

 IPA has other characteristics in both its usage and its encoding that you 
 need to consider to make the comparison valid.

 First, IPA requires specialized fonts because it relies on glyphic 
 distinctions 
 that fonts not designed for IPA use will not guarantee.

And historic texts aren’t as likely or unlikely to require specialized fonts?


 It's also a notational system that requires specific training in its use, 

And working with historic texts doesn’t require specific training?

 and it  is caseless - in distinction to ordinary Latin script.

I could understand how that might be relevant if we were discussing a character 
borrowed from another script but with different casing behaviour in the 
original script. (E.g., the character is caseless in the original script, or it 
is cased but only the lowercase was borrowed and a novel uppercase character was 
created in the receptor script. This was a valid consideration in the encoding 
of Lisu, for instance.) I don’t really see how that impacts the discussion in 
this particular case. 


 While several orthographies have been based on IPA, my understanding is 
 that some of them saw the encoding of additional characters to make them 
 work as orthographies.

Again, I don’t see how that impacts this particular case.


 Finally, IPA, like other phonetic notations, uses distinctions between letter 
 forms on the character level that would almost always be relegated to styling 
 in ordinary text.

And again, I don’t see how this impacts the particular case under discussion.


 Because of these special aspects of IPA, I would class it in its own category 
 of writing systems which makes it less useful as a precedent against which to 
 evaluate general Latin-based orthographies.

Perhaps in general it cannot serve as a precedent for all things. But as noted, 
I think several of the things you noted have no particular bearing in this 
case. For the specific issue of borrowing a character from another script in a 
historic orthography, I think it’s a perfectly valid precedent. It boils down 
to this: just as there aren’t technical or usability reasons that make it 
problematic to represent IPA text using two Greek characters in an 
otherwise-Latin system, so also there are no technical or usability reasons I’m 
aware of why it is problematic to represent this historic Janalif orthography 
using two Cyrillic characters.

Btw, I suspect that calling these Latin characters is completely revisionist: 
if we could ask anyone that taught or used this orthography in 1930 about these 
characters, I suspect they would say that they are Cyrillic characters.


 I think the question can and should be asked, what is adequate for a historic 
 orthography.

Clearly you’re trying to have a discussion about general principles, not about 
the specific characters. At the moment, I’m prepared to discuss general 
principles to the extent that they impinge on the particular case at hand. 
Others may wish to engage in a broader discussion of general principles 
(though, hopefully under a different subject).

 Against this backdrop, it would help if WG2 (and UTC) could point to agreed 
 upon criteria that spell out what circumstances should favor, and what 
 circumstances should disfavor, formal encoding of borrowed characters, in the 
 LGC script family or in the general case.

 That's the main point I'm trying to make here. I think it is not enough to 
 somehow 
 arrive at a decision for one orthography, but it is necessary for the 
 encoding 
 committees to grab hold of the reasoning behind that decision and work out 
 how 
 to apply consistent reasoning like that in future cases.

These are not unreasonable requests. I don’t see any inconsistency in practice 
as it relates to this particular case, however.

 So let me ask these questions:
 A. What are the encoding principles that follow from the disposition of the 
 Janalif 
 proposal?

I think one principle is that we do not always have to maintain a principle of 
orthographic script purity. In particular, in the case of historic 
orthographies no longer in active use that borrowed characters from another 
script in the LGC family, if there are no technical or usability reasons that 
make it problematic to represent those text elements using existing characters 
from the source script, then it is not necessary to encode equivalents in the 
receptor script so that we can say that the historic orthography is a 
pure-Latin / pure-Greek / pure-Cyrillic orthography (which, in terms of social 
history rather than character encoding, would likely be a revisionist 
perspective).


 B. What precedents are these based on resp. what precedents are consciously 
 established by this decision?

I'm not sure I fully understand the question so won't venture a comment.



Peter




RE: Are Latin and Cyrillic essentially the same script?

2010-11-18 Thread Peter Constable
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of André Szabolcs Szelp

 AFAIR the reservations of WG2 concerning the encoding of Jangalif 
 Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but 
 rather in view of its potential identity with the tone sign mentioned 
 by you as well. It is a Latin letter adapted from the Cyrillic soft sign, 

There's another possible point of view: that it's a Cyrillic character that, 
for a short period, people tried using as a Latin character but that never 
stuck, and that it's completely adequate to represent Janalif text in that 
orthography using the Cyrillic soft sign.



Peter




Re: Are Latin and Cyrillic essentially the same script?

2010-11-18 Thread Asmus Freytag

On 11/18/2010 8:04 AM, Peter Constable wrote:

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of André Szabolcs Szelp


AFAIR the reservations of WG2 concerning the encoding of Jangalif
Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but
rather in view of its potential identity with the tone sign mentioned
by you as well. It is a Latin letter adapted from the Cyrillic soft sign,

There's another possible point of view: that it's a Cyrillic character that, 
for a short period, people tried using as a Latin character but that never 
stuck, and that it's completely adequate to represent Janalif text in that 
orthography using the Cyrillic soft sign.




When one language borrows a word from another, there are several stages 
of foreignness, ranging from treating the foreign word as a short 
quotation in the original language to treating it as essentially fully 
native.


Now words are very complex in behavior and usage compared to characters. 
You can check pronunciation, spelling, and adaptation to the host 
grammar to see which stage of adaptation a word has reached.


When a script borrows a letter from another, you are essentially limited 
in what evidence you can use to document objectively whether the 
borrowing has crossed over the script boundary and the character has 
become native.


With typographically closely related scripts, getting tell-tale 
typographical evidence is very difficult. After all, these scripts 
started out from the same root.


So, you need some other criteria.

You could individually compare orthographies and decide which ones are 
important enough (or established enough) to warrant support. Or you 
could try to distinguish between orthographies for general use within 
the given language, vs. other systems of writing (transcriptions, say).


But whatever you do, you should be consistent and take account of 
existing precedent.


There are a number of characters encoded as nominally Latin in Unicode 
that are borrowings from other scripts, usually Greek.


A discussion of the current issue should include explicit explanation of 
why these precedents apply or do not apply, and, in the latter case, why 
some precedents may be regarded as examples of past mistakes.


By explicitly analyzing existing precedents, it should be possible to 
avoid the impression that the current discussion is focused on the 
relative merits of a particular orthography based on personal and 
possibly arbitrary opinions by the work group experts.


If it can be shown that all other cases where such borrowings were 
accepted into Unicode are based on orthographies that are more 
permanent, more widespread or both, or where other technical or 
typographical reasons prevailed that are absent here, then it would make 
any decision on the current request seem a lot less arbitrary.


I don't know where the right answer lies in the case of Janalif, or 
which point of view, in Peter's phrasing, would make the most sense, but 
having this discussion without clear understanding of the precedents 
will lead to inconsistent encoding.


A./



pupil's comment: Are Latin and Cyrillic essentially the same script?

2010-11-18 Thread JP Blankert (thuis PC based)

Dear all,

I still see myself as a pupil reading the introductory charts of Unicode, but 
I am happy to join the discussion on Russian: it is quite different from 
Latin. Apart from the Russian alphabet having 33 characters, and apart from 
quite a few characters that, as an English speaker, you clearly do not know, 
Latin and Russian do contain some similar-looking characters. But watch out: 
there are, if I am correct, three a's in the world; in this email a (Latin) 
looks like а (Russian), but they are different. So the Russian а is well 
suited for a homoglyph attack (I will try ontslag.com, which is Dutch for 
dismissal.com, with a Russian а, to see how search engines react; the 
punycode differs from that of the all-Latin word).
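The look-alike letters involved can be made concrete (Python standard library; the IDNA encoding shown is the stdlib's IDNA2003 codec):

```python
import unicodedata

# Latin a and Cyrillic a are distinct codepoints, so string equality,
# search engines and registries all treat them as different characters.
latin_a, cyrillic_a = "a", "\u0430"
print(latin_a == cyrillic_a)         # False: homoglyphs, not the same
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A

# A label with one Cyrillic letter punycode-encodes quite differently
# from its all-Latin look-alike:
print("ontslag".encode("idna"))       # b'ontslag'
print("\u043entslag".encode("idna"))  # b'xn--...' (punycode form)
```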


A similar example: the Ukrainian i looks like ours, but you can't register it 
under .rf (Russian Federation).


An experiment one year ago with Reïntegratie.com (correct Dutch for 
reintegration, but impossible as a domain name under .nl because SIDN.nl, 
supposed to be nic.nl, is very conservative and does not even allow such 
signs) gave as a result: in the beginning Google appreciated it, but after a 
few months the hosted and filled site 'sank'. (I borrowed the ï from Catalan, 
amidst Latin characters.)


News about ss / ß for whoever is interested: most Germans were alert 
(ss-holders had priority to ß), so no Fußball for me, but only the 
experimental domain names IDNexpress.de and IDNexpreß.de. It was a 
mini-landrush on Nov. 16, 2010, from 10:00 German time onwards (Denic.de).

I am very busy with the .rf auction now; in December I will put two different 
sites on these ss and ß names so people can wonder at their screens to 
see what is happening.


The above reaction comes more from domain names and practical experience 
than from the UTF charts - but definitely: a different script.


Br,

Philippe


On 18-11-2010 20:04, Asmus Freytag wrote:

On 11/18/2010 8:04 AM, Peter Constable wrote:
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] 
On Behalf Of André Szabolcs Szelp



AFAIR the reservations of WG2 concerning the encoding of Jangalif
Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but
rather in view of its potential identity with the tone sign mentioned
by you as well. It is a Latin letter adapted from the Cyrillic soft 
sign,
There's another possible point of view: that it's a Cyrillic 
character that, for a short period, people tried using as a Latin 
character but that never stuck, and that it's completely adequate to 
represent Janalif text in that orthography using the Cyrillic soft sign.





When one language borrows a word from another, there are several 
stages of foreignness, ranging from treating the foreign word as a 
short quotation in the original language to treating it as essentially 
fully native.


Now words are very complex in behavior and usage compared to 
characters. You can check for pronunciation, spelling and adaptation 
to the host grammar to check which stage of adaptation a word has 
reached.


When a script borrows a letter from another, you are essentially 
limited in what evidence you can use to document objectively whether 
the borrowing has crossed over the script boundary and the character 
has become native.


With typographically closely related scripts, getting tell-tale 
typographical evidence is very difficult. After all, these scripts 
started out from the same root.


So, you need some other criteria.

You could individually compare orthographies and decide which ones are 
important enough (or established enough) to warrant support. Or 
you could try to distinguish between orthographies for general use 
within the given language, vs. other systems of writing 
(transcriptions, say).


But whatever you do, you should be consistent and take account of 
existing precedent.


There are a number of characters encoded as nominally Latin in 
Unicode that are borrowings from other scripts, usually Greek.


A discussion of the current issue should include explicit explanation 
of why these precedents apply or do not apply, and, in the latter 
case, why some precedents may be regarded as examples of past mistakes.


By explicitly analyzing existing precedents, it should be possible to 
avoid

RE: Are Latin and Cyrillic essentially the same script?

2010-11-18 Thread Peter Constable
If you'd like a precedent, here's one: IPA is a widely-used system of 
transcription based primarily on the Latin script. In comparison to the Janalif 
orthography in question, there is far more existing data. Also, whereas that 
Janalif orthography is no longer in active use--hence there are no new texts 
to be represented (there are at best only new citations of existing texts)--IPA 
is a writing system in active use, with new texts being created daily; thus, 
the body of digitized data for IPA is growing much faster than the data in the 
Janalif orthography. And while IPA is primarily based on Latin script, not all 
of its characters are Latin characters: the bilabial and interdental fricative 
phonemes are represented using the Greek letters beta and theta.

Given the precedent of a widely-used Latin writing system for which it is 
considered adequate to have characters of central importance represented using 
letters from a different script, Greek, it would seem reasonable if someone 
made the case that it's adequate to represent an historic Latin orthography 
using the Cyrillic soft sign.


Peter


-Original Message-
From: Asmus Freytag [mailto:asm...@ix.netcom.com] 
Sent: Thursday, November 18, 2010 11:05 AM
To: Peter Constable
Cc: André Szabolcs Szelp; Karl Pentzlin; unicode@unicode.org; Ilya Yevlampiev
Subject: Re: Are Latin and Cyrillic essentially the same script?

On 11/18/2010 8:04 AM, Peter Constable wrote:
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] 
 On Behalf Of André Szabolcs Szelp

 AFAIR the reservations of WG2 concerning the encoding of Jangalif 
 Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but 
 rather in view of its potential identity with the tone sign mentioned 
 by you as well. It is a Latin letter adapted from the Cyrillic soft 
 sign,
 There's another possible point of view: that it's a Cyrillic character that, 
 for a short period, people tried using as a Latin character but that never 
 stuck, and that it's completely adequate to represent Janalif text in that 
 orthography using the Cyrillic soft sign.



When one language borrows a word from another, there are several stages of 
foreignness, ranging from treating the foreign word as a short quotation in 
the original language to treating it as essentially fully native.

Now words are very complex in behavior and usage compared to characters. 
You can check for pronunciation, spelling and adaptation to the host grammar to 
check which stage of adaptation a word has reached.

When a script borrows a letter from another, you are essentially limited in 
what evidence you can use to document objectively whether the borrowing has 
crossed over the script boundary and the character has become native.

With typographically closely related scripts, getting tell-tale typographical 
evidence is very difficult. After all, these scripts started out from the same 
root.

So, you need some other criteria.

You could individually compare orthographies and decide which ones are 
important enough (or established enough) to warrant support. Or you could 
try to distinguish between orthographies for general use within the given 
language, vs. other systems of writing (transcriptions, say).

But whatever you do, you should be consistent and take account of existing 
precedent.

There are a number of characters encoded as nominally Latin in Unicode that 
are borrowings from other scripts, usually Greek.

A discussion of the current issue should include explicit explanation of why 
these precedents apply or do not apply, and, in the latter case, why some 
precedents may be regarded as examples of past mistakes.

By explicitly analyzing existing precedents, it should be possible to avoid the 
impression that the current discussion is focused on the relative merits of a 
particular orthography based on personal and possibly arbitrary opinions by the 
work group experts.

If it can be shown that all other cases where such borrowings were accepted 
into Unicode are based on orthographies that are more permanent, more 
widespread or both, or where other technical or typographical reasons prevailed 
that are absent here, then it would make any decision on the current request 
seem a lot less arbitrary.

I don't know where the right answer lies in the case of Janalif, or which point 
of view, in Peter's phrasing, would make the most sense, but having this 
discussion without clear understanding of the precedents will lead to 
inconsistent encoding.

A./





Re: Are Latin and Cyrillic essentially the same script?

2010-11-17 Thread André Szabolcs Szelp
AFAIR the reservations of WG2 concerning the encoding of Jangalif
Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but
rather in view of its potential identity with the tone sign mentioned
by you as well. It is a Latin letter adapted from the Cyrillic soft
sign, like the Jangalif character. Function, as you point out, is not
a distinctive feature. The different serif style which you pointed out
cannot be seen as a discriminating feature of character identity,
especially not in a time of bad typography (and an actual lack of Latin
typographic tradition in the China of the time).


/Sz

On Wed, Nov 10, 2010 at 5:08 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote:
 As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf
 = L2/10-356, there exists a Latin letter which resembles the Cyrillic
 soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif
 variant of the alphabet, which was used for several languages in the
 former Soviet Union (e.g. Tatar), and was developed in parallel to the
 alphabets nowadays in use for Turkish and Azerbaijani, see:
 http://en.wikipedia.org/wiki/Janalif .
 In fact, it was proposed on this base, being the only Jaꞑalif letter
 missing so far, since the ꞑ (occurring in the alphabet name itself)
 was introduced with Unicode 6.0.

 The letter is no soft sign; it is the exact Tatar equivalent of the
 Turkish dotless i, thus it has a similar use as the Cyrillic yeru
 Ы/ы (U+042B/U+044B).

 In this function, it is a part of the adaptation of the Latin alphabet
 for a lot of non-Russian languages in the Soviet Union in the 1920s,
 see e.g.: Юшманов, Н. В.: Определитель Языков. Москва/Ленинград 1941,
 http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 .
 (A proposal regarding this subject is expected for 2011.)

 Thus, it shares with the Cyrillic soft sign its form and partly the
 geographical area of its use, but in no case its meaning. Similar can
 be said e.g. for P/p (U+0050/U+0070, Latin letter P) and Р/р
 (U+0420/U+0440, Cyrillic letter ER).

 According to the pre-preliminary minutes of UTC #125 (L2/10-415),
 the UTC has not accepted the Latin Ь/ь.

 It is an established practice for the European alphabetic scripts to
 encode a new letter only if it has a different shape (in at least one
 of the capital and small forms) with respect to all already encoded
 letters of the same script. The Y/y is well known to denote completely
 different pronunciations, used as a consonant as well as a vowel, even
 within the same language. Thus, if somebody unearths a Latin letter E/e
 in some obscure minority language which has no E-like vowel, to denote
 an M-like sound and in fact be collated after the M in the local
 alphabet, this will probably not lead to a new encoding.

 But, Latin and Cyrillic are different scripts (the question in the Re
 of this mail is rhetorical, of course).

 Admittedly, there is also a precedent for using Cyrillic letters in
 Latin text: the use of U+0417/U+0437 and U+0427/U+0447 for tone
 letters in Zhuang. However, the orthography using them was
 short-lived, being superseded by another Latin orthography which uses
 genuine Latin letters as tone marks (J/j and X/x, in this case).

 On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь
 did not lose the Ь/ь by an improvement of the orthography, but were
 completely deprecated by an ukase of Stalin. Thus, they continue to be
 the Latin alphabets of the respective languages.
 Whether formally requesting a revival or not, they are regarded as valid
 by the members of the cultural group (even if only to access their cultural
 inheritance).
 Especially, it cannot be excluded that persons want to create Latin domain
 names or e-mail addresses without being accused of script mixing.

 Taking this into account, not mentioning the technical problems
 regarding collation etc. and the typographical issues when it comes to
 subtle differences between Latin and Cyrillic in high quality
 typography, it is really hard to understand why the UTC refuses to encode
 the Latin Ь/ь.

 A quick glance at the Юшманов table mentioned above proves that there
 is absolutely no request to duplicate the whole Cyrillic alphabet in
 Latin, as someone may have feared.

 - Karl Pentzlin






-- 
Szelp, André Szabolcs

+43 (650) 79 22 400




Are Latin and Cyrillic essentially the same script?

2010-11-10 Thread Karl Pentzlin
As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf
= L2/10-356, there exists a Latin letter which resembles the Cyrillic
soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif
variant of the alphabet, which was used for several languages in the
former Soviet Union (e.g. Tatar), and was developed in parallel to the
alphabets nowadays in use for Turkish and Azerbaijani, see:
http://en.wikipedia.org/wiki/Janalif .
In fact, it was proposed on this basis, being the only Jaꞑalif letter
missing so far, since the ꞑ (occurring in the alphabet name itself)
was introduced with Unicode 6.0.

The letter is not a soft sign; it is the exact Tatar equivalent of the
Turkish dotless i, and thus has a use similar to that of the Cyrillic yeru
Ы/ы (U+042B/U+044B).

In this function, it is part of the adaptation of the Latin alphabet
for many non-Russian languages in the Soviet Union in the 1920s,
see e.g.: Юшманов, Н. В.: Определитель Языков. Москва/Ленинград 1941,
http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 .
(A proposal regarding this subject is expected for 2011.)

Thus, it shares with the Cyrillic soft sign its form and partly the
geographical area of its use, but in no case its meaning. The same can
be said e.g. of P/p (U+0050/U+0070, Latin letter P) and Р/р
(U+0420/U+0440, Cyrillic letter ER).
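The separate encoding of such lookalikes is easy to demonstrate from the Unicode character database. A minimal editorial sketch in Python (stdlib `unicodedata`):

```python
import unicodedata

# Latin P/p and Cyrillic ER are visually identical in most fonts,
# yet each lives at its own code point in its own script.
for ch in ("P", "\u0420", "p", "\u0440", "\u044C"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# The last entry, U+044C CYRILLIC SMALL LETTER SOFT SIGN, has no
# encoded Latin counterpart -- which is the point under dispute.
```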

According to the pre-preliminary minutes of UTC #125 (L2/10-415),
the UTC has not accepted the Latin Ь/ь.

It is an established practice for the European alphabetic scripts to
encode a new letter only if it has a different shape (in at least one
of the capital and small forms) with respect to all already encoded
letters of the same script. The Y/y is well known to denote completely
different pronunciations, used as a consonant as well as a vowel, even
within the same language. Thus, if somebody unearths a Latin letter E/e
in some obscure minority language which has no E-like vowel, to denote
an M-like sound and in fact be collated after the M in the local
alphabet, this will probably not lead to a new encoding.

But, Latin and Cyrillic are different scripts (the question in the Re
of this mail is rhetorical, of course).

Admittedly, there is also a precedent for using Cyrillic letters in
Latin text: the use of U+0417/U+0437 and U+0427/U+0447 for tone
letters in Zhuang. However, the orthography using them was
short-lived, being superseded by another Latin orthography which uses
genuine Latin letters as tone marks (J/j and X/x, in this case).

On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь
did not lose the Ь/ь by an improvement of the orthography, but were
completely deprecated by an ukase of Stalin. Thus, they continue to be
the Latin alphabets of the respective languages.
Whether a revival is formally requested or not, they are regarded as valid
by the members of the cultural group (even if only to access their cultural
heritage).
Especially, it cannot be excluded that persons want to create Latin domain
names or e-mail addresses without being accused of script mixing.
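The script-mixing concern Karl raises can be illustrated with a rough check. Python's standard library does not expose the Unicode Script property, so this editorial sketch keys off character names; a real IDN checker would use proper Script data (e.g. a regex engine with script classes):

```python
import unicodedata

def scripts_in(label):
    r"""Rough per-character script tally for a label.

    Keys off UCD character names as a stand-in for the Script
    property; a production checker would use \p{Script=...} data.
    """
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC"):
            if name.startswith(script):
                found.add(script)
    return found

print(sorted(scripts_in("tatar")))        # ['LATIN']
print(sorted(scripts_in("tatar\u044C")))  # ['CYRILLIC', 'LATIN'] -- mixed
```

A label mixing Latin letters with the Cyrillic soft sign would thus be flagged, which is exactly the scenario Karl describes.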

Taking this into account, not to mention the technical problems
regarding collation etc. and the typographical issues when it comes to
subtle differences between Latin and Cyrillic in high-quality
typography, it is really hard to understand why the UTC refuses to encode
the Latin Ь/ь.

A quick glance at the Юшманов table mentioned above proves that there
is absolutely no request to duplicate the whole Cyrillic alphabet in
Latin, as someone may have feared.

- Karl Pentzlin




Re: Are Latin and Cyrillic essentially the same script?

2010-11-10 Thread Karl Pentzlin
2010-11-10 10:08, I wrote:

KP As shown in N3916 ...

Please read vowel instead of vocal throughout the mail. Sorry.




Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-10 Thread William_J_G Overington
Thank you for replying.
 
On Saturday, 7 August 2010, Doug Ewell d...@ewellic.org wrote:
 
 I think the alternate ending glyph is supposed to be
 specified in more detail than that.  The example Asmus
 gave was U+222A UNION with serifs. Even though the exact
 proportions of the serifs may differ from one font to the
 next, this is still a relatively precise and constrained
 definition, unlike Latin small letter e with some
 'alternate ending' which is completely up to the discretion
 of the font designer.
 
 Because of stylistic differences among calligraphers—this
 is a calligraphy question, not a poetry question—it is
 hard to imagine how this aspect of the proposal would not
 result in an unbounded number of glyphic variations. 
 'e' is not the only letter to which calligraphers like to
 attach special endings, and a swash cross-stroke is not the
 only special ending that calligraphers like to attach to
 'e'.
 
 
It seems to me that there are at least two ways to have an alternate ending e. 
One is to extend the cross-stroke to the right beyond the e and end the 
extension with a flourish of some sort, another is to extend the lower line out 
to the right and end that extension in some way. I can imagine that a proposal 
would lead to wanting to be able to express a choice of the two, or more, 
possible variants of a letter, should the font have alternate glyphs of both 
types. Then there is the question of what is to happen if the requested one is 
not available in the font: does the other alternate glyph become displayed or 
does the basic character glyph become displayed?
 
 I'd like to see an FAQ page on What is Plain Text?
 written primarily by UTC officers.  That might go a
 long way toward resolving the differences between William's
 interpretation of what plain text is, which people like me
 think is too broad, and mine, which some people have said is
 too narrow.
 
That is a good idea.
 
Thank you also for the careful precision with which you describe the situation 
of who thinks what.
 
Yet is producing such a document an impossible task? Some years ago there was a 
suggestion in this mailing list to produce a Frequently Asked Questions (FAQ) 
page about what should not be encoded. Is the document that is now suggested 
effectively the same thing?
 
I thought of an analogy of trying to produce a FAQ document of What is art?. 
Such a document produced in 1550 might well have been very different from one 
produced in 1910, and those different from one produced in 1995 and those all 
different from one produced in 2010. Maybe the analogy is not perfect, but it 
seems to convey the meaning to me that if a What is Plain Text? document is 
produced, with a view to being able to decide what could and could not in the 
future be encoded in Unicode as plain text, then it could quickly become either 
out of date or a restriction of progress in technology. The recent encoding of 
the emoticons shows a dramatic change in what can be encoded as plain text from 
the situation some years ago. Some of my ideas have been refuted as not being 
suitable for encoding in plain text. Yet the refutation all seems to be based 
on unchangeable rules from about twenty years ago.
 
Yet change is part of progress.
 
I remember once being referred, in this mailing list, to an ISO document about 
encoding. The document made reference to a definition of character within the 
same document.
 
The document was ISO/IEC TR 15285.
 
I have found that the document is available here (the link used at the previous 
time no longer works).
 
http://openstandards.dk/jtc1/sc2/wg2/docs/TR%2015285%20-%20C027163e.pdf
 
The introduction includes the following.
 
quote
 
This Technical Report is written for a reader who is familiar with the work of 
SC 2 and SC 18. Readers without this background should first read Annex B, 
“Characters”, and Annex C, “Glyphs”.
 
end quote
 
Annex B has the following.
 
quote
 
In ISO/IEC 10646-1:1993, SC 2 defines a character as:
 
A member of a set of elements used for the organisation, control, and 
representation of data.
 
end quote
 
On the accessing of alternate glyphs from plain text, I feel that as there are 
256 variation selectors that could be used with each of the Latin letters, 
then, provided that no harm is done to those who choose not to use them, 
some should be encoded so that alternate glyphs can be accessed from fonts.
 
Some readers might find the following of interest.
 
http://forum.high-logic.com/viewtopic.php?f=36&t=2229
 
It is a thread entitled An unusual glyph of an Esperanto character in the Arno 
font.
 
I had been looking through the following document.
 
http://store1.adobe.com/type/browser/pdfs/ARNP/ArnoPro-Italic.pdf
 
I had found an alternate ending glyph for the h circumflex character and had 
then tried to produce some text where it could be used.
 
I felt that it was a situation of typography inspiring creative writing.
 
Readers who enjoyed that thread might also 

Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-09 Thread John H. Jenkins

On Aug 7, 2010, at 10:40 AM, Doug Ewell wrote:

 I'd like to see an FAQ page on What is Plain Text? written primarily by UTC 
 officers.  That might go a long way toward resolving the differences between 
 William's interpretation of what plain text is, which people like me think is 
 too broad, and mine, which some people have said is too narrow.
 

Well, we do have http://www.unicode.org/faq/ligature_digraph.html#10 and 
related FAQs?

The basic idea is that plain text is the minimum amount of information to 
process the given language in a normal way.  FOR EXAMPLE, ALTHOUGH ENGLISH 
CAN BE WRITTEN IN ALL-CAPS, IT USUALLY ISN'T, AND DOING IT LOOKS WRONG.  We 
therefore have both upper- and lower-case letters for English.  On the other 
hand, although English *is* usually written with some facility to provide 
emphasis, different media have different ways of providing that facility 
(asterisks, underlining, italicizing), and English written without any of these 
looks perfectly fine.  

Arabic, on the other hand, absolutely must have some way of allowing for 
different letter shapes in different contexts, or it looks just wrong, so 
Arabic plain text must have facility to allow for that, either by explicitly 
having different characters for the different shapes the letters take, or by 
providing a default layout algorithm that defines them.  

Beyond rendering, there are also considerations as to the minimal amount of 
information necessary for other text-based processes, such as sorting, 
searching, and text-to-speech.

Yes, there are issues which end up being judgment calls, and it's easy to come 
up with cases where you can't really capture the full semantic intent of the 
author without what Unicode calls rich text.  My favorite example is The 
Mouse's Tale in _Alice in Wonderland_.   Plain text isn't intended to capture 
all the nuances of the original's semantics, but to provide at the least a very 
close approximation.

Variation selectors are intended to cover cases where more information is 
needed for rendering than is required for other processes such as searching 
(Mongolian), or cases where different user communities disagree on whether two 
forms must be unified or must be deunified.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-09 Thread Jukka K. Korpela

John H. Jenkins wrote:


The basic idea is that plain text is the minimum amount of
information to process the given language in a normal way.


That's a bit vague. We don't normally process languages; we read texts. 
Whether font or color variation is essential for understanding really 
depends on the author's purposes and choices, not on the language.



FOR
EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY
ISN'T, AND DOING IT LOOKS WRONG.


I wouldn't say it looks wrong. Surely it is often typographically poor or 
just stupid, but it might be a consequence of technical limitations (there 
are still loads of systems that make no case distinction in texts, so in any 
relevant aspect, they are effectively uppercase-only), and all-caps 
English is quite understandable, though boring to read, provided that some 
precautions are taken by writers.



We therefore have both upper- and
lower-case letters for English.


It's just a distinction that you _can_ (and usually do) make in plain text 
English. It's not an inherent distinction: all-caps English is still 
English, though poorly written by modern standards.



Arabic, on the other hand, absolutely must have some way of allowing
for different letter shapes in different contexts, or it looks just
wrong, so Arabic plain text must have facility to allow for that,
either by explicitly having different characters for the different
shapes the letters take, or by providing a default layout algorithm
that defines them.


But layout algorithms are not part of character encoding or part of the 
definition of plain text. It's not OK to render plain text Arabic, encoded 
at logical level (i.e., letters encoded abstractly and not as contextual 
forms), in a simplistic manner that uses a one letter - one glyph model. But 
that's not part of the definition of plain text at all.



Yes, there are issues which end up being judgment calls, and it's
easy to come up with cases where you can't really capture the full
semantic intent of the author without what Unicode calls rich text.


We don't need to invent contrived examples for that. Every time an author 
uses italics or bolding to make an essential point in emphasizing something 
he does something that cannot be captured in a plain version of the text. To 
make an even simpler point, if you insert an essential content image into a 
document you step outside the realm of plain text.


I don't see any better definition for plain text than a negative one: it 
is text without formatting, except to the extent that forced line breaks and 
the choice of alternative forms for a character (to the extent that such 
differences are encoded in the character code) can be considered as 
formatting. Plain text, though apparently a very simple concept, is a very 
abstract one. I don't think you can explain the concept to your neighbor 
while standing on one foot, if at all.


Human writing did not originate as plain text, and at the surface level, it 
is never plain text: it always has some specific physical appearance, and 
abstract plain text can only be found below the surface, as the underlying 
data format where only character identities (character numbers in a specific 
code) are encoded, with no reference to a particular rendering.


--
Yucca, http://www.cs.tut.fi/~jkorpela/ 





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-08 Thread timpart
karl-pentz...@acssoft.de wrote:
 I have compiled a draft proposal:
 Proposal to add Variation Sequences for Latin and Cyrillic letters

There are 256 selectors, but the proposal only suggests numbering up to 16, 
effectively deprecating the others. Surely we want all 256?
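For reference, the 256 variation selectors sit in two blocks of the codespace, which a quick editorial sketch in Python can enumerate:

```python
import unicodedata

# VS1-VS16 occupy U+FE00..U+FE0F; VS17-VS256 occupy U+E0100..U+E01EF.
selectors = [chr(c) for c in range(0xFE00, 0xFE10)]
selectors += [chr(c) for c in range(0xE0100, 0xE01F0)]
assert len(selectors) == 256

print(unicodedata.name(selectors[0]))   # VARIATION SELECTOR-1
print(unicodedata.name(selectors[-1]))  # VARIATION SELECTOR-256
```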

The Mongolian selectors alter the appearance of the glyph displayed after the 
character has been evaluated for position in the word and a series of complex 
rules applied. The user will normally only have to use the selectors in 
exceptional cases. The selectors are only valid in certain positional cases and 
have been somewhat arbitrarily assigned. It is not the case that selector 1 
selects the same alternative form in all positions.

A typical user will see most of the variations in use from the built-in rules 
being applied. There is not a user entity which would be considered variant 1 
and which is used by a separate community. I regard the proposal to give a name 
like VARIANT-M1 as confusing, as such names have no basis in reality.

I also have some concerns from a security point of view, as the proposal 
makes variation selectors valid for Latin characters for the first time. The 
selectors which produce a default behaviour or make one character look like 
another already encoded seem unneeded and introduce yet more clones of common 
characters. 

I also have concerns about the proposal to give the non-ideographic variants 
names like VARIANT-1. Surely it is possible to give them descriptive names 
which would make it easier to understand what is meant? It is not as if we will 
have thousands of these.

Some parts of the proposal have merit, but I would urge the UTC to hold a 
public consultation on the matter to allow more time for feedback to be 
gathered.

Tim Partridge
 
 
 




Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-07 Thread William_J_G Overington
Thank you for replying.
 
On Friday 6 August 2010, Asmus Freytag asm...@ix.netcom.com wrote:
 
 What you mean are artistic or stylistic variants.
 
 These have certain problems, see here for an explanation:
 http://www.unicode.org/forum/viewtopic.php?p=221#p221
 
 A./
     
 
I have read and reread the forum post to which you refer.
 
I cannot understand from that text, or otherwise at the time of writing this 
reply, why it would not be possible to have an alternate ending glyph for a 
letter e accessible from plain text using an advanced font technology font (for 
example, an OpenType font) using the two character sequence U+0065 U+FE0F.
 
The specific design of an alternate ending e glyph would vary from font to 
font, yet that it is an alternate ending e would be clear: the encoding U+0065 
U+FE0F would allow the intention that an alternate ending glyph for a letter e 
is requested to be carried within a plain text document.
 
I accept that I might be missing something here. If so I would be happy to 
learn: at the moment, however, it still seems to me to be a good idea for an 
encoding.
 
William Overington
 
7 August 2010





Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-07 Thread William_J_G Overington
Thank you for replying.
 
On Friday 6 August 2010, John H. Jenkins jenk...@apple.com wrote:
 
 This is another case of a solution in search of a problem.
 
No, the problem is that one cannot at present, as far as I know, access 
alternate glyphs of an advanced format font from a plain text file.
 
 It isn't Unicode's business to advance typography, and in any event, 
 typesetting plain text isn't the path to good typography.
 
Those are interesting claims.
 
I hope that if Unicode can advance typography by providing a facility such as I 
am suggesting that it would be pleased to do so.
 
 Other technologies, such as OpenType, AAT, and Graphite, *do* have the job of 
 making good typography easy and accessible.
 
Fonts are an important part of the whole process.
 
 And, mirabile dictu, they can already do what you are suggesting here for 
 plain text.
 
I am unaware of how an application program using an OpenType font can be made 
to display alternate glyphs requested from a plain text file. Can it be done?
 
 Unicode's responsibility is to deal with existing needs.
 
Well, for me it is a need to be able to request the display of an alternate 
glyph of an advanced format font from a plain text file.
 
 If it is common for poets to use various letter shapes at the end of words to 
 convey some semantic meaning, and if they do this in their emails or tweets, 
 or if they're complaining that this is something that they want to do but 
 can't, then Unicode and plain text provide a proper way to help them.
 
Alas, a paradox. If the facility becomes available, they might well use it. 
Yet, unlike a ROASTED SWEET POTATO glyph becoming available on some mobile 
telephones then later becoming encoded in Unicode because it was available on 
some mobile telephones, it is not, as far as I am presently aware, possible for 
that to happen in relation to requesting an alternate ending glyph for a letter 
e from a plain text file whilst still producing an ordinary e if that request 
cannot be fulfilled by the particular font being used.  
 
Fonts themselves are used to convey semantic meaning. I am unsure of quite how 
it all works, yet it seems to work partly by association with cultural 
knowledge of where fonts or handwriting or signwriting of that type have been 
used previously and partly with design aspects of the font, such as angularity 
or smoothness or ornateness and perhaps other factors as well.
 
William Overington
 
7 August 2010
 





Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-07 Thread Doug Ewell
William_J_G Overington wjgo underscore 10009 at btinternet dot com 
wrote:


I cannot understand from that text, or otherwise at the time of 
writing this reply, why it would not be possible to have an alternate 
ending glyph for a letter e accessible from plain text using an 
advanced font technology font (for example, an OpenType font) using 
the two character sequence U+0065 U+FE0F.


The specific design of an alternate ending e glyph would vary from 
font to font, yet that it is an alternate ending e would be clear: the 
encoding U+0065 U+FE0F would allow the intention that an alternate 
ending glyph for a letter e is requested to be carried within a plain 
text document.


I think the alternate ending glyph is supposed to be specified in more 
detail than that.  The example Asmus gave was U+222A UNION with serifs. 
Even though the exact proportions of the serifs may differ from one font 
to the next, this is still a relatively precise and constrained 
definition, unlike Latin small letter e with some 'alternate ending' 
which is completely up to the discretion of the font designer.
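Asmus's example can be poked at directly; as I recall, <U+222A, U+FE00> is listed among the standardized variants ("UNION with serifs"), which is what makes it constrained in the way described above. A small editorial sketch in Python:

```python
import unicodedata

# Asmus's example: U+222A UNION followed by VARIATION SELECTOR-1.
# Either way it is rendered, the sequence is two code points in memory.
seq = "\u222A\uFE00"
print(unicodedata.name(seq[0]))  # UNION
print(unicodedata.name(seq[1]))  # VARIATION SELECTOR-1
```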


Because of stylistic differences among calligraphers—this is a 
calligraphy question, not a poetry question—it is hard to imagine how 
this aspect of the proposal would not result in an unbounded number of 
glyphic variations.  'e' is not the only letter to which calligraphers 
like to attach special endings, and a swash cross-stroke is not the only 
special ending that calligraphers like to attach to 'e'.


I'd like to see an FAQ page on What is Plain Text? written primarily 
by UTC officers.  That might go a long way toward resolving the 
differences between William's interpretation of what plain text is, 
which people like me think is too broad, and mine, which some people 
have said is too narrow.


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s ­




Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-07 Thread verdy_p
Michael Everson 
 On 6 Aug 2010, at 22:20, Karl Pentzlin wrote:
 
  Am Dienstag, 3. August 2010 um 09:45 schrieb Michael Everson:
  
  ME ... In particular the implications
  ME for Serbian orthography would be most unwelcome.
  
  As I have outlined in the revised introduction of my proposal,
  there are *no* implications for Serbian orthography.
  Admittedly, this was a little bit implicit in my first draft.
 
 Yeah, well, I am not convinced of the merits of your proposal. Sorry.

I am not convinced either, because all this proposal is supposed to solve is 
to allow an automated change of 
orthography, so that SOME long s in old documents using Fraktur style will 
become round s in some other intermediate 
style (like Antiqua), and then all of them will become round s later.

It's a matter of orthographic adaptation, i.e. modernization of old texts. But 
any modernization of old 
orthographies implies more than just changing some glyphs. For example, the 
modernization of medieval French texts 
requires knowing when a text was written (to correctly infer its semantics), then 
knowing for which period of time the 
modernized version was created, and then knowing what other orthographic 
changes were necessary, such as 
substituting circumflexes for s (long or round), or changing tildes into 
circumflexes or newer (distinguished) 
modern accents, or dropping some other letters.

Unicode is not made to adapt to orthographic changes. My opinion is that it 
just has to encode the orthography AS 
IT IS, ignoring all possible adaptations due to modernizations (and 
evolutions of the written language).

In other words, the existing long s and common round s are just enough to 
preserve the original orthography and 
its semantics, as they were in the original text (even if it was ambiguous or 
incoherent). The variation selectors 
are not intended to convey the additional semantics needed for adaptations to 
newer orthographies, but ONLY the 
additional semantics that existed in a written language at the time when it was 
effectively written.

Text modernizers will really need something else, notably lexical and 
grammatical analysis (with human 
supervision), and they are completely out of the scope of Unicode and ISO 10646. 
These will work by effectively 
correcting the text, i.e. changing its original orthography and semantics. This 
process will be mostly like many 
transliteration schemes or like all translation processes: the resulting text 
is obviously different and intended 
for different readers.

The only case where we really need variation selectors is when we can 
demonstrate that there are opposable pairs 
where a glyphic variant (within a unified abstract character) in the SAME text 
by the SAME author conveys a distinct 
semantic. For everything else, variation selectors should not be used at all, 
and an encoded round s will still 
mean the same, even if it's rendered with a Fraktur font or a Bodoni- or 
Antiqua-like font.

Philippe.



Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-07 Thread Doug Ewell

verdy_p verdy underscore p at wanadoo dot fr wrote:

I am not convinced either, because all this proposal is supposed to 
solve is to allow an automated change of orthography so that SOME long 
s in old documents using Fraktur style will become round s in some 
other intermediate style (like Antiqua) and then all of them will 
become round s later.


You missed some e-mails.  The long s/round s sequences are gone from the 
latest proposal.


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s






Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-06 Thread William_J_G Overington
On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote:
 
  I am thinking of where a poet might specify an ending version of a glyph at 
  the end of the last word on some lines, yet not on others, for poetic 
  effect. I think that it would be good if one could specify that in plain 
  text.
 
 Why can't a poet find a poetic means of doing that, instead of depending on a 
 standards organization to provide a standard means of doing so in plain text? 
 Seems kind of anti-poetic to me. ;-)
 
 --Ken
 
Well, I was just suggesting an example. I am not an expert on poetry.
 
It would not be a matter of a poet depending on a standards organization, it 
would be a matter of a standards organization noting that adding alternate 
glyphs to fonts is a modern trend and doing what it can to facilitate access to 
those alternate glyphs from plain text in a standardized way.
 
For example, suppose that an alternate ending glyph for a letter e is desired 
at the end of a line of a poem. I am thinking that U+0065 U+FE0F could be used 
to do that.
 
It seems to me that, as U+0065 U+FE0F is presently unused and there are 
also other variation selectors not used with U+0065, it would do no harm 
and would be useful for U+0065 U+FE0F to be officially standardized as 
requesting an alternate ending glyph for the letter e, falling back to the 
ordinary glyph of U+0065 if an alternate ending glyph of the letter e is not 
available within the font.
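As an illustration only (this is a discussion of a suggestion, not a standardized sequence), such a pair can be built and inspected in a few lines of Python. Since U+0065 U+FE0F is not a defined variation sequence, the fallback shown is what a process that does not recognize the pair would effectively do: ignore the selector and keep the plain e.

```python
# Illustrative sketch only: e + VS16 is NOT a standardized variation
# sequence; this merely shows how such a pair would be built and how
# a process that does not recognize it can fall back to a plain 'e'.
BASE = "\u0065"   # LATIN SMALL LETTER E
VS16 = "\uFE0F"   # VARIATION SELECTOR-16

seq = BASE + VS16
print([f"U+{ord(c):04X}" for c in seq])  # ['U+0065', 'U+FE0F']

# Stripping an unrecognized variation selector leaves the plain text intact:
stripped = seq.replace(VS16, "")
assert stripped == "e"
```

This also illustrates why the scheme is claimed to be harmless to plain text: removing the selector recovers the original letter unchanged.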
 
The standards organizations have a great opportunity to advance typography by 
defining some of the Latin letter plus variation selector pairs so that 
alternate glyphs within a font may be accessed directly from plain text.
  
William Overington
 
6 August 2010
 






Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-06 Thread Martin J. Dürst



On 2010/08/05 2:56, Asmus Freytag wrote:

On 8/2/2010 5:04 PM, Karl Pentzlin wrote:

I have compiled a draft proposal:
Proposal to add Variation Sequences for Latin and Cyrillic letters
The draft can be downloaded at:
http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
The final proposal is intended to be submitted for the next UTC
starting next Monday (August 9).

Any comments are welcome.

- Karl Pentzlin


This is an interesting proposal to deal with the glyph selection problem
caused by the unification process inherent in character encoding.

When Unicode was first contemplated, the web did not exist and the
expectation was that it would nearly always be possible to specify the
font to be used for a given text and that selecting a font would give
the correct glyph.


The Web may finally get to solve this problem, although it may still 
take some time to be fully deployed. Please see http://www.w3.org/Fonts/ 
for more details and pointers.


Regards,Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-06 Thread Asmus Freytag

On 8/6/2010 2:03 AM, William_J_G Overington wrote:

On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote:
 
  

I am thinking of where a poet might specify an ending version of a glyph at the 
end of the last word on some lines, yet not on others, for poetic effect. I 
think that it would be good if one could specify that in plain text.
  
 
  

Why can't a poet find a poetic means of doing that, instead of depending on a 
standards organization to provide a standard means of doing so in plain text? 
Seems kind of anti-poetic to me. ;-)

 
  

--Ken

 
Well, I was just suggesting an example. I am not an expert on poetry.
  

What you mean are artistic or stylistic variants.

These have certain problems, see here for an explanation: 
http://www.unicode.org/forum/viewtopic.php?p=221#p221


A./
 
  





Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-06 Thread John H. Jenkins

On Aug 6, 2010, at 3:03 AM, William_J_G Overington wrote:

 The standards organizations have a great opportunity to advance typography by 
 defining some of the Latin letter plus variation selector pairs so that 
 alternate glyphs within a font may be accessed directly from plain text.
 

This is another case of a solution in search of a problem.  It isn't Unicode's 
business to advance typography, and in any event, typesetting plain text isn't 
the path to good typography.  Other technologies, such as OpenType, AAT, and 
Graphite, *do* have the job of making good typography easy and accessible.  
And, mirabile dictu, they can already do what you are suggesting here for plain 
text.  

Unicode's responsibility is to deal with existing needs.  If it is common for 
poets to use various letter shapes at the end of words to convey some semantic 
meaning, and if they do this in their emails or tweets, or if they're 
complaining that this is something that they want to do but can't, then Unicode 
and plain text provide a proper way to help them.  

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-06 Thread Karl Pentzlin
On Tuesday, 3 August 2010 at 02:04, I wrote:

KP I have compiled a draft proposal:
KP Proposal to add Variation Sequences for Latin and Cyrillic letters

In the meantime, I have submitted a final version to the UTC
(L2/10-280), as the UTC starts upcoming Monday (2010-08-09).
For those who do not have access to L2, it is also found at:
http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic.pdf (4.4 MB).

Thank you to all who participated on the discussions on this list.
According to your hints, I have:
· dropped the proposed variants for Latin small letter s
  (addressing Fraktur/Blackletter), as the special aspects of these
  are to be handled in a separate proposal (if such will be done).
· dropped the unspecific variants for Latin small letter a and g,
· rewritten substantial parts of the introduction, to be more concise
  at the points which had raised questions on this list and elsewhere.

- Karl Pentzlin






Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-06 Thread Karl Pentzlin
On Friday, 6 August 2010 at 11:08, Martin J. Dürst wrote:

MJD The Web may finally get to solve this problem, although it may still
MJD take some time to be fully deployed. Please see http://www.w3.org/Fonts/
MJD for more details and pointers.

Variation sequences are a means to support this goal, as they provide
font developers with a standardized and easily understandable mechanism,
which unburdens the font designers as well as the site designers who
decide which font they offer to the intended users of their content.

- Karl Pentzlin





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-06 Thread Karl Pentzlin
On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote:

ME ... In particular the implications
ME for Serbian orthography would be most unwelcome.

As I have outlined in the revised introduction of my proposal,
there are *no* implications for Serbian orthography.
Admittedly, this was a little bit implicit in my first draft.

- Karl Pentzlin







Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-06 Thread Karl Pentzlin
On Thursday, 5 August 2010 at 12:31, William_J_G Overington wrote:

WO Yet what if one wants to use the precomposed g circumflex character?

Searching the text of the Unicode Standard for canonical
equivalence is helpful in this case, for end users as well as for font
designers and for programmers of rendering systems.

- Karl Pentzlin




Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-06 Thread Karl Pentzlin
On Wednesday, 4 August 2010 at 22:44, I wrote:

KP However, in my next version, I will replace the s variants by long s variants:
KP 017F FE00 ... LONG S VARIANT-1 STANDARD FORM
KP  · will be displayed long in any script variants
KP 017F FE01 ... LONG S VARIANT-1 FLEXIBLE FORM (naming provisional)
KP  · will be displayed long in Fraktur, Gaelic, and similar script variants
KP  · will usually be displayed round when used with Roman type
KP This has the advantage that, especially when implicit application of variation
KP sequences is possible, it can be applied to existing data without change.

In the final version of my proposal, I have completely dropped this,
as this subject obviously needs a separate discussion in a separate proposal.

- Karl Pentzlin





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-06 Thread Michael Everson
Yeah, well, I am not convinced of the merits of your proposal. Sorry.


On 6 Aug 2010, at 22:20, Karl Pentzlin wrote:

 On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote:
 
 ME ... In particular the implications
 ME for Serbian orthography would be most unwelcome.
 
 As I have outlined in the revised introduction of my proposal,
 there are *no* implications for Serbian orthography.
 Admittedly, this was a little bit implicit in my first draft.
 
 - Karl Pentzlin
 
 
 
 
 

Michael Everson * http://www.evertype.com/




Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-05 Thread André Szabolcs Szelp
For the standard form you probably don't need to add a variation selector.
The codepoint for long s itself expresses exactly the semantic to represent
this character as long s in ANY type style.

While I'm not convinced of your variation proposal at all (on the contrary),
if you write it, write it properly. :-)

/Sz

2010/8/4 Karl Pentzlin karl-pentz...@acssoft.de

 On Tuesday, 3 August 2010 at 19:11, Janusz S. Bień wrote:

 JSB I see no reason why, if I understand correctly, the long s variant is
 JSB to be limited to Fraktur-like styles.

 The *variant* is applicable to situations where the character is to be
 displayed long when Fraktur-like styles are in effect, while it is to
 be displayed round when modern styles are in effect.

 The plain *character* long s is intended to be displayed long in all
 circumstances.

 However, in my next version, I will replace the s variants by long s
 variants:
 017F FE00 ...LONG S VARIANT-1 STANDARD FORM
 · will be displayed long in any script variants
 017F FE01 ...LONG S VARIANT-1 FLEXIBLE FORM (naming provisionally)
 · will be displayed long in Fraktur, Gaelic, and similar script
 variants
 · will usually be displayed round when used with Roman type
 This has the advantage that, especially when implicit application of variation
 sequences is possible, it can be applied to existing data without change.

 - Karl Pentzlin





-- 
Szelp, André Szabolcs

+43 (650) 79 22 400


Re: Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-05 Thread André Szabolcs Szelp
will decide to reunite their cultural efforts [...] and increasing their
mutual cultural exchanges instead of wasting them for old nationalist
reasons

You're either an utter optimist, or you have really no idea of Eastern
European history, culture and spirit. :-)

I doubt your described scenario will come true in our lifetimes.

/Sz

On Wed, Aug 4, 2010 at 11:10 PM, verdy_p verd...@wanadoo.fr wrote:

 Doug Ewell  wrote:
  There is no formal model in the sense of a standard N-letter subtag
  for dialects, because the concept of a dialect is too open-ended and
  unsystematic. The word means different things to different people.
  What may be a dialect to one person might be a full-blown National
  Language to another, or just a funny accent to a third.

 The formal model already exists in ISO 639, which has decided to unify all
 dialectal variants under the same language
 code. Yes, the concept is fuzzy, but as long as ISO 639 does not contain a
 formal model for how the various languages
 are grouped into families and subfamilies, it will be impossible to use
 dialectal variant specifiers with accurate
 fallbacks without using subtags for the language variants.

 One known problem is, for example, Norman, which ISO 639 still considers a
 dialect of French, even though it is just
 ANOTHER Oïl language (from which Standard French emerged by merging,
 modifying and extending several dialects).

 But Jersiais is now a language with official status in Jersey, which is clearly
 part of the Norman family. And it still
 needs to be distinguished from French. Still, there is no ISO 639 code for
 Norman (as a family or as the residual
 language in continental Normandy in France), and no code for Jersiais
 either. And French is considered in ISO 639
 an isolated language, not a macrolanguage. So it allows no
 further precision.

 If something is added, it can only be a variant for the dialectal
 difference, such as fr-norman for the Norman
 family, or fr-jersiais for Jersiais, unless Jersiais gets its own ISO
 639-3 code as an isolated language (leaving
 the continental Norman still as a dialectal variant of French).

 The formal definition of languages is the definition of ISO 639-3
 isolated languages. Everything below is
 dialectal (and ISO 639 has clearly stated that it planned for much later a
 comprehensive encoding of dialectal
 differences, most probably by defining a standard list of variant codes,
 even if these dialects may qualify as
 languages for some users)

 

 It's remarkable that for most linguists, Serbian, Croatian, and Bosnian
 are only one language, with only dialectal
 differences (in the spoken language and with some grammatical derivations,
 and some minor lexical differences that
 are understood by all Serbo-Croatian speakers), orthographic differences
 (mostly based on their default script, even
 if Serbian still uses the two scripts but it defines a strict
 transliteration system that helps defining a unified
 orthography for both scripts, orthographies that are simplified in Croatian
 and Bosnian).

 So yes, the concept of dialects vs. language is fuzzy for linguists and
 users (and nationals that prefer to see
 their dialect named from their country as a full language instead of a
 dialect), but ISO 639 defines a formal model
 by its technical encoding: if there's an authority defending the position
 of a distinct language and defining an
 official lexique and orthography, it becomes a de facto language for ISO
 639.

 Such splitting of languages along their dialectal differences, promoted to
 isolated languages, has occurred and was endorsed
 by ISO 639, even if it was probably not in the interest of these countries
 to split their common language and to
 reduce its audience and cultural influence in other parts of the world (and
 for many of their own citizens, they
 won't care a lot about these formal official differences, as long as they
 understand it and can read and write it in
 a script that they can decipher it without difficulties, only because they
 will constantly live near other peoples
 sharing the same language but under a different name).

 Serbian is still perceived and encoded as a single language, despite the
 fact that it still uses two scripts, depending on the
 region of use (but it is now rapidly converging to the Latin script). May
 be the linguistic and cultural authorities
 of the four concerned countries (or five, now with Kosovo whose
 independence was recently validated by an
 international court?) will decide to reunite their cultural efforts, if
 they finally all use the same Latin script,
 by adopting a new neutral name (Dolmoslavic, Adriatic, Adrislavic ? Or even
 Yugoslavic ?) and increasing their
 mutual cultural exchanges instead of wasting them for old nationalist
 reasons (this will be even more important when
 they will finally ALL join the European Union with increased exchanges
 between them).

 Philippe.




Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-05 Thread William_J_G Overington
Thank you for your reply.
 
On Wednesday 4 August 2010, Karl Pentzlin karl-pentz...@acssoft.de wrote:
 
 WO Why is it not possible specifically to request a one-storey form of 
 lowercase letter a?
 
 I did not do this, as I do not know of a cultural context where the two-storey form 
 is to be suppressed to prevent an a from being mistaken for any letter too 
 similar to a two-storey a.
 
Well, I was intending this as a straightforward way to access glyph alternates.
 
Noticing that you mentioned cultural context, I have now remembered a situation 
that might perhaps be of interest.
 
It was in a thread about fonts for teaching children in the United Kingdom how 
to read and write.
 
http://forum.high-logic.com/viewtopic.php?f=10t=296
 
 WO What happens in relation to a character such as g circumflex? Would one 
 be able to access a glyph alternate for g circumflex?
 
 The variant selector can be followed by any diacritic which then is applied 
 to the base character.
 
Yet what if one wants to use the precomposed g circumflex character?
 
 WO Could there be variants for lowercase e, ...
 
 I have found none, which of course is no proof of
 non-existence,
 
 WO for a horizontal line glyph design, and for an
 angled line,
 
 Not according to the principles outlined in my proposal,
 
 WO  Venetian-style font, glyph design please?
 
 No.
 
I was looking for a way to access a glyph alternate for typography, not for any 
cultural meaning. Maybe one might choose to use an e with an angled line in the 
words Venice and Venetian, for subtle effect in the typography. I find that 
adding alternate glyphs to fonts is a modern trend. There seems to be no current way 
to access them from plain text.
 
 WO Would it be possible to define U+FE0F VARIATION SELECTOR-16 to indicate 
 an end of word alternate glyph for each lowercase Latin character?
 
 No. Even if you find a cultural context where such things are required, such 
 things are positional variants which are to be handled by the proven 
 mechanisms developed for scripts like Arabic.
 
I am thinking of where a poet might specify an ending version of a glyph at the 
end of the last word on some lines, yet not on others, for poetic effect. I 
think that it would be good if one could specify that in plain text.
  
William Overington
 
5 August 2010
 





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-05 Thread William_J_G Overington
On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote:
 
 However, there's no need to add variation sequences to
 select an *ambiguous* form. Those sequences should be
 removed from the proposal.
 
Are you here talking about such things as alternate glyph styles?
 
It depends what one means by need.
 
Adding alternate glyphs to a font is a trend in modern font design.
 
One approach is to use Private Use Area mappings, which can be used to produce 
stylish hardcopy printouts and stylish graphics for the web, yet there are the 
well-known problems of spell-checking and so on if Private Use Area mappings 
are used for much more than those application areas.
 
The other approach is to use an alternate glyph model, where the underlying 
plain text is conserved. However, this, today, often means using expensive 
software packages with a proprietary file format in order to store the 
information about which glyph to use in each case.
 
I remember those advertisements that CNN used to run promoting the concept of 
advertising. Advertising - your right to choose. One of the advertisements 
distinguished between what people need and what people want.
 
So, maybe people do not need to use alternate glyphs in typography, yet maybe 
they want to do so, maybe they enjoy doing so.
 
I feel that it is entirely reasonable that Unicode and ISO 10646 encode things 
that help people do what they want to do and what they enjoy doing as well as 
what they need to do.
 
William Overington
 
5 August 2010





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-05 Thread Asmus Freytag

On 8/5/2010 3:47 AM, William_J_G Overington wrote:

On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote:
 
  

However, there's no need to add variation sequences to
select an *ambiguous* form. Those sequences should be
removed from the proposal.

 
Are you here talking about such things as alternate glyph styles?
  
No, I am referring to the element of the proposal that proposes to have 
a variation sequence that selects the unspecified form for lower case a.
 
It depends what one means by need.
  
I've written a longer answer here: 
http://www.unicode.org/forum/viewtopic.php?f=9t=83start=0


A./
 
  





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-05 Thread Kenneth Whistler

 I am thinking of where a poet might specify an ending version 
 of a glyph at the end of the last word on some lines, yet not 
 on others, for poetic effect. I think that it would be good 
 if one could specify that in plain text.

Why can't a poet find a poetic means of doing that, instead of
depending on a standards organization to provide a standard
means of doing so in plain text? Seems kind of anti-poetic to me. ;-)

--Ken




Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread William_J_G Overington
On Tuesday 3 August 2010, Karl Pentzlin karl-pentz...@acssoft.de wrote:
 
 Any comments are welcome.
 
Firstly, thank you for making the document available.
 
I have made a few comments regarding matters that I noticed.
 
Please know that, whilst I comment on various matters, I am enthusiastic for 
the general thrust of your suggestion regarding access to alternate glyphs for 
Latin characters using Variation Selectors. This could produce a renaissance 
for typography.
 
In the document, on page 2, there is the following.
 
quote
 
But while the general mechanisms for doing so are standardized (i.e. OpenType 
features), the concrete selection of a specific glyph is not.
 
end quote
 
It is important that the Unicode specification does not regard any particular 
font technology as being the standard font technology.
 
This issue was discussed in this mailing list some years ago.
 
http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0106.html
 
The last two paragraphs of the following post put that post in context.
 
http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0095.html
 
Why is it not possible specifically to request a one-storey form of lowercase 
letter a?
 
It seems to me that being able to request either a one-storey form or a 
two-storey form of lowercase letter a would be better.
 
In relation to lowercase g, would it be better to be able to request any one of 
open descender, closed loop descender and unclosed loop descender?
 
For example, the lowercase letters g in the fonts Arial, Times New Roman and 
Trebuchet MS show the three types.
 
What happens in relation to a character such as g circumflex? Would one be able 
to access a glyph alternate for g circumflex?
 
Could there be variants for lowercase e, for a horizontal line glyph design and 
for an angled line, Venetian-style font, glyph design please?
 
Would it be possible to define U+FE0F VARIATION SELECTOR-16 to indicate an end 
of word alternate glyph for each lowercase Latin character? Certainly, some 
usages would be more likely than others, with d, e, h, m, n, t, z being more 
likely to have an end of word alternate glyph than would some other letters, 
yet a general usage for all Latin characters would, in my opinion, be good.
 
William Overington
 
4 August 2010






Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))

2010-08-04 Thread William_J_G Overington
On Tuesday, 3/8/10, Janusz S. Bień jsb...@mimuw.edu.pl wrote:
 
 I see no reason why, if I understand correctly, the long s
 variant is to be limited to Fraktur-like styles.
 
Long s was used with ordinary Roman type in England for English text in at 
least part of the 17th and 18th centuries.
 
How could one express the following please using variation selectors and the 
Zero Width Joiner ZWJ in relation to the two character sequence sh?
 
If you have a long s available, please use it, otherwise please use an ordinary 
s: furthermore, if you have a long s h ligature available please use that 
instead.
 
How could one express the following please using variation selectors and the 
Zero Width Joiner ZWJ in relation to the three character sequence ssi?
 
If you have a long s available, please use it, otherwise please use an ordinary 
s: furthermore, if you have a long s long s i ligature available please use 
that instead.
 
William Overington
 
4 August 2010









Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))

2010-08-04 Thread Andrew West
On 4 August 2010 09:19, William_J_G Overington
wjgo_10...@btinternet.com wrote:

Answering the two questions below on the assumption that s-VS1 (0073
FE00) were to be defined as a variation sequence for long s in all
type styles, and without giving any opinion on the merits or otherwise
of Karl's proposal in general, or specifically the merits of
double-encoding long s as a variation sequence.

 How could one express the following please using variation selectors and the 
 Zero Width Joiner ZWJ in relation to the two character sequence sh?

 If you have a long s available, please use it, otherwise please use an 
 ordinary s: furthermore, if you have a long s h ligature available please use 
 that instead.

s-VS1-ZWJ-h

Note that there must be no character between a variation selector and
the base character it applies to, so the ZWJ must go after VS1.
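That ordering rule can be sanity-checked mechanically. The sketch below (the helper name `vs_placement_ok` is our own, not from any library) accepts a string only when every VS1 immediately follows a base character, never a ZWJ or another selector:

```python
# Sketch of the ordering rule above: a variation selector must
# immediately follow its base character, so it may never appear
# first in the string or right after a ZWJ or another selector.
VS1, ZWJ = "\uFE00", "\u200D"

def vs_placement_ok(text: str) -> bool:
    for i, ch in enumerate(text):
        if ch == VS1 and (i == 0 or text[i - 1] in (VS1, ZWJ)):
            return False
    return True

assert vs_placement_ok("s" + VS1 + ZWJ + "h")      # s-VS1-ZWJ-h: well formed
assert not vs_placement_ok("s" + ZWJ + VS1 + "h")  # ZWJ before VS1: ill formed
```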

 How could one express the following please using variation selectors and the 
 Zero Width Joiner ZWJ in relation to the three character sequence ssi?

 If you have a long s available, please use it, otherwise please use an 
 ordinary s: furthermore, if you have a long s long s i ligature available 
 please use that instead.

The use of long s versus short s and ligaturing of these letters
varies widely geographically and historically and depending upon
typeface. The following examples would all be valid *if* s-VS1 were to
be defined as a variation sequence for long s (in all type styles):

s-VS1-ZWJ-s-VS1-ZWJ-i -- for a ligatured ſſi as in miſſion (usual in
18th century English typography)
s-VS1-s-i -- for a non-ligatured ſsi as in illuſtriſsimos (usual in
18th century Spanish typography)
s-VS1-ZWJ-s-i -- for a ligatured ſs plus i as in bleſsings (usual
for italics only in 16th and early 17th century English and French
typography)
s-s-VS1-ZWJ-i -- for s plus a ligatured ſi as in utilisſima
(sometimes in 16th century Italian typography)
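Under the same assumption (the hypothetical s+VS1, i.e. U+0073 U+FE00, standing for long s; none of this is standardized), the four spellings above can be written out as code point strings, e.g. in Python:

```python
# Hypothetical encodings of the four examples above, assuming s+VS1
# (U+0073 U+FE00) were defined as a long-s variation sequence.
S, VS1, ZWJ, I = "s", "\uFE00", "\u200D", "i"

ligated_ssi   = S + VS1 + ZWJ + S + VS1 + ZWJ + I  # ligatured ssi (18c English)
unligated_ssi = S + VS1 + S + I                    # non-ligatured ssi (18c Spanish)
ligated_ss_i  = S + VS1 + ZWJ + S + I              # ligatured ss plus i (16c-17c italics)
s_ligated_si  = S + S + VS1 + ZWJ + I              # s plus ligatured si (16c Italian)

for name, t in [("ligated_ssi", ligated_ssi), ("unligated_ssi", unligated_ssi)]:
    print(name, " ".join(f"U+{ord(c):04X}" for c in t))
```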

Andrew




Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread Andreas Stötzner


On 03.08.2010 at 02:47, David Starner wrote:


Fraktur and Antiqua are different writing
systems with slightly different orthographies


No. Fraktur and Antiqua are two (of many) different renderings of the 
Latin writing system.


Regards,
A. Stötzner.




Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))

2010-08-04 Thread Leonardo Boiko
On Wed, Aug 4, 2010 at 05:19, William_J_G Overington
 Long s was used with ordinary Roman type in England for English text in at 
 least part of the 17th and 18th centuries.

More on that by babelstone:
http://babelstone.blogspot.com/2006/06/rules-for-long-s.html

(Sorry for the duplicate email William, my mistake.)

-- 
Leonardo Boiko



Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))

2010-08-04 Thread verdy_p
In my opinion, adding the s+VS1 variation sequence is completely unneeded. If 
you really want a long s, use the code 
assigned to the long s. Fonts or renderers should still provide a reasonable 
fallback to s if the glyph is missing.

This means that all existing ligatures with long s will continue to be encoded 
with long s and ZWJ. The 
s+VS1 proposal is an attempt to disunify the long s, when it is NOT needed 
at all.

The only convenient variation sequence would be to add S+VS1 for the capital
(because long s has no capital), only to
preserve the long s semantics when converting to uppercase or titlecase; in
that case the mapping of S+VS1 to
lowercase would again give the standard long s.
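
The case-mapping loss alluded to here is easy to demonstrate with standard
Unicode case mappings; a quick sketch (the S+VS1 sequence itself is only a
suggestion in this message, not an encoded sequence):

```python
# Uppercasing long s loses the long/round distinction: Unicode's case
# mapping takes U+017F LATIN SMALL LETTER LONG S to plain "S", and
# lowercasing "S" gives ordinary "s", not long s.
LONG_S = "\u017F"

assert LONG_S.upper() == "S"
assert "S".lower() == "s"          # the round trip yields round s
assert "S".lower() != LONG_S       # ...so the long-s semantic is gone
```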



Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread John W Kennedy

On Aug 4, 2010, at 8:20 AM, Andreas Stötzner wrote:

 
 Am 03.08.2010 um 02:47 schrieb David Starner:
 
 Fraktur and Antiqua are different writing
 systems with slightly different orthographies
 
 No. Fraktur and Antiqua are two (of many) different renderings of the Latin 
 writing system.

The two propositions are not mutually exclusive. And it /is/ true that, at 
least at some times, Fraktur and Antiqua have had different orthographies.

-- 
John W Kennedy
There are those who argue that everything breaks even in this old dump of a 
world of ours. I suppose these ginks who argue that way hold that because the 
rich man gets ice in the summer and the poor man gets it in the winter things 
are breaking even for both. Maybe so, but I'll swear I can't see it that way.
  -- The last words of Bat Masterson







Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread Asmus Freytag

On 8/2/2010 5:04 PM, Karl Pentzlin wrote:

I have compiled a draft proposal:
Proposal to add Variation Sequences for Latin and Cyrillic letters
The draft can be downloaded at:
 http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
The final proposal is intended to be submitted for the next UTC
starting next Monday (August 9).

Any comments are welcome.

- Karl Pentzlin

  
This is an interesting proposal to deal with the glyph selection problem 
caused by the unification process inherent in character encoding.


When Unicode was first contemplated, the web did not exist and the 
expectation was that it would nearly always be possible to specify the 
font to be used for a given text and that selecting a font would give 
the correct glyph.


As the proposal noted, universal fonts and viewing documents on other 
platforms and systems across the web have made this solution 
unattractive for general texts.


We are left then with these five scenarios

1) Free variation
2) Orthographic variation of isolated characters (by language, e.g. 
different capitals)
3) Orthographic variation of entire texts (e.g. italic Cyrillic forms, 
by language)

4) Orthographic variation by type style (e.g. Fraktur conventions)
5) Notational conventions (e.g. IPA)

For free variation of a glyph, the only possible solutions are either 
font selection or use of a variation sequence. I concur with Karl, that 
in this case, where notable variations have been unified, that adding 
variation selectors is a much more viable means of controlling authorial 
intent than font selection.


If text is language tagged, then OpenType mechanisms exist in principle
to handle scenarios 2 and 3. For full texts in a certain language, using
variation selectors throughout is unappealing as a solution.


However, it may be a viable solution for being able to embed correctly 
rendered citations in other text, given that language tagging can be 
separated from the document and that automatic language tagging may 
detect large chunks of text, but not short runs.


The Fraktur problem is one where one typestyle requires additional 
information (e.g. when to select long s) that is not required for 
rendering the same text in another typestyle. If it is indeed desirable 
(and possible) to create a correctly encoded string that can be rendered 
without further change automatically in both typestyles, then adding any 
necessary variation sequences to ensure that ability might be useful. 
However, that needs to be addressed in the context of a precise 
specification of how to encode texts so that they are dual renderable. 
Only addressing some isolated variation sequences makes no sense.


Notational conventions are addressed in Unicode by duplicate encoding 
(IPA) or by variation sequences. The scheme has holes, in that it is not 
possible in a few cases to select one of the variants explicitly, 
instead, the ambiguous form has to be used, in the hope that a font is 
used that will have the proper variant in place for the ambiguous form.


Adding a few variation sequences (like the one to allow the a at 0061 
to be the two story one needed for IPA) would fill the gap for times 
when controlling the precise display font is not available.


However, there's no need to add variation sequences to select an 
*ambiguous* form. Those sequences should be removed from the proposal.


Overall a valuable starting point for a necessary discussion.

A./



Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread verdy_p
John W Kennedy wrote:
 On Aug 4, 2010, at 8:20 AM, Andreas Stötzner wrote:
  Am 03.08.2010 um 02:47 schrieb David Starner:
  Fraktur and Antiqua are different writing
  systems with slightly different orthographies
  
  No. Fraktur and Antiqua are two (of many) different renderings of the Latin 
  writing system.
 
 The two propositions are not mutually exclusive. And it /is/ true that, at 
 least at some times, Fraktur and 
Antiqua have had different orthographies.

And it is probably the main reason for the inclusion of Latf in ISO 15924: not
just because it is a script variant,
but really because it defines a distinct orthography, which should be
specifiable in BCP 47 language tags.

I think you could apply the same rationale on Hans and Hant as well (not 
really a different script for the UCS, 
but distinct orthographies.) 

Really, Hans, Hant, Latf, Latg could have been avoided in ISO 15924, if 
orthographic variants of the same 
languages had been encoded in the IANA database for BCP 47, independently of 
the effective font style.

But for now there's still no formal model for encoding language dialects, so 
BCP 47 language tags still need to use 
tags for ISO 3166-1 region codes and for the script variant, when it should 
just qualify the generic script code (or 
it could even drop this ISO 15924 code if there was a formal code for the 
dialect written in a specific orthography: 
we would also deprecate Jpan, Hrkt in ISO 15924).

Orthographic variants would include also:
- the various romanization systems (for example Pinyin) and phonetic 
transcriptions (IPA phonetic, simplified IPA 
phonology),
- the simplified orthographies (e.g. orthographic reforms in French and German),
- and some other minor variants (like the vertical presentation for East-Asian 
scripts, or Boustrophedon 
presentation for Ancient Greek, if this alters the orientation of characters 
that had to be encoded differently, and 
the default mirroring properties are not applicable to the encoded characters 
in the basic language).

For now these dialectal/orthographic variants of written languages can be 
registered in the IANA database for BCP 
47, using codes with at least 5 letters (or with at least 4 letters or digits 
if there's at least one digit), but 
ideally the dialectal variant should be encoded as a tag BEFORE the 
orthographic variant.

The font style prefered for each orthographic variant is still left to the 
rendering system that will apply 
stylesheets according to the language tag. It should not be invalid to use a 
fallback style that will ignore the 
orthographic variants for which there's no font support or no support in the 
font rendering system or page layout 
system.

Philippe.




Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-04 Thread Doug Ewell
verdy_p verdy underscore p at wanadoo dot fr wrote:

 Really, Hans, Hant, Latf, Latg could have been avoided in ISO 15924, 
 if orthographic variants of the same 
 languages had been encoded in the IANA database for BCP 47, independently of 
 the effective font style.

Actually it was the opposite; the ability to use standardized ISO 15924
code elements to express concepts like Simplified Han was one of the
driving forces behind RFC 4646 and its shift in focus from whole tags to
subtags.

In any case, the bibliographers and others who use ISO 15924 but not BCP
47 might need to make these distinctions as well.

 But for now there's still no formal model for encoding language dialects, so 
 BCP 47 language tags still need to use 
 tags for ISO 3166-1 region codes and for the script variant, when it should 
 just qualify the generic script code (or 
 it could even drop this ISO 15924 code if there was a formal code for the 
 dialect written in a specific orthography: 
 we would also deprecate Jpan, Hrkt in ISO 15924).

There is no formal model in the sense of a standard N-letter subtag
for dialects, because the concept of a dialect is too open-ended and
unsystematic.  The word means different things to different people. 
What may be a dialect to one person might be a full-blown National
Language to another, or just a funny accent to a third.

BCP 47 tags never *need* to use either the region subtag or the script
subtag, unless they are necessary to convey the intended meaning.  A tag
like ja-Jpan-JP is almost never needed, because almost all written
Japanese is using the Japanese writing system ('Jpan') and as used in
Japan ('JP').

I'm not sure what dialect is being posited here that would make the
difference between having to specify a script subtag and not having to.

 Orthographic variants would include also:
 - the various romanization systems (for example Pinyin) and phonetic 
 transcriptions (IPA phonetic, simplified IPA 
 phonology),

'pinyin', 'fonipa'

 - the simplified orthographies (e.g. orthographic reforms in French and 
 German),

'1606nict', '1694acad', '1901', '1996'

 - and some other minor variants (like the vertical presentation for 
 East-Asian scripts, or Boustrophedon 
 presentation for Ancient Greek, if this alters the orientation of characters 
 that had to be encoded differently, and 
 the default mirroring properties are not applicable to the encoded characters 
 in the basic language).
 
 For now these dialectal/orthographic variants of written languages can be 
 registered in the IANA database for BCP 
 47, using codes with at least 5 letters (or with at least 4 letters or digits 
 if there's at least one digit),

A 4-character variant subtag must *begin* with a digit.

 but 
 ideally the dialectal variant should be encoded as a tag BEFORE the 
 orthographic variant.

Why is this important?
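
For reference, the subtag shapes mentioned above follow directly from the
RFC 5646 (BCP 47) ABNF; a small sketch checking the registered variants cited
in this thread:

```python
import re

# variant = 5*8alphanum / (DIGIT 3alphanum)   -- RFC 5646 (BCP 47) ABNF
VARIANT = re.compile(r"(?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3})")

# All registered variant subtags cited in this thread fit the pattern:
for subtag in ["pinyin", "fonipa", "1606nict", "1694acad", "1901", "1996"]:
    assert VARIANT.fullmatch(subtag)

# A 4-character subtag of letters only is *not* a valid variant,
# because a 4-character variant must begin with a digit:
assert VARIANT.fullmatch("norm") is None
```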

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s






Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread Karl Pentzlin
Am Mittwoch, 4. August 2010 um 00:31 schrieb Christoph Päper:

CP ... than making sure every instance of a letter is
CP accompanied by the appropriate VS?

My proposal contains the idea of implicit application of variation
sequences by higher-level protocols. I will make this clearer in my
next version.

CP How did you decide what to include in your proposal ...

I will make this clearer also in my next version, which will contain a
paragraph characters vs. variants vs. glyphs.

- Karl Pentzlin







Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-04 Thread verdy_p
Asmus Freytag  wrote:
 The Fraktur problem is one where one typestyle requires additional 
 information (e.g. when to select long s) that is not required for 
 rendering the same text in another typestyle. If it is indeed desirable 
 (and possible) to create a correctly encoded string that can be rendered 
 without further change automatically in both typestyles, then adding any 
 necessary variation sequences to ensure that ability might be useful. 
 However, that needs to be addressed in the context of a precise 
 specification of how to encode texts so that they are dual renderable. 
 Only addressing some isolated variation sequences makes no sense.

I don't think so.

If a text was initially using a round s, nothing prohibits it being rendered in 
Fraktur style, but even in this 
case, the conversion to long s will be inappropriate. So use the Fraktur 
round s directly.

If a text in Fraktur absolutely requires the long s, it's only when the 
original text was already using this long 
s. In that case, encode the long s: The text will render with a long s in 
both modern Latin font styles like 
Bodoni (with a possible fallback to modern round s if that font does not have 
a long s), and in classic Fraktur
font styles (here too with a possible fallback to Fraktur round s if the
Fraktur font omits the long s from its
repertoire of supported glyphs).

In other words, you don't need any variation sequence: s+VS1 would be 
strictly encoding the same thing as the 
existing encoded long s. Adding this variation selector would just be 
pollution (an unjustified disunification). 
The two existing characters already clearly state their semantic 
differences, so we should continue to use 
them.

This does not mean that fonts should not continue to be enhanced, and that font 
renderers and text-layout engines 
should not be corrected to support more fallbacks (in fact it will be simpler 
to implement these fallbacks within 
text-renderers, instead of requiring a new font version).

You can apply the same policy to the French narrow no-break space NNBSP 
(aka "fine" in French), which fonts do not 
need to map, provided that the font renderers or text layout engines 
correctly infer its best fallback as 
THIN SPACE, before retrying with the FOUR-PER-EM SPACE or SIX-PER-EM SPACE 
characters, then with a standard SPACE 
with a reduced metric...

That's because fonts never care about line-breaking properties, that are 
implemented only in text layout engines. 
The same should apply as well to NBSP, if a font does not map it (the text 
renderer just has to use the fallback 
to SPACE to find the glyph in the selected font), and to the NON-BREAKING HYPHEN 
(just infer the fallback to the 
standard HYPHEN, then to HYPHEN-MINUS).

In fact, it would be more elegant if Unicode provided a new property file, 
suggesting the best fallbacks (ordered by 
preference) for each character (these fallbacks possibly having their own 
fallbacks that will be retried if all the 
suggested ordered fallbacks are already failing). In most cases, only one 
fallback will be needed (in very few 
cases, several ordered fallbacks should be listed if the implied sub-fallbacks 
are not in the correct order of 
resolution).

It would avoid selecting glyphs from other fallback fonts with very different 
metrics. Some of these fallbacks are 
already listed in the main UCD file, but they are too generic (because the 
compatibility mappings must resolve ONLY 
to non-compatibility decomposable characters). For example NNBSP has a 
compatibility decomposition as 0020, 
just like many other whitespace characters, so it completely loses the width 
information.

If we had standardized fallback resolution sequences implemented in text 
renderers, we would not need to update 
complex fonts, and the job for font designers would be much simpler, and users 
of existing fonts could continue to 
use them, even if new characters are encoded.
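
As an illustration of the ordered-fallback idea, here is a rough sketch. The
fallback table is invented for illustration only (no such Unicode property
file actually exists), and a real engine would also consult further fallback
fonts:

```python
# Rough sketch of the ordered-fallback idea proposed above.  The
# fallback table here is invented for illustration; no such Unicode
# property file actually exists.
FALLBACKS = {
    "\u202F": ["\u2009"],                 # NNBSP -> THIN SPACE
    "\u2009": ["\u2005", "\u2006", " "],  # THIN SPACE -> narrower fixed spaces -> SPACE
    "\u00A0": [" "],                      # NBSP -> SPACE
    "\u017F": ["s"],                      # LONG S -> round s
}

def resolve(ch, cmap, seen=None):
    """Return the first character the font maps, trying ch, then its
    fallbacks (and their fallbacks), depth-first in listed order."""
    if seen is None:
        seen = set()
    if ch in cmap:
        return ch
    for fb in FALLBACKS.get(ch, []):
        if fb not in seen:
            seen.add(fb)
            found = resolve(fb, cmap, seen)
            if found is not None:
                return found
    return None

# A font whose cmap covers only SPACE and round s:
assert resolve("\u202F", {" ", "s"}) == " "   # NNBSP falls back via THIN SPACE
assert resolve("\u017F", {" ", "s"}) == "s"   # long s falls back to round s
```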

I took the example of NNBSP because it is a character that has been encoded 
for a long time now, but vendors are still 
forgetting to provide a glyph mapping for it (for example in core fonts of 
Windows 7 such as the new Segoe UI 
font, even though Microsoft included an explicit mapping for NNBSP in Times New 
Roman). It's one of the frequent 
cases where this can be solved very simply by the text-renderer itself.

The same should be done for providing a correct fallback to round s if ever 
any font does not map the long s.

I also suggest that the lists of standard character fallbacks are scanned 
within the first selected font, without 
trying with other fallback fonts (including multiple font families specified in 
a stylesheet or generic CSS fonts), 
unless the list of fallback characters includes a  specifier in the middle of 
the list that would indicate 
that all the characters (the original or the fallback characters already 
specified before ) should be 
searched (this will be useful mostly for 

Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread Karl Pentzlin
Am Dienstag, 3. August 2010 um 02:47 schrieb David Starner:

DS ... I don't see why
DS unspecific forms should be encoded; if you want a nonspecific a, 0061
DS is the character.

This is because I take into account the implicit application of a
variation sequence on a base character by a higher-level protocol,
which must be overridable in some way.
In the next version of my proposal, I hope to make this clearer;
probably I will also put another name on the unspecific variants.

- Karl Pentzlin





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread Karl Pentzlin
Am Mittwoch, 4. August 2010 um 08:52 schrieb William_J_G Overington:

WO Please know that, whilst I comment on various matters, I am
WO enthusiastic for the general thrust of your suggestion regarding
WO access to alternate glyphs for Latin characters using Variation
WO Selectors. This could produce a renaissance for typography.

Admittedly, I explicitly do not want to introduce glyph encoding into
Unicode through the back door. In the next version of my proposal, you
will find some words about what variation sequences are *not* intended
for.

WO  But while the general mechanisms for doing so are standardized
WO  (i.e. OpenType features), the concrete selection of a specific glyph is 
not.
WO  
WO It is important that the Unicode specification does not regard
WO any particular font technology as being the standard font technology.

This is correct. I mention OpenType only as an example.

WO Why is it not possible specifically to request a one-storey form of 
lowercase letter a?

I did not do this, as I do not know of a cultural context where the
two-storey form has to be suppressed to prevent an a from being
mistaken for any letter too similar to a two-storey a.

WO What happens in relation to a character such as g circumflex?
WO Would one be able to access a glyph alternate for g circumflex?

The variant selector can be followed by any diacritic which then is
applied to the base character.

WO Could there be variants for lowercase e, ...

I have found none, which of course is no proof of non-existence,

WO for a horizontal line glyph design, and for an angled line,

Not according to the principles outlined in my proposal,

WO  Venetian-style font, glyph design please?

No.

WO Would it be possible to define U+FE15 VARIATION SELECTOR-16 to
WO indicate an end of word alternate glyph for each lowercase Latin
WO character?

No. Even if you find a cultural context where such things are required,
such things are positional variants which are to be handled by the
proven mechanisms developed for scripts like Arabic.

- Karl Pentzlin




Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-04 Thread Karl Pentzlin
Am Dienstag, 3. August 2010 um 19:11 schrieb Janusz S. Bień:

JSB I see no reason why, if I understand correctly, the long s variant is
JSB to be limited to Fraktur-like styles.

The *variant* is applicable to situations where the character is to be
displayed long when Fraktur-like styles are in effect, while it is to
be displayed round when modern styles are in effect.

The plain *character* long s is intended to be displayed long in all
circumstances.

However, in my next version, I will replace the s variants by long s 
variants:
017F FE00 ...LONG S VARIANT-1 STANDARD FORM
 · will be displayed long in any script variants
017F FE01 ...LONG S VARIANT-1 FLEXIBLE FORM (naming provisionally)
 · will be displayed long in Fraktur, Gaelic, and similar script variants
 · will usually be displayed round when used with Roman type
This has the advantage, especially when implicit application of variation 
sequences
is possible, it can be applied to existing data without change.
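
In code-point terms, the two draft sequences (from this draft proposal only;
neither was ever standardized) would look like this, and stripping the
selectors shows why existing data needs no change:

```python
# Draft-proposal sketch only: 017F FE00 and 017F FE01 were *proposed*
# here, never standardized.  Stripping the variation selectors leaves
# plain U+017F, which is why existing data can be reused unchanged.
LONG_S, VS1, VS2 = "\u017F", "\uFE00", "\uFE01"

standard_form = LONG_S + VS1  # displayed long in any script variant
flexible_form = LONG_S + VS2  # long in Fraktur/Gaelic, usually round in Roman type

def strip_variation_selectors(text):
    """Drop VS1..VS16 (U+FE00..U+FE0F), as a VS-unaware process would."""
    return "".join(c for c in text if not "\uFE00" <= c <= "\uFE0F")

assert strip_variation_selectors(standard_form) == LONG_S
assert strip_variation_selectors(flexible_form) == LONG_S
```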

- Karl Pentzlin




Re: Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-04 Thread verdy_p
Doug Ewell  wrote:
 There is no formal model in the sense of a standard N-letter subtag
 for dialects, because the concept of a dialect is too open-ended and
 unsystematic. The word means different things to different people. 
 What may be a dialect to one person might be a full-blown National
 Language to another, or just a funny accent to a third.

The formal model already exists in ISO 639, which has decided to unify all 
dialectal variants under the same language 
code. Yes, the concept is fuzzy, but as long as ISO 639 does not contain a 
formal model for how the various languages 
are grouped into families and subfamilies, it will be impossible to use dialectal 
variant specifiers with accurate 
fallbacks without using subtags for the language variants.

One known problem is, for example, Norman, which ISO 639 still considers a 
dialect of French, even though it is just 
ANOTHER Oïl language (from which Standard French emerged by merging, modifying 
and extending several dialects).

But Jersiais is now a language with official status in Jersey, which is clearly part 
of the Norman family. And that still 
needs to be distinguished from French. Still, there's no ISO 639 code for 
Norman (as a family, or as the residual 
language in continental Normandy in France), and no code for Jersiais either. 
And French is considered in ISO 639 
an isolated language, not a macrolanguage. So it allows no further 
precision.

If something is added, it can only be a variant for the dialectal difference, 
such as fr-norman for the Norman 
family, or fr-jersiais for Jersiais, unless Jersiais gets its own ISO 639-3 
code as an isolated language (leaving 
the continental Norman still as a dialectal variant of French).

The formal definition of languages is the definition of ISO 639-3 isolated 
languages. Everything below that is 
dialectal (and ISO 639 has clearly stated that it plans a comprehensive 
encoding of dialectal 
differences for much later, most probably by defining a standard list of variant codes, even 
if these dialects may qualify as 
languages for some users).



It's remarkable that for most linguists, Serbian, Croatian, and Bosnian are 
only one language, with only dialectal 
differences (in the spoken language, with some grammatical derivations and 
some minor lexical differences that 
are understood by all Serbo-Croatian speakers) and orthographic differences 
(mostly based on their default script, even 
if Serbian still uses the two scripts but defines a strict transliteration 
system that helps define a unified 
orthography for both scripts, orthographies that are simplified in Croatian and 
Bosnian).

So yes, the concept of dialects vs. language is fuzzy for linguists and users 
(and nationals that prefer to see 
their dialect named from their country as a full language instead of a 
dialect), but ISO 639 defines a formal model 
by its technical encoding: if there's an authority defending the position of a 
distinct language and defining an 
official lexicon and orthography, it becomes a de facto language for ISO 639.

Such splitting of languages, with their dialectal differences promoted to isolated 
languages, has occurred and was endorsed 
by ISO 639, even if it was probably not in the interest of these countries to 
split their common language and to 
reduce its audience and cultural influence in other parts of the world (and 
many of their own citizens 
won't care much about these formal official differences, as long as they 
understand the language and can read and write it in 
a script that they can decipher without difficulty, because they will 
constantly live near other peoples 
sharing the same language but under a different name).

Serbian is still perceived and encoded as a single language, even though it still 
uses two scripts, depending on the 
region of use (but it is now rapidly converging to the Latin script). Maybe 
the linguistic and cultural authorities 
of the four concerned countries (or five, now with Kosovo, whose independence 
was recently validated by an 
international court?) will decide to reunite their cultural efforts, if they 
finally all use the same Latin script, 
by adopting a new neutral name (Dolmoslavic, Adriatic, Adrislavic? Or even 
Yugoslavic?) and increasing their 
mutual cultural exchanges instead of wasting them for old nationalist reasons 
(this will be even more important when 
they finally ALL join the European Union, with increased exchanges between 
them).

Philippe.



Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-04 Thread Asmus Freytag

On 8/4/2010 1:30 PM, verdy_p wrote:

Asmus Freytag  wrote:
  
The Fraktur problem is one where one typestyle requires additional 
information (e.g. when to select long s) that is not required for 
rendering the same text in another typestyle. If it is indeed desirable 
(and possible) to create a correctly encoded string that can be rendered 
without further change automatically in both typestyles, then adding any 
necessary variation sequences to ensure that ability might be useful. 
However, that needs to be addressed in the context of a precise 
specification of how to encode texts so that they are dual renderable. 
Only addressing some isolated variation sequences makes no sense.



I don't think so.

If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but 
even in this case, the conversion to long s will be inappropriate. So use the Fraktur 
round s directly.
  
This statement makes clear that you don't understand the rules of 
typesetting text in Fraktur.
If a text in Fraktur absolutely requires the long s, it's only when the original text was already using this long s. 

This statement is also incorrect.

The rules when to use long s in Fraktur and when to use round s depend 
on the position of the character within the word in complicated ways.


The same word, typeset using Antiqua style will not usually have the long s.

For German, there exist a large number of texts that were typeset in 
both formats, so you can compare for yourself. Even in France, I suspect 
that research libraries would have editions of 19th century German 
classics in both formats.
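
A deliberately oversimplified sketch of the positional dependence described
here (cf. the "Rules for Long S" post cited earlier in the thread): round s
word-finally, long s elsewhere. Real Fraktur practice has many more exceptions
(before f, b, k, around hyphens, ...), so this is illustration only, not a
typesetting rule:

```python
# Oversimplified positional rule: replace every non-final "s" in a word
# with U+017F LATIN SMALL LETTER LONG S, keeping word-final "s" round.
LONG_S = "\u017F"

def frakturize(word):
    """Replace non-final 's' with long s (simplified rule of thumb)."""
    return "".join(
        LONG_S if ch == "s" and i < len(word) - 1 else ch
        for i, ch in enumerate(word)
    )

assert frakturize("blessings") == "ble" + LONG_S * 2 + "ings"
assert frakturize("glass") == "gla" + LONG_S + "s"   # final s stays round
```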

In that case, encode the long s: The text will render with a long s in both modern Latin font styles like Bodoni 
(with a possible fallback to modern round s if that font does not have a long s), an in classic Fraktur font 
styles (with here also a possible fallback to Fraktur round s if the Frakut font forgets the long s in its repertoire of supported 
glyphs).
  
I'm skipping the rest of your message because you've started from a 
wrong premise, and sorting out which bits still apply after 
accounting for the wrong premise is not something I have the time, energy, 
or inclination for. 


Sorry,

A./
  





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread David Starner
On Wed, Aug 4, 2010 at 4:33 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote:
 Am Dienstag, 3. August 2010 um 02:47 schrieb David Starner:

 DS ... I don't see why
 DS unspecific forms should be encoded; if you want a nonspecific a, 0061
 DS is the character.

 This is because I take into account the implicit application of a
 variation sequence on a base character by a higher-level protocol,
 which must be overridable in some way.

I don't see why it must be overridable. By not including a variation
sequence, you've left it up to the system to pick a glyph. Whatever
glyph it picks, you have no right to complain. There is no reason for
the system to do anything with the unspecific form variation sequence.

-- 
Kie ekzistas vivo, ekzistas espero.



Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-04 Thread verdy_p
Asmus Freytag 
  If a text was initially using a round s, nothing prohibits it being 
  rendered in Fraktur style, but even in this 
case, the conversion to long s will be inappropriate. So use the Fraktur 
round s directly.
  
 This statement makes clear that you don't understand the rules of 
 typesetting text in Fraktur.
  If a text in Fraktur absolutely requires the long s, it's only when the 
  original text was already using this 
long s. 
 This statement is also incorrect.
 
 The rules when to use long s in Fraktur and when to use round s depend 
 on the position of the character within the word in complicated ways.
 
 The same word, typeset using Antiqua style will not usually have the long s.

So you just demonstrated that IF such a rule exists and is enforceable, then you 
DON'T need the separate encoding. In 
that case you can safely use a round s everywhere, and let all the appropriate 
round s be converted automatically 
to long s according to this rule.

Your false assumption is, in my opinion, that such a rule exists and is 
enforceable for typesetting into Fraktur. 
Everything demonstrates that this is NOT the case: just look into actual 
manuscripts and old books, and you'll very 
frequently find that the same book applied the rules inconsistently, either 
because of a typo made by the printer (or 
the typists composing the pages), or because the printer wanted to respect the 
original orthography used in the author's 
manuscript (the printer decides NOT to decide and maintains that 
orthography, even if it's 
inconsistent).

Now if you're working from an original book that was initially typeset in 
Fraktur, and want to preserve its 
characters as they are, just use the standard round s and the standard long s. You 
don't need ANY variation selector. 
You'll only be interested in adding ZWJ for encoding the ligatures that you 
see in the original document. Render it 
with a Fraktur font and you've done the work correctly. Nothing more is needed.

Now render it with a Bodoni font, and all the long s will be converted to a 
fallback round s, if you use a correct 
typesetting program that will not display squares for missing glyphs. Render it 
on the web in HTML, and the default 
text renderers of browsers will use any font they have (even if you specified 
one, there's no guarantee that it will 
be available, or that the user will not have applied a personal stylesheet for 
their own preferred fonts, so fallback 
fonts will still be used); in that case the browsers will make every effort 
they can to reproduce the original 
distinctions between long s and round s.

Now if you want to render it as a high-quality Bodoni text, you'll use a font 
or renderer that will either display 
ALL the existing distinctions as they are encoded in the text (no need of any 
variation selector for that), or NONE 
of them (all long s will be rendered like round s).


 For German, there exist a large number of texts that were typeset in 
 both formats, so you can compare for yourself. Even in France, I suspect 
 that research libraries would have editions of 19th century German 
 classics in both formats.

Yes, but this is not relevant to the issue. You DON'T need any variation sequence to encode the differences WHERE THEY EXIST. If you want the correct long s in the Fraktur-rendered text, use the standard long s where it belongs and nothing else. The same text will still render with round s in a Bodoni-like font, and will display the Fraktur differences when using a modern font that maps the two characters to two distinct glyphs.

And then only one case remains useful: if you still want some long s in the original Fraktur text to remain long s in a modern style, while others convert to round s, using the SAME font.

Only for this case, what you'll need is NOT the bare long s but REALLY the sequence <long s, VS1>, so that the renderer will know (from the presence of VS1) that this long s is safely convertible to a round s when using a modern font that has mappings for both characters. In other words, the modern font will add a mapping of <long s, VS1> to the same glyph as the round s, instead of just to the long s glyph it would use when ignoring the variation selector. This VS1 will encode the long s that are not absolutely long when rendering in styles other than the original Fraktur (such as Antiqua).
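As a sketch of the behaviour being proposed here (note that <long s, VS1> was only a proposal under discussion, not a registered standardized variation sequence), a toy glyph mapper could treat the sequence as "long s that may safely fall back to round s":

```python
LONG_S = "\u017F"  # LATIN SMALL LETTER LONG S
VS1 = "\uFE00"     # VARIATION SELECTOR-1

def render_glyphs(text, modern_style=True):
    """Toy glyph mapper: in a modern (Antiqua-like) style the sequence
    <long s, VS1> falls back to round s, while a bare long s keeps its
    long-s glyph; in a Fraktur-like style both render as long s."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == LONG_S and i + 1 < len(text) and text[i + 1] == VS1:
            out.append("s" if modern_style else LONG_S)
            i += 2  # consume the variation selector as well
        else:
            out.append(ch)
            i += 1
    return "".join(out)

assert render_glyphs(LONG_S + VS1 + "o", modern_style=True) == "so"
assert render_glyphs(LONG_S + "o", modern_style=True) == LONG_S + "o"
```

A real font would express this through its cmap and OpenType substitution tables rather than in application code; the sketch only illustrates the intended mapping.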

For the reversed conversion (from modern texts to Fraktur), which you would use for fancy new creations, you won't need to encode anything other than the plain round s (which will be converted automatically to the long s where appropriate, applying the strict rules automatically and consistently), plus the explicit long s if you still want to force some others (for fancy reasons) into the document rendered in a Fraktur-like style (but remember that the original was not using them, except where they were forced in the original). With this scheme you'll still be able to preserve the original modern non-Fraktur text.

Philippe.



Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-04 Thread Asmus Freytag

Philippe,

Text typeset in Fraktur contains more information than text typeset in Antiqua. That means there are some places with (mild) ambiguities of representation in the Antiqua version. Not enough to bother a human reader, who can use deep context to read the text correctly, but enough that a mere typesetting system cannot correctly render such a text in Fraktur.


I'm not currently aware of anything that would prevent an automated 
system from converting a text encoded for Fraktur to one encoded for 
Antiqua, because you are merely throwing away information.


So far we agree.

The question is whether it would be possible to make this process work by default in common, unmodified rendering engines, and whether that is desirable. (I don't treat either of these questions as settled one way or the other - so please don't attribute a position to me on that subject.)


What I do know is that there are historic documents using Antiqua fonts that do use the long s. Therefore, in principle, you don't necessarily want to create fonts that map long s to round s. And, as an author, you can't rely on such a font being present on the reader's end - it might equally likely be one that does implement the long s.


So, whatever automatic rendering of Fraktur-ready text with non-Fraktur 
general purpose fonts you have in mind, should not rely on this kind of 
non-standard glyph substitution. That would be a terrible hack, 
imperiling the ability of people to use the long s outside the context 
of the Fraktur tradition.


All I had argued for was that Karl should take out the consideration of 
rendering text encoded for Fraktur from his proposal and make it part of 
a separate document that addresses ALL issues of this type of rendering, 
making it a complete specification - that would be something that allows 
review on its own merits.


A./





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-03 Thread Michael Everson
On 3 Aug 2010, at 01:04, Karl Pentzlin wrote:

 I have compiled a draft proposal:
 Proposal to add Variation Sequences for Latin and Cyrillic letters
 The draft can be downloaded at:
 http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
 The final proposal is intended to be submitted for the next UTC
 starting next Monday (August 9).
 
 Any comments are welcome.

I don't think it is a good idea. In particular the implications for Serbian 
orthography would be most unwelcome.

Michael Everson * http://www.evertype.com/





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-03 Thread Karl Pentzlin
Am Dienstag, 3. August 2010 um 09:45 schrieb Michael Everson:

ME ... In particular the implications
ME for Serbian orthography would be most unwelcome.

Which kind of implications do you refer to?
The proposed variation sequences simply provide a more general access to
typographic details, which now can be accomplished by more complicated
means like implementing locale-specific glyph selection within a font,
and relying on a higher-level protocol supplying the correct locale
information. (Anyway, such means may stay in effect in parallel to
the use of variation sequences.)

One of the advantages of variation sequences is that the glyph
selection is transparent to the user, instead of to be implemented in
each font in a non-standardized way.

- Karl Pentzlin







Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-03 Thread Christoph Päper
Karl Pentzlin:
 
 The proposed variation sequences simply provide a more general access to 
 typographic details, which now can be accomplished by more complicated means 
 like implementing locale-specific glyph selection within a font, and relying 
 on a higher-level protocol supplying the correct locale information.

How is selecting and setting a locale (vulgo: language) once more complicated than making sure every instance of a letter is accompanied by the appropriate VS? They don’t seem very handy for runs of text, but VS are probably the right tool for reference work, e.g. http://en.wikipedia.org/wiki/Cyrillic_alphabet#Letterforms_and_typography. So it makes sense to specify combinations.
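One practical consequence for such reference work: plain-text search and comparison must ignore the selectors, since variation selectors are default-ignorable code points. A minimal fold, assuming only the standard VS ranges, might look like:

```python
def strip_variation_selectors(text):
    """Remove VS1..VS16 (U+FE00..U+FE0F) and the supplementary
    selectors VS17..VS256 (U+E0100..U+E01EF) before matching."""
    return "".join(
        c for c in text
        if not (0xFE00 <= ord(c) <= 0xFE0F or 0xE0100 <= ord(c) <= 0xE01EF)
    )

# A letter carrying a selector compares equal to the bare letter:
assert strip_variation_selectors("\u0431\uFE00") == "\u0431"
```

Normalization does not remove variation selectors, so a search tool has to apply a fold like this one explicitly.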

How did you decide what to include in your proposal, though? There are many 
more variants, even when not taking handwritten forms into account, e.g. ‘u’- 
or ‘v’-based ‘y’ and ‘w’ or uppercase letters with diacritics above rendered 
lower so they’re not using more vertical space than the base letters.





Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-02 Thread Karl Pentzlin
I have compiled a draft proposal:
Proposal to add Variation Sequences for Latin and Cyrillic letters
The draft can be downloaded at:
 http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
The final proposal is intended to be submitted for the next UTC
starting next Monday (August 9).

Any comments are welcome.

- Karl Pentzlin




Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-02 Thread Leo Broukhis
0073 FE00/FE01 - must be LATIN SMALL LETTER S, not LETTER B.

Leo

On Mon, Aug 2, 2010 at 5:04 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote:
 I have compiled a draft proposal:
 Proposal to add Variation Sequences for Latin and Cyrillic letters
 The draft can be downloaded at:
  http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
 The final proposal is intended to be submitted for the next UTC
 starting next Monday (August 9).

 Any comments are welcome.

 - Karl Pentzlin






Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-02 Thread David Starner
On Mon, Aug 2, 2010 at 8:04 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote:
 I have compiled a draft proposal:
 Proposal to add Variation Sequences for Latin and Cyrillic letters
 The draft can be downloaded at:
  http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
 The final proposal is intended to be submitted for the next UTC
 starting next Monday (August 9).

Two things jumped out at me on a quick glance. First, I don't see why
unspecific forms should be encoded; if you want a nonspecific a, 0061
is the character. Secondly, Fraktur and Antiqua are different writing
systems with slightly different orthographies; instead of messing
around with variation sequences, just accept that. If they must be
distinguished, surely the long-s variation sequence could be used in
non-Fraktur fonts, like Blackletter and 18th century-style fonts.

-- 
Kie ekzistas vivo, ekzistas espero.




Romanian and Cyrillic

2004-05-02 Thread D. Starner
I posted this message to the message boards of Distributed Proofreaders-Europe 
dp.rastko.net 
(a joint effort of Project Rastko www.rastko.net and Project Gutenberg 
www.gutenberg.net), 
and got this response from one of the site admins.


 nikola wrote:
 Haha, Romanian used Cyrillic up to the 19th century, so sooner
 or later we WILL have Romanian books in Cyrillic here

Nikola, David refers to the Moldavian situation, which is a little bit different compared to the situation in the modern Romanian state since its formation.

David, here are some preliminary thoughts:

 Prosfilaes wrote:
 From the Unicode mailing list:
 Quote:
 Since we're talking about Romanian...

 Prior to 1991, the Soviet-controlled administration attempted to create
 a distinct linguistic identity, Moldovian, which as I understand it
 basically amounted to Romanian written in Cyrillic script. (They tried
 to introduce some archaic Romanian forms and Russian loans, but
 apparently none of it stuck.)

I expect a gradual influx of Romanian, Moldavian, Tzintzar and Vlach members after May 24. I'm in almost daily contact with our friends and collaborators from Bucharest and Timisoara these days, regarding our Romanian NGO, which is under registration at the moment, and they'll also serve as the medium of our future local Moldavian network.

Before their more detailed opinion, I can offer some analogies from similar cases. A bi- or tri-alphabet situation is not rare in SE European or Eurasian cultures. In previous centuries we find all combinations of parallel use of Cyrillic, Glagolitic, Latin, Greek or Arabic scripts among Serbs, Croats, Romanians, Albanians etc. Religious or ideological affiliations are to blame for the very recent and oppressive reduction to the usage of just one major script, but even now we have the Serbian case, with Cyrillic as the only standard script but the Latin script widely used at the daily social level without prejudice, even in the core of Serbian culture.

Project Rastko's general policy is more or less to OCR/publish the version in the original script, but also to provide transliterated versions in other commonly used scripts. Although we are proponents of having one official script, we publish Serbian works in an additional Latin version so they can easily be read also in the Muslim or Croat areas of former Yugoslavia (which share a common language with Serbian culture).

For Romanian and Moldavian books printed in Cyrillic, I suppose the only logical solution is to apply Rastko's rules: to process them in the original script but to publish in parallel a Latin-script version which modern Romanian readers can read.

Prosfilaes wrote:
 Quote:
 How relevant is Romanian in Cyrillic script at this point? For instance,
 what's the likelihood that someone might want to put Romanian-Cyrillic
 content on the web? Already being done? A reasonable possibility?
 Extremely unlikely?

It is a reasonable possibility. The phenomenon of script is supranational, and for academic purposes it should also be treated as supraconfessional or supraideological.

Prosfilaes wrote:
 I know DP-EU plans to do it sometime, but do we have stuff that could be uploaded 
 tomorrow, 
 or is there something in our plans, or is it something that we'll do if and when 
 something 
 clearable comes along (which will be hard, as this is strictly post-1945.)

Tomorrow? Yes, if it is desperately needed, it could be uploaded in less than 48 hours by the Bucharest guys. More realistically speaking, the end of the summer or the last quarter should be a more systematic phase for the Moldavian case.

Copyright clearability is not an issue, since Rastko's material is mostly by modern authors who gave non-exclusive rights to publish their works on the Net for free.

David, please let us know anything new you learn about this subject, for it could be important for several publishing projects our network prepares. [We have in our computers perhaps 100 eBooks processed in 2003 about Romanian culture, waiting to be posted this year.]




Re: Romanian and Cyrillic

2004-04-30 Thread Radovan Garabik
On Tue, Apr 27, 2004 at 11:29:58PM -0700, Peter Constable wrote:
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 On
 
  Would you need to have the same web-text [in HTML] displayed
  in Romanian as well as in Cyrillic script according to
  the reader's wishes?
 
 It could perhaps be put that way: yes, what I want to know is whether
 there is any potential need to have Romanian-language content such as
 web pages that need to be provided (whether according to a reader's
 wish, or to reflect the form of a historic document) in Cyrillic script
 rather than Latin script.

I did download pages in _Moldavian_ some time ago. There is a singer called Sofia Rotaru, who was rather popular in the Soviet Union, and she used to sing in Russian, Ukrainian and Moldavian (still does - I saw her recently performing on Russian TV, singing songs in all three languages, although I do not know what the last language is called now).
Anyway, I was looking for the lyrics of some songs, and got to a www page with the texts of some of her songs. The page was itself in Russian, but the lyrics were in the respective languages, including Moldavian.

The page seemed to be rather recent, with regular updates etc...

-- 
 ---
| Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__garabik @ melkor.dnp.fmph.uniba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!



Re: Question on Unicode-prevalence (general and for Cyrillic)

2004-03-15 Thread Antoine Leca
Peter Kirk wrote:
 2. A graduate student mentioned that it was her impression that most
 Cyrillic webpages (at least for Russian--her interest) are still not
 encoded in Unicode. (She is doing some research on the use of
 certain words in Russian and wanted to know how best to do the
 search.)

 Google finds matches not just in Unicode
 encoded pages, but also in ones encoded in other Cyrillic encodings

On the other hand, if the student is willing to write some kind of spider herself, it is very likely that she shall have to handle all the encodings, shan't she?
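A sketch of such per-encoding handling: try each legacy Cyrillic encoding and keep the one whose decoding looks most like running Russian text. The candidate list and the lowercase-Cyrillic heuristic are illustrative assumptions, not a robust detector:

```python
CANDIDATES = ["utf-8", "koi8-r", "cp1251", "iso8859_5"]

def guess_cyrillic_encoding(raw):
    """Pick the candidate encoding whose decoding looks most like
    running Russian text (crude heuristic: running text is mostly
    lowercase Cyrillic, U+0430..U+044F)."""
    best, best_score = None, -1.0
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not valid in this encoding
        letters = [c for c in text if c.isalpha()]
        if not letters:
            continue
        score = sum("\u0430" <= c <= "\u044f" for c in letters) / len(letters)
        if score > best_score:
            best, best_score = enc, score
    return best

assert guess_cyrillic_encoding("привет мир".encode("koi8-r")) == "koi8-r"
assert guess_cyrillic_encoding("привет мир".encode("utf-8")) == "utf-8"
```

A production spider would also consult HTTP headers and meta tags before falling back to statistical guessing like this.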


Antoine




Question on Unicode-prevalence (general and for Cyrillic)

2004-03-14 Thread Deborah W. Anderson
Two questions:

1. Is there a way to determine the prevalence of Unicode in electronic file documents 
(vs. documents not in Unicode)? At least for the Web, has anyone done a statistical 
sampling to determine the percentage of Unicode-encoded webpages?

2. A graduate student mentioned that it was her impression that most Cyrillic webpages 
(at least for Russian--her interest) are still not encoded in Unicode. (She is doing 
some research on the use of certain words in Russian and wanted to know how best to do 
the search.) 
Again: Has anyone looked into the situation with Cyrillic in terms of the percentage 
of Web documents in Unicode? 
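One way such a sampling could be approached, assuming pages declare their encoding in a <meta> tag (many do not, so HTTP Content-Type headers and byte-level sniffing would also be needed in practice; the regex and sample page below are only illustrative):

```python
import re

META_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_-]+)""",
    re.IGNORECASE,
)

def declared_charset(html_bytes):
    """Return the charset declared in a <meta> tag, lowercased,
    or None when no declaration is found."""
    m = META_RE.search(html_bytes)
    return m.group(1).decode("ascii").lower() if m else None

def counts_as_unicode(charset):
    return charset in {"utf-8", "utf-16", "utf-16le", "utf-16be", "utf-32"}

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=koi8-r"></head><body></body></html>')
assert declared_charset(page) == "koi8-r"
assert not counts_as_unicode(declared_charset(page))
```

Running this over a crawled sample and tallying `counts_as_unicode` would give a rough lower bound on the share of Unicode-encoded pages.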

With thanks,
Debbie Anderson

Deborah Anderson
Researcher, Dept. of Linguistics
UC Berkeley
Email: [EMAIL PROTECTED]
or [EMAIL PROTECTED]
Script Encoding Initiative: www.linguistics.berkeley.edu/~dwanders
 




