Re: transliteration of mjagkij znak (Cyrillic soft sign)
The prime is used for soft-sign transliteration to avoid ambiguity: the apostrophe is reserved for the apostrophe itself, a common sign in Ukrainian and Belarusian.
Re: transliteration of mjagkij znak (Cyrillic soft sign)
In Ukrainian, for example, both “ь” and “`” are used. “ь” marks a softer pronunciation of the preceding consonant (тіньовий), whilst “`” is used to separate the consonant from the following vowel, as if that vowel began a word, even when it would otherwise sound soft (пом`якшення -- the last “я” sounds softer than the former one). Regards, Konstantin 2016-02-11 18:05 GMT+04:00 QSJN 4 UKR: > I can show an example of the use of both, prime (as soft sign) and apostrophe > (hemisoft), in Cyrillic-based phonetic transcription (Orthoepic > Dictionary of Ukrainian, http://padaread.com/?book=84816=6 > http://padaread.com/?book=84816=7) >
Re: transliteration of mjagkij znak (Cyrillic soft sign)
I can show an example of the use of both, the prime (as soft sign) and the apostrophe (hemisoft), in Cyrillic-based phonetic transcription (Orthoepic Dictionary of Ukrainian, http://padaread.com/?book=84816=6 http://padaread.com/?book=84816=7)
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 2/11/2016 6:05 AM, QSJN 4 UKR wrote: I can show an example of the use of both, prime (as soft sign) and apostrophe (hemisoft), in Cyrillic-based phonetic transcription (Orthoepic Dictionary of Ukrainian, http://padaread.com/?book=84816=6 http://padaread.com/?book=84816=7) Can you give the number of the entry on that page? I've found the prime, but I do not see an apostrophe. What I see is a combining apostrophe (similar to the way CARON is rendered as a raised comma when following "d"). A./
RE: transliteration of mjagkij znak (Cyrillic soft sign)
And so it is, also in the library world, both before and after Unicode: for miagkii znak the prime is prescribed. The prime is also prescribed for some uses in standard transliteration of Tibetan and Hebrew/Arabic/Persian/Pushto. See, e.g., the relevant tables on https://www.loc.gov/catdir/cpso/roman.html: Tibetan: When two full forms of letters are stacked, as in Sanskritized Tibetan, there is no need to indicate the stacking. However, in the two cases noted here a modified letter prime should be inserted between the two consonants for the purpose of disambiguation. ཏྶ་ tʹsa ཙ་ tsa ནྱ་ nʹya ཉ་ nya Hebrew: A single prime ( ʹ ) is placed between two letters representing two distinct consonantal sounds when the combination might otherwise be read as a digraph. hisʹhid Persian: When the affix and the word with which it is connected grammatically are written separately in Persian, the two are separated in romanization by a single prime ( ʹ ). khānahʹhā Martin Heijdra -----Original Message----- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Michael Everson Sent: Tuesday, February 09, 2016 8:43 AM To: Unicode Discussion Subject: Re: transliteration of mjagkij znak (Cyrillic soft sign) On 9 Feb 2016, at 05:31, Asmus Freytag (t) <asmus-...@ix.netcom.com> wrote: > Without scouring the book I don't know whether there's another place in it > where something's unquestioningly the prime. In that case we could figure out > whether its appearance is simply the way that font does it. Alternatively, if > making double prime look different from two single primes, perhaps that's > common enough across fonts, and would help to lay any doubts to rest - but > so far, what I see is a spacing acute. Well, Asmus, it isn’t one. We linguists have been taught it’s the prime. https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics Michael Everson * http://www.evertype.com/
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 9 Feb 2016, at 05:31, Asmus Freytag (t) wrote: > Without scouring the book I don't know whether there's another place in it > where something's unquestioningly the prime. In that case we could figure out > whether its appearance is simply the way that font does it. Alternatively, if > making double prime look different from two single primes, perhaps that's > common enough across fonts, and would help to lay any doubts to rest - but > so far, what I see is a spacing acute. Well, Asmus, it isn’t one. We linguists have been taught it’s the prime. https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics Michael Everson * http://www.evertype.com/
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 2/8/2016 5:47 PM, Michael Everson wrote: It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/ Source? A./
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 2/8/2016 6:39 PM, Charlie Ruland wrote: On 09.02.2016, Asmus Freytag (t) wrote: On 2/8/2016 5:47 PM, Michael Everson wrote: It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/ Source? A./ Look at tables 27.1 (p. 348) and 27.2 (p. 351) of Paul Cubberley’s The Slavic Alphabets (=Peter T. Daniels and William Bright (eds.): The World’s Writing Systems, pp. 346–355). Obviously the soft sign <ь> is transliterated as a prime <ʹ>, and the hard sign <ъ> as a double prime <ʺ>. Also note that <ѓ> [gʲ] is Romanized as <ǵ>, which can hardly be considered an apostrophe above. I looked. The <ǵ> looks like a g-acute. However, the "ink" for that acute matches the ink for the prime for <ь>, which is otherwise at the wrong angle compared to the double prime. (Does not look like one half of the double prime - the slight difference in weight would be more typical of single/double symbols). Without scouring the book I don't know whether there's another place in it where something's unquestioningly the prime. In that case we could figure out whether its appearance is simply the way that font does it. Alternatively, if making double prime look different from two single primes, perhaps that's common enough across fonts, and would help to lay any doubts to rest - but so far, what I see is a spacing acute. A./
Re: transliteration of mjagkij znak (Cyrillic soft sign)
It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 09.02.2016, Asmus Freytag (t) wrote: On 2/8/2016 5:47 PM, Michael Everson wrote: It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/ Source? A./ Look at tables 27.1 (p. 348) and 27.2 (p. 351) of Paul Cubberley’s /The Slavic Alphabets/ (=Peter T. Daniels and William Bright (eds.): /The World’s Writing Systems/, pp. 346–355). Obviously the soft sign <ь> is transliterated as a prime <ʹ>, and the hard sign <ъ> as a double prime <ʺ>. Also note that <ѓ> [gʲ] is Romanized as <ǵ>, which can hardly be considered an apostrophe above. Charlie
transliteration of mjagkij znak (Cyrillic soft sign)
Hello, I am wondering how U+02B9 MODIFIER LETTER PRIME made its way into the Unicode repertoire, and how it acquired its comment “transliteration of mjagkij znak (Cyrillic soft sign: palatalization)”. ISO/R 9:1954 through ISO/R 9:1986 map the mjagkij znak “ь” to the apostrophe, and so does DIN 1460:1982. The latter clearly depicts the apostrophe that later became U+02BC, while I am not sure whether ISO/R 9 also does so or rather depicts a glyph like U+0027. (All of these standards predate Unicode, so they just depict glyphs.) ISO 9:1995 maps the mjagkij znak “ь” to the prime, specifically to the modifier letter U+02B9, in accordance with the comment in the Unicode charts. Unicode archeologists, can you shed some light on the history of both U+02B9 and the mjagkij znak? And linguists, can you tell me how the mjagkij znak is normally transliterated, as an apostrophe or as a prime? Thanks for any comments, Otto
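[Editorial addendum] The ISO 9:1995 convention discussed above (ь → U+02B9, and, per Cubberley's tables cited later in this thread, ъ → double prime) can be sketched as a tiny Python lookup. This is a minimal illustration, not a full ISO 9 implementation; the function name and the capital-letter entries are assumptions:

```python
# Minimal sketch: transliterate only the Cyrillic soft and hard signs,
# per ISO 9:1995, which maps the soft sign to U+02B9 MODIFIER LETTER
# PRIME (and the hard sign, per the scientific convention cited in this
# thread, to U+02BA MODIFIER LETTER DOUBLE PRIME).
TRANSLIT = {
    'ь': '\u02B9',  # soft sign -> modifier letter prime
    'Ь': '\u02B9',
    'ъ': '\u02BA',  # hard sign -> modifier letter double prime
    'Ъ': '\u02BA',
}

def translit_signs(text: str) -> str:
    """Replace only the soft/hard signs, leaving everything else intact."""
    return ''.join(TRANSLIT.get(ch, ch) for ch in text)

print(translit_signs('область'))  # -> 'област' + U+02B9
```

Note that U+02B9/U+02BA are *modifier letters* (General Category Lm), so the transliterated string remains all-letters for word-boundary purposes, which is one practical argument for the prime over U+0027.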
Precomposed Cyrillic letters
From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, Addition of two letters from Montenegrin language, CYRILLIC script: 9. Can any of the proposed characters be encoded using a composed character sequence of either existing characters or other proposed characters? No Saying it doesn't make it so: Annex 1: Character shapes (related to section B, item 4b) Cyrillic small letter SJ с́ 0441 0301 Cyrillic capital letter SJ С́ 0421 0301 Cyrillic small letter ZJ з́ 0437 0301 Cyrillic capital letter ZJ З́ 0417 0301 Quite a few fonts don't display these well (and quite a few do), but of course that's a font problem, not an encoding problem. Cf. http://www.unicode.org/faq/char_combmark.html#11 -- Doug Ewell | http://ewellic.org | Thornton, CO
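[Editorial addendum] Doug's point, that the Montenegrin letters are representable as combining sequences rather than needing new precomposed codepoints, can be checked directly: the Unicode composition data contains no precomposed Cyrillic es-acute or ze-acute, so NFC leaves the two-codepoint sequences alone. A quick standard-library Python check:

```python
import unicodedata

# The proposed Montenegrin letters, encoded as base + U+0301 COMBINING
# ACUTE ACCENT, exactly as the proposal's own Annex 1 shows.
sj = '\u0441\u0301'  # Cyrillic small es + combining acute
zj = '\u0437\u0301'  # Cyrillic small ze + combining acute

# NFC would compose these if precomposed characters existed; none do,
# so each sequence survives normalization unchanged as two code points.
for seq in (sj, zj):
    assert unicodedata.normalize('NFC', seq) == seq
    assert len(seq) == 2
print('no precomposed forms exist; NFC keeps the combining sequences')
```

Whether a given font renders the acute well over с and з is, as Doug says, a font problem, not an encoding problem.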
Re: Precomposed Cyrillic letters
On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell d...@ewellic.org wrote: From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, Addition of two letters from Montenegrin language, CYRILLIC script: 9. Can any of the proposed characters be encoded using a composed character sequence of either existing characters or other proposed characters? No Saying it doesn't make it so: Right, although I doubt that the proposers monitor this mailing list... In case an interested party is listening: If sr-ME needs different locale data than sr, then one could contribute such data to CLDR http://cldr.unicode.org/. See the current state: http://unicode.org/cldr/trac/browser/trunk/common/main/sr_Cyrl_ME.xml markus
Re: Precomposed Cyrillic letters
On Thu, 9 Jul 2015 09:37:21 -0700 Markus Scherer markus@gmail.com wrote: On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell d...@ewellic.org wrote: From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, Addition of two letters from Montenegrin language, CYRILLIC script: 9. Can any of the proposed characters be encoded using a composed character sequence of either existing characters or other proposed characters? No Saying it doesn't make it so: Is there a requirement to answer those questions truthfully? Right, although I doubt that the proposers monitor this mailing list... In case an interested party is listening: If sr-ME needs different locale data than sr, then one could contribute such data to CLDR http://cldr.unicode.org/. See the current state: http://unicode.org/cldr/trac/browser/trunk/common/main/sr_Cyrl_ME.xml Presumably http://cldr.unicode.org/index/survey-tool/accounts is the most relevant page for someone with credibility. However, as Montenegro has an army and a navy, you have the wrong locale. It's still waiting for a language code. See the language family panels at https://en.wikipedia.org/wiki/Eastern_Herzegovinian_dialect and https://en.wikipedia.org/wiki/Montenegrin_language for the extreme Balkanisation. But in short, yes we need the extra Cyrillic letters с́ and з́ and Latin letters ś and ź for the exemplar characters in sr_Cyrl_ME and sr_Latn_ME (or should that be sr_ME?). I can't work out the status of Montenegrin Latin {sj} and {zj}. Richard.
Re: Precomposed Cyrillic letters
Richard Wordingham richard dot wordingham at ntlworld dot com wrote: Presumably http://cldr.unicode.org/index/survey-tool/accounts is the most relevant page for someone with credibility. However, as Montenegro has an army and a navy, you have the wrong locale. It's still waiting for a language code. See the language family panels at https://en.wikipedia.org/wiki/Eastern_Herzegovinian_dialect and https://en.wikipedia.org/wiki/Montenegrin_language for the extreme Balkanisation. Montenegro could have all the military power in the world, but that doesn't make Montenegrin a distinct language. It's a dialect of Serbian. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Old Cyrillic Yest
2012/11/12 QSJN 4 UKR qsjn4ukr at gmail dot com wrote: Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... 2012/11/23 Doug Ewell d...@ewellic.org How many truly different letters, old and new, are we talking about? On November 12 you wrote, UKRAINIAN IE and BROAD YEST is the same letter in fact. It would not make sense to assign a new BROAD YEST letter if it is really the same as UKRAINIAN IE, and if existing texts already use UKRAINIAN IE to represent it.

Full picture (Meaning - Glyph - Codepoint):

Old Church Slavonic:
- Narrow Yest (regular form) - very narrow half-moon - 0404/0454 (ambiguous) and 0415/0435 (probably the wrong glyph will be rendered); there are no certain codepoints.
- Broad Yest (special form: initial, plural disambiguator) - broad half-moon, identical to Ukrainian Ie or perhaps somewhat larger (breaking the baseline) - 0404/0454 indeed.

Modern imitations of Church Slavonic, really old texts, or texts where Broad and Narrow Yest are hard to distinguish:
- Ambiguous Yest - identical to Ukrainian Ie, or like Narrow Yest in an old-style font - 0404/0454, surely.

Modern languages:
- Ie - rectangular capital / closed rounded small (identical to Latin) - 0415/0435.
- Ukrainian Ie - identical to ambiguous Yest - 0404/0454.

So there are two steps. First, required: a separate codepoint for the Narrow Yest. It is just impossible to work with Church Slavonic texts without this, because the wrong glyph is rendered almost always (you must understand, we cannot rely on language detection, since such a text is certain to contain a mix: old text with a modern translation) - or else there is no way to show the Broad Yest at all. Second, optional: a separate codepoint for the Broad Yest. That is only necessary if one part of a text contains the ambiguous Yests (coded as now, 0404/0454, without changes!) while another part contains the Broad Yests and the author can/wants to show this feature. Am I the only man in the world who thinks that Unicode is poorly adapted for Church Slavonic?
Re: Old Cyrillic Yest
2013/1/29 QSJN 4 UKR qsjn4...@gmail.com I found something terrible. Sorry, I did not take a photo. It is a modern book with this text of Meletius Smotrytsky's Grammar [http://litopys.org.ua/smotrgram/sm11.htm], but a reprint, not a facsimile like the one I refer to. It gives the rules for using BROAD YEST and NARROW YEST. The modern publisher used GREEK EPSILON and UKRAINIAN IE to show NARROW and BROAD YEST. Hah! Try to guess which is which. The funniest part is the examples: тѣм творцєм - тым творцєм (it has to be тѣм творцεм (singular) - тым творцєм (plural)). :) :) :) Or vice versa: plural - singular. I didn't get it!
Re: Old Cyrillic Yest
I found something terrible. Sorry, I did not take a photo. It is a modern book with this text of Meletius Smotrytsky's Grammar [http://litopys.org.ua/smotrgram/sm11.htm], but a reprint, not a facsimile like the one I refer to. It gives the rules for using BROAD YEST and NARROW YEST. The modern publisher used GREEK EPSILON and UKRAINIAN IE to show NARROW and BROAD YEST. Hah! Try to guess which is which. The funniest part is the examples: тѣм творцєм - тым творцєм (it has to be тѣм творцεм (singular) - тым творцєм (plural)).
Re: Old Cyrillic Yest
On 29 Nov 2012, at 08:57, QSJN 4 UKR qsjn4...@gmail.com wrote: Yes, maybe, probably. The truly different glyph is the NARROW YEST. The truly special character name belongs to the BROAD YEST, YAKORNOYE YEST, while the narrow one, like the modern UKRAINIAN є, is just IE or YEST. Well, I don't know, would you please read the Wikipedia or something: http://ru.wikipedia.org/wiki/Якорное_Е (N. B. There is only one source reference in the Wiki article. Dark night!). There are ways of making a case for disunification. Qsjn 4 Ukr has not made them. Michael Everson * http://www.evertype.com/
Old Cyrillic Yest
Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... Please regulate it! The Unicode Standard has some codepoints for other broad Cyrillic letters: U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (a misnomer; it is a broad o). Adding new codepoints for the BROAD YEST does not solve the problem: as I said, UKRAINIAN IE and BROAD YEST are in fact the same letter. Adding new codepoints for the NARROW YEST is a bad idea too, since existing texts use U+0404/0454 for NARROW YEST more often than for BROAD YEST (simply because the broad form is rarer). So we need as many as 4 new codepoints in the U+A6xx block, for CYRILLIC CAPITAL and SMALL LETTER BROAD and NARROW YEST. That way we shall be able to use both discernible letters of Old Cyrillic, and we shall not mix them with the modern Ukrainian letters, nor with each other.
Re: Old Cyrillic Yest
Telling font designers how to do their job (even if it's within Unicode's purview, which I doubt) by adding new codepoints is a novel idea, to say the least. Leo On Mon, Nov 12, 2012 at 3:32 AM, QSJN 4 UKR qsjn4...@gmail.com wrote: Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... Please regulate it! The Unicode Standard has some codepoints for other broad Cyrillic letters: U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (a misnomer; it is a broad o). Adding new codepoints for the BROAD YEST does not solve the problem: as I said, UKRAINIAN IE and BROAD YEST are in fact the same letter. Adding new codepoints for the NARROW YEST is a bad idea too, since existing texts use U+0404/0454 for NARROW YEST more often than for BROAD YEST (simply because the broad form is rarer). So we need as many as 4 new codepoints in the U+A6xx block, for CYRILLIC CAPITAL and SMALL LETTER BROAD and NARROW YEST. That way we shall be able to use both discernible letters of Old Cyrillic, and we shall not mix them with the modern Ukrainian letters, nor with each other.
Re: Old Cyrillic Yest
QSJN 4 UKR qsjn4ukr at gmail dot com wrote: Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... Please regulate it! The Unicode Consortium does not regulate this aspect of fonts, nor should it, except to say that glyphs have to represent the true abstract character, and not display, say, a B-like glyph at the code point for the letter A. If you are saying that Chapter 7.4 of TUS needs a description of these two abstract characters, that seems fair, but that is as far as the regulating goes. The Unicode Standard has some codepoints for other broad Cyrillic letters: U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (a misnomer; it is a broad o). Adding new codepoints for the BROAD YEST does not solve the problem: as I said, UKRAINIAN IE and BROAD YEST are in fact the same letter. Adding new codepoints for the NARROW YEST is a bad idea too, since existing texts use U+0404/0454 for NARROW YEST more often than for BROAD YEST (simply because the broad form is rarer). So we need as many as 4 new codepoints in the U+A6xx block, for CYRILLIC CAPITAL and SMALL LETTER BROAD and NARROW YEST. That way we shall be able to use both discernible letters of Old Cyrillic, and we shall not mix them with the modern Ukrainian letters, nor with each other. This would create duplicate encodings for existing text, a Bad Thing. If this is genuinely a problem, the improved explanation in Chapter 7.4 (above) would be a better solution. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
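[Editorial addendum] The "duplicate encoding" trap Doug warns about can be illustrated with an existing confusable pair: Latin A (U+0041) and Cyrillic А (U+0410) render identically in most fonts, yet no normalization form equates them, so comparison and search silently fail. A second codepoint for Yest would create the same hazard inside Cyrillic itself. A short Python illustration:

```python
import unicodedata

latin = 'A'       # U+0041 LATIN CAPITAL LETTER A
cyr = '\u0410'    # U+0410 CYRILLIC CAPITAL LETTER A

# Visually identical in most fonts, but distinct characters: no Unicode
# normalization form unifies them, so string matching breaks silently.
assert latin != cyr
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    assert unicodedata.normalize(form, latin) != unicodedata.normalize(form, cyr)

print(unicodedata.name(cyr))  # CYRILLIC CAPITAL LETTER A
```

Existing text encoded with U+0404/0454 would face exactly this problem against any newly disunified Yest codepoints.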
Re: [indic] Indic Transliteration Standards in Cyrillic & Greek
On Sat, Nov 10, 2012 at 12:49 PM, Vinodh Rajan vinodh.vin...@gmail.com wrote: Hi, There are several standards for transliterating Indic scripts to Roman characters, such as IAST, ISO 15919 etc. I would like to know if any similar standards exist for expressing the Indic set in Greek & Cyrillic with special diacritics. If they do exist, any pointers to their Unicode representations? Thanks V -- http://www.virtualvinodh.com Vinodh, These resources will help: http://transliteration.eki.ee/pdf/Russian.pdf http://en.wikipedia.org/wiki/Scientific_transliteration_of_Cyrillic http://learningrussian.net/pronunciation/transliteration.php N. Ganesan
Re: [indic] Re: Indic Transliteration Standards in Cyrillic & Greek
On Sat, Nov 10, 2012 at 3:02 PM, John Hudson j...@tiro.ca wrote: I'm sorry, I misread the original question. I'm not aware of particular Cyrillic or Greek transcription systems for Indic scripts or languages. My suspicion is that Russian systems exist, given the historic interests of Russian linguistic studies. I'm doubtful that Greek systems exist, but would be happy to be proven wrong. JH My guess is that Vinodh wants to add the capacity to convert from Indic to Cyrillic scripts: one way would be to use the Latin letters and then convert to Cyrillic, http://transliteration.eki.ee/pdf/Russian.pdf But for some letters, say in Tamil, there won't be equivalents in Cyrillic. N. Ganesan
Indic Transliteration Standards in Cyrillic & Greek
Hi, There are several standards for transliterating Indic scripts to Roman characters, such as IAST, ISO 15919 etc. I would like to know if any similar standards exist for expressing the Indic set in Greek & Cyrillic with special diacritics. If they do exist, any pointers to their Unicode representations? Thanks V -- http://www.virtualvinodh.com
Re: Indic Transliteration Standards in Cyrillic & Greek
At the least, there should exist conventions in all languages for transliterating into their own script an IPA representation (used as a central phonetic transcription, where the source language would be noted using its subset of IPA, representing its underlying phonology rather than one particular phonetic realization). These phonologic IPA representations should then find a good approximation in the target (script/language) pair, in order to produce consistent phonologic transcriptions that are read correctly in the target language. Pure transliterations are most often unreadable, or read very incorrectly (even if the target language has good support for representing the most frequent realizations of a phonologic phoneme of the source language). This scheme could also help transcriptions from one language to another that share the same script (e.g. English cheese transcribed in French as tchise, ignoring the representation of long vowels, which are not heard in the target French, or tchiise, but not tchīse, as the macron is not read distinctly in French). You may argue that we don't need this because we already have IPA, but IPA is unreadable by most people, and there is still the need to use more conventional symbols (and IPA is completely unreadable for readers of scripts other than Latin, Greek or Cyrillic). The application would be to transliterate people's names or toponyms in postal addresses, contact lists or administrative forms to be used in foreign countries where people can't decipher other scripts (such as Arabic or sinograms), or in airports for travelling, or to avoid people inventing their own choice of name in another script, in such a way that the chosen name is not registered and verifiable anywhere (unless these people have officially registered their alternate usage names in their own country, but very few countries permit the registration of such usage names by individual people). As for those countries that allow registration of people's names in scripts other than the national one, most often they will only allow the use of the Latin script (and frequently a very restricted subset of it), but not Arabic, Greek, Cyrillic, or Japanese kanas. To help this process, those countries use their own national standard of transliterators to the Latin script (i.e. romanizations), simply because it is the most widely known and used internationally (and in all computer applications), and they have no other support for registering additional usage names in other scripts, or for registering additional usage names that would depend on the target language (so the single supported romanization will also be read incorrectly in many target languages, or could be offensive in those target languages, and travellers may want to use another usage name in those target countries). 2012/11/10 Vinodh Rajan vinodh.vin...@gmail.com: Hi, There are several standards for transliterating Indic scripts to Roman characters, such as IAST, ISO 15919 etc. I would like to know if any similar standards exist for expressing the Indic set in Greek & Cyrillic with special diacritics. If they do exist, any pointers to their Unicode representations? Thanks V -- http://www.virtualvinodh.com
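[Editorial addendum] Philippe's two-step scheme (source orthography → phonemic IPA → target orthography) can be sketched as two lookup passes. Everything below is a toy: the tiny tables cover only the "cheese" → "tchise" example from his message, and every mapping in them is an illustrative assumption, not a real transcription system:

```python
# Toy two-stage transcription: source orthography -> phonemic IPA ->
# target orthography. The tables below cover just the English word
# "cheese"; all mappings are illustrative assumptions.
ENGLISH_TO_IPA = {'ch': 'tʃ', 'ee': 'iː', 'se': 'z'}
IPA_TO_FRENCH = {'tʃ': 'tch', 'iː': 'i', 'z': 'se'}

def transcribe(word: str, to_ipa: dict, from_ipa: dict) -> str:
    # Greedy longest-match segmentation over the source table, then a
    # second lookup from the IPA pivot into the target orthography.
    out, i = [], 0
    while i < len(word):
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in to_ipa:
                out.append(from_ipa[to_ipa[chunk]])
                i += length
                break
        else:
            out.append(word[i])  # pass through anything unmapped
            i += 1
    return ''.join(out)

print(transcribe('cheese', ENGLISH_TO_IPA, IPA_TO_FRENCH))  # tchise
```

The point of the pivot is that each language pair needs only two tables (orthography↔IPA) rather than a table per pair of languages.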
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
We've got the example of the ISO 9 standard itself. On 5 March 2012 at 22:46, Michael Everson ever...@evertype.com wrote: On 5 Mar 2012, at 20:13, Benjamin M Scarborough wrote: There is a clear precedent here that the unifications of N2463 are not necessarily the final fate of any of these characters. If the О Е letter for Selkup should be disunified from U+0152/U+0153, then a proposal needs to be submitted calling for the addition of the two letters to the UCS. Have you got examples, Ben? Michael Everson * http://www.evertype.com/
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
On Tue, Feb 28, 2012 at 4:00 AM, Philippe Verdy verd...@wanadoo.fr wrote: I am looking for the codes or assignment status of the Cyrillic letters OE/oe (ligatured) as used in Selkup (exactly similar to the Latin pair). This character pair has been part of registration nr. 223 (1998) by ISO of the (8-bit) extended Cyrillic character set for non-Slavic languages for bibliographic information interchange: http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf According to this document, this character set had also been standardized as ISO 10756:1996. Note that it contains many other characters for which it did not document any mapping to the UCS in the then-emerging ISO 10646 standard. It was even part of proposals at the UTC and ISO that same year for inclusion in the UCS, along with other characters (at that time, Michael Everson wrote a proposal placing them at U+04EC, U+04ED, but since then those slots have been used for other characters; that block is now full). It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard. Still, there's no such Cyrillic character I can find in the encoded UCS in the other Cyrillic extended blocks that are not full (for example, the CYRILLIC SUPPLEMENT block at U+0500-052F). Where are those characters? And what about the remaining characters found in Registration nr. 223 and ISO 10756:1996? And their status in the ISO 9 standard itself? Thanks. -- Philippe. According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the Cyrillic Selkup OE is mapped to Latin OE: CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE Several other of those missing Cyrillic characters are simply mapped to Latin ones or sort of decomposed. - Denis Moyogo Jacquerye
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
On Mon, Mar 5, 2012 at 19:35, Denis Jacquerye wrote: According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the Cyrillic Selkup OE is mapped to Latin OE: CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE Several other of those missing Cyrillic characters are simply mapped to Latin ones or sort of decomposed. N2463 also maps twelve characters from ISO 10574 that have been disunified since 2002, namely: 04/06 CYRILLIC SMALL LETTER KURDISH QA is now U+051B CYRILLIC SMALL LETTER QA 04/09 CYRILLIC SMALL LETTER EL WITH MIDDLE HOOK is now U+0521 CYRILLIC SMALL LETTER EL WITH MIDDLE HOOK 04/10 CYRILLIC SMALL LETTER MORDVIN EL KA is now U+0515 CYRILLIC SMALL LETTER LHA 04/14 CYRILLIC SMALL LETTER EN WITH MIDDLE HOOK is now U+0523 CYRILLIC SMALL LETTER EN WITH MIDDLE HOOK 05/06 CYRILLIC CAPITAL LETTER KURDISH QA is now U+051A CYRILLIC CAPITAL LETTER QA 05/09 CYRILLIC CAPITAL LETTER EL WITH MIDDLE HOOK is now U+0520 CYRILLIC CAPITAL LETTER EL WITH MIDDLE HOOK 05/10 CYRILLIC CAPITAL LETTER MORDVIN EL KA is now U+0514 CYRILLIC CAPITAL LETTER LHA 05/14 CYRILLIC CAPITAL LETTER EN WITH MIDDLE HOOK is now U+0522 CYRILLIC CAPITAL LETTER EN WITH MIDDLE HOOK 06/03 CYRILLIC SMALL LETTER ER KA is now U+0517 CYRILLIC SMALL LETTER RHA 06/08 CYRILLIC SMALL LETTER KURDISH WE is now U+051D CYRILLIC SMALL LETTER WE 07/03 CYRILLIC CAPITAL LETTER ER KA is now U+0516 CYRILLIC CAPITAL LETTER RHA 07/08 CYRILLIC CAPITAL LETTER KURDISH WE is now U+051C CYRILLIC CAPITAL LETTER WE There is a clear precedent here that the unifications of N2463 are not necessarily the final fate of any of these characters. If the О Е letter for Selkup should be disunified from U+0152/U+0153, then a proposal needs to be submitted calling for the addition of the two letters to the UCS. It is worth noting that N2463 also decomposes four characters using U+0335, a practice which hasn't been used for decompositions since Unicode 1.1. 
I also don't understand the mapping of 04/05 CYRILLIC SMALL LETTER CHECHEN KA and 05/05 CYRILLIC CAPITAL LETTER CHECHEN KA to U+043A CYRILLIC SMALL LETTER KA + U+030A COMBINING RING ABOVE and U+041A CYRILLIC CAPITAL LETTER KA + U+030A COMBINING RING ABOVE, respectively. Is the character shown in ISO 10574 just a glyph variant of this combining sequence? —Ben Scarborough
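[Editorial addendum] The disunifications Ben lists can be spot-checked against any current copy of the Unicode Character Database, for instance via Python's `unicodedata` module (the names below are taken directly from his list; the script is just a convenience check):

```python
import unicodedata

# Spot-check a few of the disunified Cyrillic Supplement characters
# listed above against the UCD shipped with Python.
expected = {
    '\u051A': 'CYRILLIC CAPITAL LETTER QA',
    '\u051B': 'CYRILLIC SMALL LETTER QA',
    '\u0514': 'CYRILLIC CAPITAL LETTER LHA',
    '\u0515': 'CYRILLIC SMALL LETTER LHA',
    '\u051C': 'CYRILLIC CAPITAL LETTER WE',
    '\u051D': 'CYRILLIC SMALL LETTER WE',
}
for ch, name in expected.items():
    assert unicodedata.name(ch) == name
print('all names match the UCD')
```

This confirms that the once-unified Kurdish and Mordvin letters now have their own Cyrillic codepoints, which is the precedent Ben cites for a possible Selkup OE disunification.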
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
Le 5 mars 2012 19:35, Denis Jacquerye moy...@gmail.com a écrit : On Tue, Feb 28, 2012 at 4:00 AM, Philippe Verdy verd...@wanadoo.fr wrote: I am looking for the codes or assignements status of the Cyrillic letter OE/oe (ligatured) as used in Selkup (exactly similar to the Latin pair). This character pair has been part of the registration nr. 223 (in 1998) by ISO of the (8-bit) extended Cyrillic character set for non-Slavic languages for bibliographic information interchange : http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf According to this document, this character set had also been standardized as ISO 10756:1996. Note that it contains many other characters for which it did not document any mapping to the UCS in the then emerging ISO 10646 standard. It has even been part of proposals at the UTC and ISO the same year for including in the UCS, along with other characters (at that time, Michael Everson wrote a proposal, placing them in U+04EC, U+04ED, but since the, the slots have been used for other characters (that block is now full). It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard. Still, there's no Cyrillic character I can find in the encoded UCS in other Cyrillic extended blocks that are not full (for example, the CYRILLIC SUPPLEMENT block at U+0500-052F). Where are those characters ? And what about the remaining characters found in the Registration nr. 223 and ISO 10756:1996 ? And their status in the ISO 9 standard itself ? Thanks. -- Philippe. According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the Cyrillic Selkup OE is mapped to Latin OE: CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE Several other of those missing Cyrillic characters are simply mapped to Latin ones or sort of decomposed. Apparently this document is obsolete. 
Some of the proposed mappings to Latin have been encoded as plain Cyrillic letters, such as CYRILLIC SMALL LETTER KURDISH QA (not the initially proposed mapping to LATIN SMALL LETTER Q). This document was still a draft, not a decision. The document specifically says: "The issue with these letters is whether they should be deunified from Latin, and encoded in the Cyrillic block."
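Both outcomes mentioned above can be verified against current Unicode data: the proposed Selkup OE mapping targets are the Latin OE ligature code points, while the Kurdish letters ended up disunified as Cyrillic QA and WE in the Cyrillic Supplement block. A small Python check with the standard `unicodedata` module:

```python
import unicodedata

# N2463's proposed mapping targets for the Selkup OE pair:
assert ord(unicodedata.lookup("LATIN CAPITAL LIGATURE OE")) == 0x0152
assert ord(unicodedata.lookup("LATIN SMALL LIGATURE OE")) == 0x0153

# By contrast, the Kurdish letters were eventually disunified from
# Latin Q and W and given their own Cyrillic code points:
for cp in (0x051A, 0x051B, 0x051C, 0x051D):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
# U+051A  CYRILLIC CAPITAL LETTER QA
# U+051B  CYRILLIC SMALL LETTER QA
# U+051C  CYRILLIC CAPITAL LETTER WE
# U+051D  CYRILLIC SMALL LETTER WE
```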
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
On 5 Mar 2012, at 20:13, Benjamin M Scarborough wrote: There is a clear precedent here that the unifications of N2463 are not necessarily the final fate of any of these characters. If the О Е letter for Selkup should be disunified from U+0152/U+0153, then a proposal needs to be submitted calling for the addition of the two letters to the UCS. Have you got examples, Ben? Michael Everson * http://www.evertype.com/
CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
I am looking for the codes or assignment status of the Cyrillic letter OE/oe (ligatured) as used in Selkup (exactly like the Latin pair). This character pair has been part of registration nr. 223 (1998) by ISO of the (8-bit) extended Cyrillic character set for non-Slavic languages for bibliographic information interchange: http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf According to this document, this character set had also been standardized as ISO 10756:1996. Note that it contains many other characters for which it did not document any mapping to the UCS in the then-emerging ISO 10646 standard. It was even part of proposals at the UTC and ISO the same year for inclusion in the UCS, along with other characters (at that time, Michael Everson wrote a proposal placing them at U+04EC, U+04ED, but since then, those slots have been used for other characters; that block is now full). It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard. Still, there's no such Cyrillic character I can find in the encoded UCS in other Cyrillic extended blocks that are not full (for example, the CYRILLIC SUPPLEMENT block at U+0500-052F). Where are those characters? And what about the remaining characters found in registration nr. 223 and ISO 10756:1996? And their status in the ISO 9 standard itself? Thanks. -- Philippe.
Re: Are Latin and Cyrillic essentially the same script?
On 22 Nov 2010, at 18:55, Asmus Freytag wrote: That seems to be true for IPA as well - because already, if you use the font binding for IPA, your a's and g's will not come out right, which means you don't even have to worry about betas and chis. Not so. There is already a convention (going back to the late 19th or early 20th century) about handling this. In an ordinary Times-like font, a slopes and loses its hat when italicized. In an ordinary Times-like font, ɑ is replaced by an italic Greek α (alpha). Michael Everson * http://www.evertype.com/
Re: Are Latin and Cyrillic essentially the same script?
On 19 Nov 2010, at 07:15, Peter Constable wrote: And while IPA is primarily based on Latin script, not all of its characters are Latin characters: bilabial and interdental fricative phonemes are represented using Greek letters beta and theta. IPA beta and chi behave very differently from their Greek antecedents and should not remain unified. The case for theta is messier because theta is so very messy. Michael Everson * http://www.evertype.com/
Re: Are Latin and Cyrillic essentially the same script?
On 19 Nov 2010, at 17:09, Peter Constable wrote: And historic texts aren’t as likely or unlikely to require specialized fonts? Twenty years of historic text in Tatar isn't irrelevant. It's also a notational system that requires specific training in its use, And working with historic texts doesn’t require specific training? Not in terms of Jaŋalif. The training you need there is just learning to read the language in another alphabet. IPA is more complex than that, especially if you go for close transcription. While several orthographies have been based on IPA, my understanding is that some of them saw the encoding of additional characters to make them work as orthographies. Again, I don’t see how that impacts this particular case. This particular case is analogous to the borrowing of Q and W into Cyrillic from Latin. By the way, I understand that there are many people who would like to revert to the Latin orthography for these Turkic languages. At present Russian law forbids this, but it is not the case that one may expect that this orthography will always remain historic. It boils down to this: just as there aren’t technical or usability reasons that make it problematic to represent IPA text using two Greek characters in an otherwise-Latin system, Yes there are. Sorting multilingual text including Greek and IPA transcriptions, for one. The glyph shape for IPA beta is practically unknown in Greek. Latin capital Chi is not the same as Greek capital chi. so also there are no technical or usability reasons I’m aware of why it is problematic to represent this historic Janalif orthography using two Cyrillic characters. They are the same technical and usability reasons which led to the disunification of Cyrillic Ԛ and Ԝ from Latin Q and W. Michael Everson * http://www.evertype.com/
Re: Are Latin and Cyrillic essentially the same script?
On 11/22/2010 4:15 AM, Michael Everson wrote: It boils down to this: just as there aren’t technical or usability reasons that make it problematic to represent IPA text using two Greek characters in an otherwise-Latin system, Yes there are. Sorting multilingual text including Greek and IPA transcriptions, for one. The glyph shape for IPA beta is practically unknown in Greek. Latin capital Chi is not the same as Greek capital chi. so also there are no technical or usability reasons I’m aware of why it is problematic to represent this historic Janalif orthography using two Cyrillic characters. They are the same technical and usability reasons which led to the disunification of Cyrillic Ԛ and Ԝ from Latin Q and W. The sorting problem I think I understand. Because scripts are kept together in sorting, when you have a mixed-script list, you normally override just the sorting for the script to which the (sort-)language belongs. A mixed French-Russian list would use French ordering for the Latin characters, but the Russian words would all appear together (and be sorted according to some generic sort order for Cyrillic characters - except that for a bilingual list, sorting the Cyrillic according to Russian rules might also make sense). Same for a French-Greek list. The Greek characters will be together and sorted either by a generic Greek (script) sort, or a specific Greek (language) sort. When you sort a mixed list of IPA and Greek, the beta and chi will now sort with the Latin characters, in whatever sort order applies for IPA. That means the order of all Greek words in the list will get messed up. It will neither be a generic Greek (script) sort, nor a specific Greek (language) sort, because you can't tailor the same characters two different ways in the same sort. That's the problem I understand is behind the issue with the Kurdish Q and W, and with the character pair proposed for disunification for Janalif. 
Perhaps, it seems, there are some technical problems that would make the support for such mixed-script orthographies not as seamless as for regular orthographies after all. In that case, a decision would boil down to whether these technical issues are significant enough (given the usage). In other words, it becomes a cost-benefit analysis. Duplication of characters (except where their glyphs have acquired a different appearance in the other context) always has a cost in added confusability. Users can select the wrong character accidentally, spoofers can do so intentionally to try to cause harm. But Unicode was never just a list of distinct glyphs, so duplication between Latin and Greek, or Latin and Cyrillic is already widespread, especially among the capitals. Unlike what Michael claims for IPA, the Janalif characters don't seem to have a very different appearance, so there would not be any technical or usability issue there. Minor glyph variations can be handled by standard technologies, like OpenType, as long as the overall appearance remains legible should language binding of a text have gotten lost. That seems to be true for IPA as well - because already, if you use the font binding for IPA, your a's and g's will not come out right, which means you don't even have to worry about betas and chis. IPA being a notation, I would not be surprised to learn that mixed lists with both IPA and other terms are a rare thing. But for Janalif it would seem that mixed Janalif/Cyrillic lists would be rather common, relative to the size of the corpus, even if it's a dead (or currently out of use) orthography. I'd like to see this addressed a bit more in detail by those who support the decision to keep the borrowed characters unified. A./
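The tailoring conflict described above can be sketched with a toy sort. The script classifier below is a deliberately crude stand-in based on block ranges (real collation would use the Unicode Script property and UCA tailorings, e.g. via ICU), but it shows the core problem: a Greek word and an IPA transcription both beginning with U+03B2 β necessarily land in the same group, because one code point cannot be tailored two different ways in the same sort.

```python
# Crude block-based script classifier (illustration only, not real collation).
def script_of(ch: str) -> int:
    cp = ord(ch)
    if 0x0370 <= cp <= 0x03FF:    # Greek and Coptic block
        return 1
    if 0x0400 <= cp <= 0x052F:    # Cyrillic + Cyrillic Supplement
        return 2
    return 0                      # everything else counted as Latin here

def sort_key(word: str):
    # Group words by the script of their first letter, then by raw code point.
    return (script_of(word[0]), word)

# A Latin word, a Greek word, and an IPA transcription starting with β.
words = ["beta", "\u03b2\u03ac\u03c3\u03b7", "\u03b2\u0251t"]
print(sorted(words, key=sort_key))
```

The IPA entry is pulled into the Greek group and interleaves with the Greek words there, in an order that follows neither Greek nor IPA conventions.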
Re: Are Latin and Cyrillic essentially the same script?
On 11/18/2010 11:15 PM, Peter Constable wrote: If you'd like a precedent, here's one: Yes, I think discussion of precedents is important - it leads to the formulation of encoding principles that can then (hopefully) result in more consistency in future encoding efforts. Let me add the caveat that I fully understand that character encoding doesn't work by applying cook-book style recipes, and that principles are better phrased as criteria for weighing a decision rather than as formulaic rules. With these caveats, then: IPA is a widely-used system of transcription based primarily on the Latin script. In comparison to the Janalif orthography in question, there is far more existing data. Also, whereas that Janalif orthography is no longer in active use--hence there are not new texts to be represented (there are at best only new citations of existing texts), IPA is a writing system in active use with new texts being created daily; thus, the body of digitized data for IPA is growing much more than is data in the Janalif orthography. And while IPA is primarily based on Latin script, not all of its characters are Latin characters: bilabial and interdental fricative phonemes are represented using Greek letters beta and theta. IPA has other characteristics in both its usage and its encoding that you need to consider to make the comparison valid. First, IPA requires specialized fonts because it relies on glyphic distinctions that fonts not designed for IPA use will not guarantee. (Latin a with and without hook, g with hook vs. two stories are just two examples). It's also a notational system that requires specific training in its use, and it is caseless - in distinction to ordinary Latin script. While several orthographies have been based on IPA, my understanding is that some of them saw the encoding of additional characters to make them work as orthographies. 
Finally, IPA, like other phonetic notations, uses distinctions between letter forms on the character level that would almost always be relegated to styling in ordinary text. Because of these special aspects of IPA, I would class it in its own category of writing systems, which makes it less useful as a precedent against which to evaluate general Latin-based orthographies. Given a precedent of a widely-used Latin writing system for which it is considered adequate to have characters of central importance represented using letters from a different script, Greek, it would seem reasonable if someone made the case that it's adequate to represent an historic Latin orthography using Cyrillic soft sign. I think the question can and should be asked, what is adequate for a historic orthography. (I don't know anything about the particulars of Janalif, beyond what I read here, so for now, I accept your categorization of it as if it were fact). The precedent for historic orthographies is a bit uneven in Unicode. Some scripts have extensive collections of characters (even duplicates or near duplicates) to cover historic usage. Other historic orthographies cannot be fully represented without markup. And some are now better supported than at the beginning because the encoding has plugged certain gaps. A helpful precedent in this case would be that of another minority or historic orthography, or historic minority orthography, for which the use of Greek or Cyrillic characters with Latin was deemed acceptable. I don't think Janalif is totally unique (although the others may not be dead). I'm thinking of the Latin OU that was encoded based on a Greek ligature, and the perennial question of the Kurdish Q and W (Latin borrowings into Cyrillic - I believe these are now 051A and 051C). Again, these may be for living orthographies. 
Against this backdrop, it would help if WG2 (and UTC) could point to agreed-upon criteria that spell out what circumstances should favor, and what circumstances should disfavor, formal encoding of borrowed characters, in the LGC script family or in the general case. That's the main point I'm trying to make here. I think it is not enough to somehow arrive at a decision for one orthography, but it is necessary for the encoding committees to grab hold of the reasoning behind that decision and work out how to apply consistent reasoning like that in future cases. This may still feel a little bit unsatisfactory for those whose proposal is thus becoming the test-case to settle a body of encoding principles, but to that I say, there's been ample precedent for doing it that way in Unicode and 10646. So let me ask these questions: A. What are the encoding principles that follow from the disposition of the Janalif proposal? B. What precedents are these based on, and what precedents are consciously established by this decision? A./
RE: Are Latin and Cyrillic essentially the same script?
From: Asmus Freytag [mailto:asm...@ix.netcom.com] IPA has other characteristics in both its usage and its encoding that you need to consider to make the comparison valid. First, IPA requires specialized fonts because it relies on glyphic distinctions that fonts not designed for IPA use will not guarantee. And historic texts aren’t as likely or unlikely to require specialized fonts? It's also a notational system that requires specific training in its use, And working with historic texts doesn’t require specific training? and it is caseless - in distinction to ordinary Latin script. I could understand how that might be relevant if we were discussing a character borrowed from another script but with different casing behaviour in the original script. (E.g., the character is caseless in the original script, or it is cased but only the lowercase was borrowed and a novel uppercase character was created in the receptor script. This was a valid consideration in the encoding of Lisu, for instance.) I don’t really see how that impacts the discussion in this particular case. While several orthographies have been based on IPA, my understanding is that some of them saw the encoding of additional characters to make them work as orthographies. Again, I don’t see how that impacts this particular case. Finally, IPA, like other phonetic notations, uses distinctions between letter forms on the character level that would almost always be relegated to styling in ordinary text. And again, I don’t see how this impacts the particular case under discussion. Because of these special aspects of IPA, I would class it in its own category of writing systems which makes it less useful as a precedent against which to evaluate general Latin-based orthographies. Perhaps in general it cannot serve as a precedent for all things. But as noted, I think several of the things you noted have no particular bearing in this case. 
For the specific issue of borrowing a character from another script in a historic orthography, I think it’s a perfectly valid precedent. It boils down to this: just as there aren’t technical or usability reasons that make it problematic to represent IPA text using two Greek characters in an otherwise-Latin system, so also there are no technical or usability reasons I’m aware of why it is problematic to represent this historic Janalif orthography using two Cyrillic characters. Btw, I suspect that calling these Latin characters is completely revisionist: if we could ask anyone that taught or used this orthography in 1930 about these characters, I suspect they would say that they are Cyrillic characters. I think the question can and should be asked, what is adequate for a historic orthography. Clearly you’re trying to have a discussion about general principles, not about the specific characters. At the moment, I’m prepared to discuss general principles to the extent that they impinge on the particular case at hand. Others may wish to engage in a broader discussion of general principles (though, hopefully, under a different subject). Against this backdrop, it would help if WG2 (and UTC) could point to agreed-upon criteria that spell out what circumstances should favor, and what circumstances should disfavor, formal encoding of borrowed characters, in the LGC script family or in the general case. That's the main point I'm trying to make here. I think it is not enough to somehow arrive at a decision for one orthography, but it is necessary for the encoding committees to grab hold of the reasoning behind that decision and work out how to apply consistent reasoning like that in future cases. These are not unreasonable requests. I don’t see any inconsistency in practice as it relates to this particular case, however. So let me ask these questions: A. What are the encoding principles that follow from the disposition of the Janalif proposal? 
I think one principle is that we do not always have to maintain a principle of orthographic script purity. In particular, in the case of historic orthographies no longer in active use that borrowed characters from another script in the LGC family, if there are no technical or usability reasons that make it problematic to represent those text elements using existing characters from the source script, then it is not necessary to encode equivalents in the receptor script so that we can say that the historic orthography is a pure-Latin / pure-Greek / pure-Cyrillic orthography (which, in terms of social history rather than character encoding, would likely be a revisionist perspective). B. What precedents are these based on resp. what precedents are consciously established by this decision? I'm not sure I fully understand the question so won't venture a comment. Peter
RE: Are Latin and Cyrillic essentially the same script?
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. Peter
Re: Are Latin and Cyrillic essentially the same script?
On 11/18/2010 8:04 AM, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. When one language borrows a word from another, there are several stages of foreignness, ranging from treating the foreign word as a short quotation in the original language to treating it as essentially fully native. Now words are very complex in behavior and usage compared to characters. You can check pronunciation, spelling and adaptation to the host grammar to determine which stage of adaptation a word has reached. When a script borrows a letter from another, you are essentially limited in what evidence you can use to document objectively whether the borrowing has crossed over the script boundary and the character has become native. With typographically closely related scripts, getting tell-tale typographical evidence is very difficult. After all, these scripts started out from the same root. So, you need some other criteria. You could individually compare orthographies and decide which ones are important enough (or established enough) to warrant support. Or you could try to distinguish between orthographies for general use within the given language, vs. other systems of writing (transcriptions, say). But whatever you do, you should be consistent and take account of existing precedent. 
There are a number of characters encoded as nominally Latin in Unicode that are borrowings from other scripts, usually Greek. A discussion of the current issue should include explicit explanation of why these precedents apply or do not apply, and, in the latter case, why some precedents may be regarded as examples of past mistakes. By explicitly analyzing existing precedents, it should be possible to avoid the impression that the current discussion is focused on the relative merits of a particular orthography based on personal and possibly arbitrary opinions by the work group experts. If it can be shown that all other cases where such borrowings were accepted into Unicode are based on orthographies that are more permanent, more widespread or both, or where other technical or typographical reasons prevailed that are absent here, then it would make any decision on the current request seem a lot less arbitrary. I don't know where the right answer lies in the case of Janalif, or which point of view, in Peter's phrasing, would make the most sense, but having this discussion without clear understanding of the precedents will lead to inconsistent encoding. A./
pupil's comment: Are Latin and Cyrillic essentially the same script?
Dear all, I still see myself as a pupil reading the introduction chart of Unicode, but I am happy to join the discussion on Russian: it is quite different from Latin. Apart from the 33 characters in the Russian alphabet (= more characters) and apart from quite a few characters that as an English speaker you clearly do not know, Latin and Russian indeed contain some similar characters. But watch out. There are, if I am correct, 3 a's in the world; in this email a (Latin) looks like a (Russian) but they are different. So the Russian a is quite suited for a homograph attack (I will try ontslag.com, which is Dutch for dismissal.com, to see how search engines react. With a Russian a. The Punycode of the word as a whole is different). Similar example: the Ukrainian i - it looks like ours, but you can't register it on .rf (Russian Federation). An experiment 1 year ago with Reïntegratie.com, being correct Dutch for reintegration but impossible as a domain name because SIDN.nl (supposed to be nic.nl) is very conservative and does not even allow such signs, gave as result: in the beginning Google appreciated it; after a few months the hosted and filled site 'sank'. (I borrowed the ï from Catalan, amidst Latin characters.) 
News about ss / sz for whomever is interested: most Germans were alert (ss-holders had priority for ß), so no Fußball for me, only the experimental domain names IDNexpress.de and IDNexpreß.de. It was a mini-landrush from Nov. 16, 2010, 10:00 German time onwards (Denic.de). Very busy with the .rf auction now; in December I will put 2 different sites on these ss and sz names so people can wonder at their screens to see what is happening. The above reaction came more out of domain names and practical experience than chart UTFxyz - but definitely: a different script. Br, Philippe On 18-11-2010 20:04, Asmus Freytag wrote: On 11/18/2010 8:04 AM, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. When one language borrows a word from another, there are several stages of foreignness, ranging from treating the foreign word as a short quotation in the original language to treating it as essentially fully native. Now words are very complex in behavior and usage compared to characters. You can check pronunciation, spelling and adaptation to the host grammar to determine which stage of adaptation a word has reached. When a script borrows a letter from another, you are essentially limited in what evidence you can use to document objectively whether the borrowing has crossed over the script boundary and the character has become native. 
With typographically closely related scripts, getting tell-tale typographical evidence is very difficult. After all, these scripts started out from the same root. So, you need some other criteria. You could individually compare orthographies and decide which ones are important enough (or established enough) to warrant support. Or you could try to distinguish between orthographies for general use within the given language, vs. other systems of writing (transcriptions, say). But whatever you do, you should be consistent and take account of existing precedent. There are a number of characters encoded as nominally Latin in Unicode that are borrowings from other scripts, usually Greek. A discussion of the current issue should include explicit explanation of why these precedents apply or do not apply, and, in the latter case, why some precedents may be regarded as examples of past mistakes. By explicitly analyzing existing precedents, it should be possible to avoid
RE: Are Latin and Cyrillic essentially the same script?
If you'd like a precedent, here's one: IPA is a widely-used system of transcription based primarily on the Latin script. In comparison to the Janalif orthography in question, there is far more existing data. Also, whereas that Janalif orthography is no longer in active use--hence there are not new texts to be represented (there are at best only new citations of existing texts), IPA is a writing system in active use with new texts being created daily; thus, the body of digitized data for IPA is growing much more than is data in the Janalif orthography. And while IPA is primarily based on Latin script, not all of its characters are Latin characters: bilabial and interdental fricative phonemes are represented using Greek letters beta and theta. Given a precedent of a widely-used Latin writing system for which it is considered adequate to have characters of central importance represented using letters from a different script, Greek, it would seem reasonable if someone made the case that it's adequate to represent an historic Latin orthography using Cyrillic soft sign. Peter -Original Message- From: Asmus Freytag [mailto:asm...@ix.netcom.com] Sent: Thursday, November 18, 2010 11:05 AM To: Peter Constable Cc: André Szabolcs Szelp; Karl Pentzlin; unicode@unicode.org; Ilya Yevlampiev Subject: Re: Are Latin and Cyrillic essentially the same script? On 11/18/2010 8:04 AM, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. 
It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. When one language borrows a word from another, there are several stages of foreignness, ranging from treating the foreign word as a short quotation in the original language to treating it as essentially fully native. Now words are very complex in behavior and usage compared to characters. You can check pronunciation, spelling and adaptation to the host grammar to determine which stage of adaptation a word has reached. When a script borrows a letter from another, you are essentially limited in what evidence you can use to document objectively whether the borrowing has crossed over the script boundary and the character has become native. With typographically closely related scripts, getting tell-tale typographical evidence is very difficult. After all, these scripts started out from the same root. So, you need some other criteria. You could individually compare orthographies and decide which ones are important enough (or established enough) to warrant support. Or you could try to distinguish between orthographies for general use within the given language, vs. other systems of writing (transcriptions, say). But whatever you do, you should be consistent and take account of existing precedent. There are a number of characters encoded as nominally Latin in Unicode that are borrowings from other scripts, usually Greek. A discussion of the current issue should include explicit explanation of why these precedents apply or do not apply, and, in the latter case, why some precedents may be regarded as examples of past mistakes. 
By explicitly analyzing existing precedents, it should be possible to avoid the impression that the current discussion is focused on the relative merits of a particular orthography based on personal and possibly arbitrary opinions by the work group experts. If it can be shown that all other cases where such borrowings were accepted into Unicode are based on orthographies that are more permanent, more widespread or both, or where other technical or typographical reasons prevailed that are absent here, then it would make any decision on the current request seem a lot less arbitrary. I don't know where the right answer lies in the case of Janalif, or which point of view, in Peter's phrasing, would make the most sense, but having this discussion without clear understanding of the precedents will lead to inconsistent encoding. A./
Re: Are Latin and Cyrillic essentially the same script?
AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, like the Jangalif character. Function, as you point out, is not a distinctive feature. The different serif style which you pointed out cannot be seen as a discriminating feature of character identity, especially not in a time of bad typography (and an actual lack of Latin typographic tradition in the China of the time). /Sz On Wed, Nov 10, 2010 at 5:08 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf = L2/10-356, there exists a Latin letter which resembles the Cyrillic soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif variant of the alphabet, which was used for several languages in the former Soviet Union (e.g. Tatar), and was developed in parallel to the alphabet nowadays in use for Turkish and Azerbaijani, see: http://en.wikipedia.org/wiki/Janalif . In fact, it was proposed on this basis, being the only Jaꞑalif letter missing so far, since the ꞑ (occurring in the alphabet name itself) was introduced with Unicode 6.0. The letter is not a soft sign; it is the exact Tatar equivalent of the Turkish dotless i, thus it has a use similar to that of the Cyrillic yeru Ы/ы (U+042B/U+044B). In this function, it is a part of the adaptation of the Latin alphabet for a lot of non-Russian languages in the Soviet Union in the 1920s, see e.g.: Юшманов, Н. В.: Определитель Языков. Москва/Ленинград 1941, http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 . (A proposal regarding this subject is expected for 2011.) Thus, it shares with the Cyrillic soft sign its form and partly the geographical area of its use, but in no case its meaning. The same can be said, e.g., 
for P/p (U+0050/U+0070, Latin letter P) and Р/р (U+0420/U+0440, Cyrillic letter ER). According to the pre-preliminary minutes of UTC #125 (L2/10-415), the UTC has not accepted the Latin Ь/ь. It is an established practice for the European alphabetic scripts to encode a new letter only if it has a different shape (in at least one of the capital and small forms) relative to all already encoded letters of the same script. The Y/y is well known to denote completely different pronunciations, used as a consonant as well as a vowel, even within the same language. Thus, if somebody unearths a Latin letter E/e in some obscure minority language which has no E-like vowel, used to denote an M-like sound and in fact collated after the M in the local alphabet, this will probably not lead to a new encoding. But Latin and Cyrillic are different scripts (the question in the subject of this mail is rhetorical, of course). Admittedly, there is also a precedent for using Cyrillic letters in Latin text: the use of U+0417/U+0437 and U+0427/U+0447 as tone letters in Zhuang. However, the orthography using them was short-lived, being superseded by another Latin orthography which uses genuine Latin letters as tone marks (J/j and X/x, in this case). On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь did not lose the Ь/ь through an improvement of the orthography, but were completely deprecated by an ukase of Stalin. Thus, they continue to be the Latin alphabets of the respective languages. Whether a revival is formally requested or not, they are regarded as valid by the members of the cultural group (even if only to access their cultural heritage). Especially, it cannot be excluded that people want to create Latin domain names or e-mail addresses without being accused of script mixing. Taking this into account, not to mention the technical problems regarding collation etc.
and the typographical issues when it comes to subtle differences between Latin and Cyrillic in high-quality typography, it is really hard to understand why the UTC refuses to encode the Latin Ь/ь. A quick glance at the Юшманов table mentioned above shows that there is absolutely no request to duplicate the whole Cyrillic alphabet in Latin, as someone may have feared. - Karl Pentzlin -- Szelp, André Szabolcs +43 (650) 79 22 400
Are Latin and Cyrillic essentially the same script?
As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf = L2/10-356, there exists a Latin letter which resembles the Cyrillic soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif variant of the alphabet, which was used for several languages in the former Soviet Union (e.g. Tatar), and was developed in parallel to the alphabets nowadays in use for Turkish and Azerbaijani, see: http://en.wikipedia.org/wiki/Janalif . In fact, it was proposed on this basis, being the only Jaꞑalif letter missing so far, since the ꞑ (occurring in the alphabet name itself) was introduced with Unicode 6.0. The letter is no soft sign; it is the exact Tatar equivalent of the Turkish dotless i, thus it has a use similar to that of the Cyrillic yeru Ы/ы (U+042B/U+044B). In this function, it is part of the adaptation of the Latin alphabet for many non-Russian languages of the Soviet Union in the 1920s, see e.g.: Юшманов, Н. В.: Определитель Языков. Москва/Ленинград 1941, http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 . (A proposal regarding this subject is expected for 2011.) Thus, it shares with the Cyrillic soft sign its form and partly the geographical area of its use, but in no case its meaning. The same can be said e.g. for P/p (U+0050/U+0070, Latin letter P) and Р/р (U+0420/U+0440, Cyrillic letter ER). According to the pre-preliminary minutes of UTC #125 (L2/10-415), the UTC has not accepted the Latin Ь/ь. It is an established practice for the European alphabetic scripts to encode a new letter only if it has a different shape (in at least one of the capital and small forms) relative to all already encoded letters of the same script. The Y/y is well known to denote completely different pronunciations, used as a consonant as well as a vowel, even within the same language.
Thus, if somebody unearths a Latin letter E/e in some obscure minority language which has no E-like vowel, used to denote an M-like sound and in fact collated after the M in the local alphabet, this will probably not lead to a new encoding. But Latin and Cyrillic are different scripts (the question in the subject of this mail is rhetorical, of course). Admittedly, there is also a precedent for using Cyrillic letters in Latin text: the use of U+0417/U+0437 and U+0427/U+0447 as tone letters in Zhuang. However, the orthography using them was short-lived, being superseded by another Latin orthography which uses genuine Latin letters as tone marks (J/j and X/x, in this case). On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь did not lose the Ь/ь through an improvement of the orthography, but were completely deprecated by an ukase of Stalin. Thus, they continue to be the Latin alphabets of the respective languages. Whether a revival is formally requested or not, they are regarded as valid by the members of the cultural group (even if only to access their cultural heritage). Especially, it cannot be excluded that people want to create Latin domain names or e-mail addresses without being accused of script mixing. Taking this into account, not to mention the technical problems regarding collation etc. and the typographical issues when it comes to subtle differences between Latin and Cyrillic in high-quality typography, it is really hard to understand why the UTC refuses to encode the Latin Ь/ь. A quick glance at the Юшманов table mentioned above shows that there is absolutely no request to duplicate the whole Cyrillic alphabet in Latin, as someone may have feared. - Karl Pentzlin
Re: Are Latin and Cyrillic essentially the same script?
2010-11-10 10:08, I wrote: KP As shown in N3916 ... Please read vowel instead of vocal throughout the mail. Sorry.
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Thank you for replying. On Saturday, 7 August 2010, Doug Ewell d...@ewellic.org wrote: I think the alternate ending glyph is supposed to be specified in more detail than that. The example Asmus gave was U+222A UNION with serifs. Even though the exact proportions of the serifs may differ from one font to the next, this is still a relatively precise and constrained definition, unlike Latin small letter e with some 'alternate ending' which is completely up to the discretion of the font designer. Because of stylistic differences among calligraphers—this is a calligraphy question, not a poetry question—it is hard to imagine how this aspect of the proposal would not result in an unbounded number of glyphic variations. 'e' is not the only letter to which calligraphers like to attach special endings, and a swash cross-stroke is not the only special ending that calligraphers like to attach to 'e'. It seems to me that there are at least two ways to have an alternate ending e. One is to extend the cross-stroke to the right beyond the e and end the extension with a flourish of some sort, another is to extend the lower line out to the right and end that extension in some way. I can imagine that a proposal would lead to wanting to be able to express a choice of the two, or more, possible variants of a letter, should the font have alternate glyphs of both types. Then there is the question of what is to happen if the requested one is not available in the font: does the other alternate glyph become displayed or does the basic character glyph become displayed? I'd like to see an FAQ page on What is Plain Text? written primarily by UTC officers. That might go a long way toward resolving the differences between William's interpretation of what plain text is, which people like me think is too broad, and mine, which some people have said is too narrow. That is a good idea. Thank you also for the careful precision with which you describe the situation of who thinks what. 
Yet is producing such a document an impossible task? Some years ago there was a suggestion on this mailing list to produce a Frequently Asked Questions (FAQ) page about what should not be encoded. Is the document that is now suggested effectively the same thing? I thought of an analogy of trying to produce a FAQ document on What is art?. Such a document produced in 1550 might well have been very different from one produced in 1910, and those different again from one produced in 1995 or 2010. Maybe the analogy is not perfect, but to me it conveys that if a What is Plain Text? document is produced, with a view to deciding what could and could not be encoded in Unicode as plain text in the future, then it could quickly become either out of date or a restriction on progress in technology. The recent encoding of the emoticons shows a dramatic change in what can be encoded as plain text compared with the situation some years ago. Some of my ideas have been refuted as not being suitable for encoding in plain text. Yet the refutation all seems to be based on unchangeable rules from about twenty years ago, and change is part of progress. I remember once being referred, in this mailing list, to an ISO document about encoding. The document made reference to a definition of character within the same document. The document was ISO/IEC TR 15285. I have found that the document is available here (the link used at the previous time no longer works). http://openstandards.dk/jtc1/sc2/wg2/docs/TR%2015285%20-%20C027163e.pdf The introduction includes the following. quote This Technical Report is written for a reader who is familiar with the work of SC 2 and SC 18. Readers without this background should first read Annex B, “Characters”, and Annex C, “Glyphs”. end quote Annex B has the following.
quote In ISO/IEC 10646-1:1993, SC 2 defines a character as: A member of a set of elements used for the organisation, control, and representation of data. end quote On the accessing of alternate glyphs from plain text, I feel that, since there are 256 variation selectors that could be used with each of the Latin letters, and provided that no harm is done to those who choose not to use them, some sequences should be encoded so that alternate glyphs can be accessed from fonts. Some readers might find the following of interest. http://forum.high-logic.com/viewtopic.php?f=36t=2229 It is a thread entitled An unusual glyph of an Esperanto character in the Arno font. I had been looking through the following document. http://store1.adobe.com/type/browser/pdfs/ARNP/ArnoPro-Italic.pdf I had found an alternate ending glyph for the h circumflex character and had then tried to produce some text where it could be used. I felt that it was a situation of typography inspiring creative writing. Readers who enjoyed that thread might also
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Aug 7, 2010, at 10:40 AM, Doug Ewell wrote: I'd like to see an FAQ page on What is Plain Text? written primarily by UTC officers. That might go a long way toward resolving the differences between William's interpretation of what plain text is, which people like me think is too broad, and mine, which some people have said is too narrow. Well, we do have http://www.unicode.org/faq/ligature_digraph.html#10 and related FAQs? The basic idea is that plain text is the minimum amount of information to process the given language in a normal way. FOR EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY ISN'T, AND DOING IT LOOKS WRONG. We therefore have both upper- and lower-case letters for English. On the other hand, although English *is* usually written with some facility to provide emphasis, different media have different ways of providing that facility (asterisks, underlining, italicizing), and English written without any of these looks perfectly fine. Arabic, on the other hand, absolutely must have some way of allowing for different letter shapes in different contexts, or it looks just wrong, so Arabic plain text must have facility to allow for that, either by explicitly having different characters for the different shapes the letters take, or by providing a default layout algorithm that defines them. Beyond rendering, there are also considerations as to the minimal amount of information necessary for other text-based processes, such as sorting, searching, and text-to-speech. Yes, there are issues which end up being judgment calls, and it's easy to come up with cases where you can't really capture the full semantic intent of the author without what Unicode calls rich text. My favorite example is The Mouse's Tale in _Alice in Wonderland_. Plain text isn't intended to capture all the nuances of the original's semantics, but to provide at the least a very close approximation. 
Variation selectors are intended to cover cases where more information is needed for rendering than is required for other processes such as searching (Mongolian), or cases where different user communities disagree on whether two forms must be unified or must be deunified. = Hoani H. Tinikini John H. Jenkins jenk...@apple.com
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
John H. Jenkins wrote: The basic idea is that plain text is the minimum amount of information to process the given language in a normal way. That's a bit vague. We don't normally process languages; we read texts. Whether font or color variation is essential for understanding really depends on the author's purposes and choices, not on language, FOR EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY ISN'T, AND DOING IT LOOKS WRONG. I wouldn't say it looks wrong. Surely it is often typographically poor or just stupid, but it might be a consequence of technical limitations (there are still loads of systems that make no case distinction in texts, so in any relevant aspect, they are effectively uppercase-only), and all-caps English is quite understandable, though boring to read, provided that some precautions are taken by writers. We therefore have both upper- and lower-case letters for English. It's just a distinction that you _can_ (and usually do) make in plain text English. It's not an inherent distinction: all-caps English is still English, though poorly written by modern standards. Arabic, on the other hand, absolutely must have some way of allowing for different letter shapes in different contexts, or it looks just wrong, so Arabic plain text must have facility to allow for that, either by explicitly having different characters for the different shapes the letters take, or by providing a default layout algorithm that defines them. But layout algorithms are not part of character encoding or part of the definition of plain text. It's not OK to render plain text Arabic, encoded at the logical level (i.e., letters encoded abstractly and not as contextual forms), in a simplistic manner that uses a one letter - one glyph model. But that's not part of the definition of plain text at all.
Yes, there are issues which end up being judgment calls, and it's easy to come up with cases where you can't really capture the full semantic intent of the author without what Unicode calls rich text. We don't need to invent contrived examples for that. Every time an author uses italics or bolding to make an essential point in emphasizing something he does something that cannot be captured in a plain version of the text. To make an even simpler point, if you insert an essential content image into a document you step outside the realm of plain text. I don't see any better definition for plain text than a negative one: it is text without formatting, except to the extent that forced line breaks and the choice of alternative forms for a character (to the extent that such differences are encoded in the character code) can be considered as formatting. Plain text, though apparently a very simple concept, is a very abstract one. I don't think you can explain the concept to your neighbor while standing on one foot, if at all. Human writing did not originate as plain text, and at the surface level, it is never plain text: it always has some specific physical appearance, and abstract plain text can only be found below the surface, as the underlying data format where only character identities (character numbers in a specific code) are encoded, with no reference to a particular rendering. -- Yucca, http://www.cs.tut.fi/~jkorpela/
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
karl-pentz...@acssoft.de wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters There are 256 selectors, but the proposal only suggests numbering up to 16, effectively deprecating the others. Surely we want all 256? The Mongolian selectors alter the appearance of the glyph displayed after the character has been evaluated for position in the word and a series of complex rules applied. The user will normally only have to use the selectors in exceptional cases. The selectors are only valid in certain positional cases and have been somewhat arbitrarily assigned. It is not the case that selector 1 selects the same alternative form in all positions. A typical user will see most of the variations in use arise from the built-in rules being applied. There is no user entity which would be considered variant 1 and which is used by a separate community. I regard the proposal to give names like VARIANT-M1 as confusing, as they have no basis in reality. I also have some concerns from a security point of view, as the proposal makes variation selectors valid for Latin characters for the first time. The selectors which produce a default behaviour, or make one character look like another already encoded, seem unneeded and introduce yet more clones of common characters. I also have concerns about the proposal to give the non-ideographic variants names like VARIANT-1. Surely it is possible to give them descriptive names which would make it easier to understand what is meant? It is not as if we will have thousands of these. Some parts of the proposal have merit, but I would urge the UTC to hold a public consultation on the matter to allow more time for feedback to be gathered. Tim Partridge
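Tim's count of 256 selectors is concrete in the standard: VS1 through VS16 sit at U+FE00..U+FE0F, and VS17 through VS256 sit at U+E0100..U+E01EF. A minimal sketch of the mapping, using only the Python standard library (the helper function name is my own, not anything from the proposal):

```python
import unicodedata

# The 256 variation selectors live in two blocks:
#   VS1..VS16   -> U+FE00..U+FE0F
#   VS17..VS256 -> U+E0100..U+E01EF
def variation_selector(n: int) -> str:
    """Return the character for VARIATION SELECTOR-n (n in 1..256)."""
    if not 1 <= n <= 256:
        raise ValueError("variation selectors are numbered 1..256")
    if n <= 16:
        return chr(0xFE00 + n - 1)
    return chr(0xE0100 + n - 17)

# The character names confirm the numbering:
assert unicodedata.name(variation_selector(1)) == "VARIATION SELECTOR-1"
assert unicodedata.name(variation_selector(16)) == "VARIATION SELECTOR-16"
assert unicodedata.name(variation_selector(17)) == "VARIATION SELECTOR-17"
assert unicodedata.name(variation_selector(256)) == "VARIATION SELECTOR-256"
```

Note that which sequences a selector may appear in is not free-form: valid sequences are those registered in the standard's variation-sequence data files.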
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Thank you for replying. On Friday 6 August 2010, Asmus Freytag asm...@ix.netcom.com wrote: What you mean are artistic or stylistic variants. These have certain problems, see here for an explanation: http://www.unicode.org/forum/viewtopic.php?p=221#p221 A./ I have read and reread the forum post to which you refer. I cannot understand from that text, or otherwise at the time of writing this reply, why it would not be possible to have an alternate ending glyph for a letter e accessible from plain text using an advanced font technology font (for example, an OpenType font) using the two character sequence U+0065 U+FE0F. The specific design of an alternate ending e glyph would vary from font to font, yet that it is an alternate ending e would be clear: the encoding U+0065 U+FE0F would allow the intention that an alternate ending glyph for a letter e is requested to be carried within a plain text document. I accept that I might be missing something here. If so I would be happy to learn: at the moment, however, it still seems to me to be a good idea for an encoding. William Overington 7 August 2010
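William's proposed sequence was never standardized (U+FE0F, VARIATION SELECTOR-16, is today registered mainly for emoji presentation sequences), but what it would look like at the code point level is easy to show. A sketch, with the sequence itself hypothetical:

```python
import unicodedata

# Hypothetical sequence from the proposal: e followed by VARIATION SELECTOR-16.
seq = "\u0065\uFE0F"
plain = "e"

# Two code points, base letter first:
assert [f"U+{ord(c):04X}" for c in seq] == ["U+0065", "U+FE0F"]

# Variation selectors survive canonical normalization, so the sequence
# stays distinct from the bare letter; a search process that should
# treat the two alike must strip the selectors itself.
assert unicodedata.normalize("NFC", seq) != plain
stripped = "".join(c for c in seq if not ("\uFE00" <= c <= "\uFE0F"))
assert stripped == plain
```

This is part of what makes such proposals contentious: every process that compares or searches text, not just the renderer, has to decide what to do with the selector.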
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Thank you for replying. On Friday 6 August 2010, John H. Jenkins jenk...@apple.com wrote: This is another case of a solution in search of a problem. No, the problem is that one cannot at present, as far as I know, access alternate glyphs of an advanced format font from a plain text file. It isn't Unicode's business to advance typography, and in any event, typesetting plain text isn't the path to good typography. Those are interesting claims. I hope that if Unicode can advance typography by providing a facility such as I am suggesting that it would be pleased to do so. Other technologies, such as OpenType, AAT, and Graphite, *do* have the job of making good typography easy and accessible. Fonts are an important part of the whole process. And, mirabile dictu, they can already do what you are suggesting here for plain text. I am unaware of how an application program using an OpenType font can be made to display alternate glyphs requested from a plain text file. Can it be done? Unicode's responsibility is to deal with existing needs. Well, for me it is a need to be able to request the display of an alternate glyph of an advanced format font from a plain text file. If it is common for poets to use various letter shapes at the end of words to convey some semantic meaning, and if they do this in their emails or tweets, or if they're complaining that this is something that they want to do but can't, then Unicode and plain text provide a proper way to help them. Alas, a paradox. If the facility becomes available, they might well use it. Yet, unlike a ROASTED SWEET POTATO glyph becoming available on some mobile telephones then later becoming encoded in Unicode because it was available on some mobile telephones, it is not, as far as I am presently aware, possible for that to happen in relation to requesting an alternate ending glyph for a letter e from a plain text file whilst still producing an ordinary e if that request cannot be fulfilled by the particular font being used. 
Fonts themselves are used to convey semantic meaning. I am unsure of quite how it all works, yet it seems to work partly by association with cultural knowledge of where fonts or handwriting or signwriting of that type have been used previously and partly with design aspects of the font, such as angularity or smoothness or ornateness and perhaps other factors as well. William Overington 7 August 2010
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote: I cannot understand from that text, or otherwise at the time of writing this reply, why it would not be possible to have an alternate ending glyph for a letter e accessible from plain text using an advanced font technology font (for example, an OpenType font) using the two character sequence U+0065 U+FE0F. The specific design of an alternate ending e glyph would vary from font to font, yet that it is an alternate ending e would be clear: the encoding U+0065 U+FE0F would allow the intention that an alternate ending glyph for a letter e is requested to be carried within a plain text document. I think the alternate ending glyph is supposed to be specified in more detail than that. The example Asmus gave was U+222A UNION with serifs. Even though the exact proportions of the serifs may differ from one font to the next, this is still a relatively precise and constrained definition, unlike Latin small letter e with some 'alternate ending' which is completely up to the discretion of the font designer. Because of stylistic differences among calligraphers—this is a calligraphy question, not a poetry question—it is hard to imagine how this aspect of the proposal would not result in an unbounded number of glyphic variations. 'e' is not the only letter to which calligraphers like to attach special endings, and a swash cross-stroke is not the only special ending that calligraphers like to attach to 'e'. I'd like to see an FAQ page on What is Plain Text? written primarily by UTC officers. That might go a long way toward resolving the differences between William's interpretation of what plain text is, which people like me think is too broad, and mine, which some people have said is too narrow. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Michael Everson On 6 Aug 2010, at 22:20, Karl Pentzlin wrote: On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. As I have outlined in the revised introduction of my proposal, there are *no* implications for Serbian orthography. Admittedly, this was a little bit implicit in my first draft. Yeah, well, I am not convinced of the merits of your proposal. Sorry. I am not convinced either. Because all this proposal is supposed to solve is to allow an automated change of orthography, so that SOME long s in old documents using Fraktur style will become round s in some other intermediate style (like Antiqua), and then all of them will become round s later. It's a matter of orthographic adaptation, i.e. modernization of old texts. But any modernization of old orthographies implies more than just changing some glyphs. For example, the modernization of medieval French texts requires knowing when a text was written (to correctly infer its semantics), then knowing for which period of time the modernized version was created, and then knowing what other orthographic changes were necessary, such as replacing s (long or round) with circumflexes, or changing tildes into circumflexes or newer (distinct) modern accents, or dropping some other letters. Unicode is not made to adapt to orthographic changes. My opinion is that it just has to encode the orthography AS IT IS, ignoring all possible other adaptations due to modernization (and the evolution of the written language). In other words, the existing long s and common round s are just enough to preserve the original orthography and its semantics, as they were in the original text (even if it was ambiguous or incoherent).
The variation selectors are not intended to convey the additional semantics needed for adaptations to newer orthographies, but ONLY the additional semantics that existed in a written language at the time when it was effectively written. Text modernizers will really need something else, notably lexical and grammatical analysis (under human supervision), and these are completely out of scope for Unicode and ISO 10646. They work by effectively correcting the text, i.e. changing its original orthography and semantics. This process is mostly like many transliteration schemes or like all translation processes: the resulting text is obviously different and intended for different readers. The only case where we really need variation selectors is when we can demonstrate that there are opposable pairs where a glyphic variant (within a unified abstract character) in the SAME text by the SAME author conveys a distinct semantic. For everything else, variation selectors should not be used at all, and an encoded round s will still mean the same, even if it's rendered with a Fraktur font or a Bodoni- or Antiqua-like font. Philippe.
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
verdy_p verdy underscore p at wanadoo dot fr wrote: I am not convinced too. Because all what this proposal is supposed to solve is to allow an automted change of orthography so that SOME long s in old doucments using Fraktur style will become round s in some other antermediate style (like Antiqua) and then all of them will become round s later. You missed some e-mails. The long s/round s sequences are gone from the latest proposal. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote: I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. Why can't a poet find a poetic means of doing that, instead of depending on a standards organization to provide a standard means of doing so in plain text? Seems kind of anti-poetic to me. ;-) --Ken Well, I was just suggesting an example. I am not an expert on poetry. It would not be a matter of a poet depending on a standards organization; it would be a matter of a standards organization noting that adding alternate glyphs to fonts is a modern trend, and doing what it can to facilitate access to those alternate glyphs from plain text in a standardized way. For example, suppose that an alternate ending glyph for a letter e is desired at the end of a line of a poem. I am thinking that U+0065 U+FE0F could be used to do that. It seems to me that, as U+0065 U+FE0F is presently unused and there are also other variation selectors not used with U+0065, it would do no harm and would be useful for U+0065 U+FE0F to be officially standardized as requesting an alternate ending glyph for the letter e, while using the ordinary glyph of U+0065 of the font if an alternate ending glyph of the letter e is not available within the font. The standards organizations have a great opportunity to advance typography by defining some of the Latin letter plus variation selector pairs so that alternate glyphs within a font may be accessed directly from plain text. William Overington 6 August 2010
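The fallback William asks for (show the ordinary e when the font has no alternate) matches how unsupported variation selectors are meant to degrade: a process that does not recognize a sequence simply displays the base character. A toy simulation of that renderer behaviour, with the glyph names and the font's sequence table invented for illustration:

```python
# Hypothetical sketch: a renderer that knows some (base, selector) pairs
# uses the alternate glyph, and otherwise silently shows the base letter.
VS16 = "\uFE0F"

def glyphs_for(text: str, supported: set) -> list:
    """Map text to glyph names, honouring supported (base, selector) pairs."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if (ch, nxt) in supported:
            out.append(ch + ".alt")   # font has an alternate glyph
            i += 2
        else:
            if nxt == VS16:
                i += 2                # unsupported sequence: ignore selector
            else:
                i += 1
            out.append(ch)            # default glyph
    return out

font_with_alt = {("e", VS16)}
assert glyphs_for("fin\u0065\uFE0F", font_with_alt) == ["f", "i", "n", "e.alt"]
assert glyphs_for("fin\u0065\uFE0F", set()) == ["f", "i", "n", "e"]
```

The second assertion is the point of the proposal: text carrying the selector still reads as ordinary text on systems that know nothing about it.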
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 2010/08/05 2:56, Asmus Freytag wrote: On 8/2/2010 5:04 PM, Karl Pentzlin wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin This is an interesting proposal to deal with the glyph selection problem caused by the unification process inherent in character encoding. When Unicode was first contemplated, the web did not exist and the expectation was that it would nearly always be possible to specify the font to be used for a given text and that selecting a font would give the correct glyph. The Web may finally get to solve this problem, although it may still take some time to be fully deployed. Please see http://www.w3.org/Fonts/ for more details and pointers. Regards, Martin. -- #-# Martin J. Dürst, Professor, Aoyama Gakuin University #-# http://www.sw.it.aoyama.ac.jp mailto:due...@it.aoyama.ac.jp
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On 8/6/2010 2:03 AM, William_J_G Overington wrote: On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote: I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. Why can't a poet find a poetic means of doing that, instead of depending on a standards organization to provide a standard means of doing so in plain text? Seems kind of anti-poetic to me. ;-) --Ken Well, I was just suggesting an example. I am not an expert on poetry. What you mean are artistic or stylistic variants. These have certain problems, see here for an explanation: http://www.unicode.org/forum/viewtopic.php?p=221#p221 A./
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Aug 6, 2010, at 3:03 AM, William_J_G Overington wrote: The standards organizations have a great opportunity to advance typography by defining some of the Latin letter plus variation selector pairs so that alternate glyphs within a font may be accessed directly from plain text. This is another case of a solution in search of a problem. It isn't Unicode's business to advance typography, and in any event, typesetting plain text isn't the path to good typography. Other technologies, such as OpenType, AAT, and Graphite, *do* have the job of making good typography easy and accessible. And, mirabile dictu, they can already do what you are suggesting here for plain text. Unicode's responsibility is to deal with existing needs. If it is common for poets to use various letter shapes at the end of words to convey some semantic meaning, and if they do this in their emails or tweets, or if they're complaining that this is something that they want to do but can't, then Unicode and plain text provide a proper way to help them. = Hoani H. Tinikini John H. Jenkins jenk...@apple.com
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 02:04, I wrote: KP I have compiled a draft proposal: KP Proposal to add Variation Sequences for Latin and Cyrillic letters In the meantime, I have submitted a final version to the UTC (L2/10-280), as the UTC meeting starts this coming Monday (2010-08-09). For those who do not have access to L2, it is also available at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic.pdf (4.4 MB). Thank you to all who participated in the discussions on this list. Following your suggestions, I have: · dropped the proposed variants for Latin small letter s (addressing Fraktur/Blackletter), as their special aspects are to be handled in a separate proposal (if one is written), · dropped the unspecific variants for Latin small letters a and g, · rewritten substantial parts of the introduction to be more concise at the points which had raised questions on this list and elsewhere. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Friday, 6 August 2010 at 11:08, Martin J. Dürst wrote: MJD The Web may finally get to solve this problem, although it may still MJD take some time to be fully deployed. Please see http://www.w3.org/Fonts/ MJD for more details and pointers. Variation sequences are a means to support this goal, as they provide font developers with a standardized and easily understandable mechanism, one which unburdens font designers as well as the site designers who decide which fonts to offer to the intended users of their content. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. As I have outlined in the revised introduction of my proposal, there are *no* implications for Serbian orthography. Admittedly, this was somewhat implicit in my first draft. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Thursday, 5 August 2010 at 12:31, William_J_G Overington wrote: WO Yet what if one wants to use the precomposed g circumflex character? Searching the text of the Unicode Standard for "canonical equivalence" is helpful in this case, for end users as well as for font designers and for programmers of rendering systems. - Karl Pentzlin
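As a concrete illustration of the canonical equivalence Karl points to: the precomposed ĝ (U+011D) and the sequence g plus U+0302 COMBINING CIRCUMFLEX ACCENT are canonically equivalent, so Unicode normalization converts between them. A minimal Python check:

```python
import unicodedata

precomposed = "\u011D"   # ĝ LATIN SMALL LETTER G WITH CIRCUMFLEX
decomposed = "g\u0302"   # g + COMBINING CIRCUMFLEX ACCENT

# NFC composes the sequence into the precomposed character;
# NFD decomposes the precomposed character back into the sequence.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is why a rendering system can treat the two spellings as the same character for glyph selection purposes.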
Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Wednesday, 4 August 2010 at 22:44, I wrote: KP However, in my next version, I will replace the s variants by long s variants: KP 017F FE00 ...LONG S VARIANT-1 ... STANDARD FORM KP · will be displayed long in any script variants KP 017F FE01 ...LONG S VARIANT-1 FLEXIBLE FORM (naming provisional) KP · will be displayed long in Fraktur, Gaelic, and similar script variants KP · will usually be displayed round when used with Roman type KP This has the advantage that, especially when implicit application of variation sequences KP is possible, it can be applied to existing data without change. In the final version of my proposal, I have completely dropped this, as the subject obviously needs a separate discussion in a separate proposal. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Yeah, well, I am not convinced of the merits of your proposal. Sorry. On 6 Aug 2010, at 22:20, Karl Pentzlin wrote: On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. As I have outlined in the revised introduction of my proposal, there are *no* implications for Serbian orthography. Admittedly, this was somewhat implicit in my first draft. - Karl Pentzlin Michael Everson * http://www.evertype.com/
Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
For the standard form you probably don't need to add a variation selector. The code point for long s itself expresses exactly the semantic of representing this character as long s in ANY type style. While I'm not convinced of your variation proposal at all (on the contrary), if you write it, write it properly. :-) /Sz 2010/8/4 Karl Pentzlin karl-pentz...@acssoft.de On Tuesday, 3 August 2010 at 19:11, Janusz S. Bień wrote: JSB I see no reason why, if I understand correctly, the long s variant is JSB to be limited to Fraktur-like styles. The *variant* is applicable to situations where the character is to be displayed long when Fraktur-like styles are in effect, while it is to be displayed round when modern styles are in effect. The plain *character* long s is intended to be displayed long in all circumstances. However, in my next version, I will replace the s variants by long s variants: 017F FE00 ...LONG S VARIANT-1 STANDARD FORM · will be displayed long in any script variants 017F FE01 ...LONG S VARIANT-1 FLEXIBLE FORM (naming provisional) · will be displayed long in Fraktur, Gaelic, and similar script variants · will usually be displayed round when used with Roman type This has the advantage that, especially when implicit application of variation sequences is possible, it can be applied to existing data without change. - Karl Pentzlin -- Szelp, André Szabolcs +43 (650) 79 22 400
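At the code point level, the two sequences Karl lists would look like this. A minimal Python sketch; note that these sequences were only *proposed* (L2/10-280) and were never standardized, so the code merely shows how such strings would be built:

```python
# Proposed (never standardized) long-s variation sequences, code point view only.
LONG_S = "\u017F"  # LATIN SMALL LETTER LONG S (ſ)
VS1 = "\uFE00"     # VARIATION SELECTOR-1
VS2 = "\uFE01"     # VARIATION SELECTOR-2

standard_form = LONG_S + VS1   # proposed: display long in ANY type style
flexible_form = LONG_S + VS2   # proposed: long in Fraktur/Gaelic, round otherwise

print([f"U+{ord(c):04X}" for c in standard_form])  # ['U+017F', 'U+FE00']
```

As Szelp notes, the standard form would duplicate what the bare U+017F code point already means.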
Re: Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
will decide to reunite their cultural efforts [...] and increasing their mutual cultural exchanges instead of wasting them for old nationalist reasons You're either an utmost optimist, or you really have no idea of Eastern European history, culture and spirit. :-) I doubt your described scenario will come true in our lifetimes. /Sz On Wed, Aug 4, 2010 at 11:10 PM, verdy_p verd...@wanadoo.fr wrote: Doug Ewell wrote: There is no formal model in the sense of a standard N-letter subtag for dialects, because the concept of a dialect is too open-ended and unsystematic. The word means different things to different people. What may be a dialect to one person might be a full-blown National Language to another, or just a funny accent to a third. The formal model already exists in ISO 639, which has decided to unify all dialectal variants under the same language code. Yes, the concept is fuzzy, but as long as ISO 639 does not contain a formal model of how the various languages are grouped into families and subfamilies, it will be impossible to use dialectal variant specifiers with accurate fallbacks without using subtags for the language variants. One known problem is for example Norman, which ISO 639 still considers a dialect of French, even though it is just ANOTHER Oïl language (from which Standard French emerged by merging, modifying and extending several dialects). But Jersiais is now a language with official status in Jersey, and it is clearly part of the Norman family. And it still needs to be distinguished from French. Still, there's no ISO 639 code for Norman (as a family, or as the residual language in continental Normandy in France), and no code for Jersiais either. And French is considered in ISO 639 an isolated language, not a macrolanguage. So it allows no further precision. 
If something is added, it can only be a variant for the dialectal difference, such as fr-norman for the Norman family, or fr-jersiais for Jersiais, unless Jersiais gets its own ISO 639-3 code as an isolated language (leaving continental Norman still as a dialectal variant of French). The formal definition of languages is the definition of ISO 639-3 isolated languages. Everything below is dialectal (and ISO 639 has clearly stated that it plans, for much later, a comprehensive encoding of dialectal differences, most probably by defining a standard list of variant codes, even if these dialects may qualify as languages for some users). It's remarkable that for most linguists, Serbian, Croatian, and Bosnian are only one language, with only dialectal differences (in the spoken language and with some grammatical derivations, and some minor lexical differences that are understood by all Serbo-Croatian speakers), orthographic differences (mostly based on their default script, even if Serbian still uses the two scripts but defines a strict transliteration system that helps define a unified orthography for both scripts, orthographies that are simplified in Croatian and Bosnian). So yes, the concept of dialect vs. language is fuzzy for linguists and users (and nationals that prefer to see their dialect, named after their country, as a full language instead of a dialect), but ISO 639 defines a formal model by its technical encoding: if there's an authority defending the position of a distinct language and defining an official lexicon and orthography, it becomes a de facto language for ISO 639. 
Such splitting of languages, with dialectal differences promoted to isolated languages, has occurred and was endorsed by ISO 639, even if it was probably not in the interest of these countries to split their common language and to reduce its audience and cultural influence in other parts of the world (and many of their own citizens won't care much about these formal official differences, as long as they understand the language and can read and write it in a script that they can decipher without difficulty, if only because they will constantly live near other peoples sharing the same language under a different name). Serbian is still perceived and encoded as a single language, even though it still uses two scripts, depending on the region of use (but it is now rapidly converging to the Latin script). Maybe the linguistic and cultural authorities of the four concerned countries (or five, now with Kosovo, whose independence was recently validated by an international court?) will decide to reunite their cultural efforts, if they finally all use the same Latin script, by adopting a new neutral name (Dolmoslavic, Adriatic, Adrislavic? Or even Yugoslavic?) and increasing their mutual cultural exchanges instead of wasting them for old nationalist reasons (this will be even more important when they finally ALL join the European Union, with increased exchanges between them). Philippe.
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Thank you for your reply. On Wednesday 4 August 2010, Karl Pentzlin karl-pentz...@acssoft.de wrote: WO Why is it not possible specifically to request a one-storey form of lowercase letter a? I did not do this, as I do not know of a cultural context where the two-storey form is to be suppressed to prevent an a from being mistaken for any letter too similar to a two-storey a. Well, I was intending this as a straightforward way to access glyph alternates. Noticing that you mentioned cultural context, I have now remembered a situation that might perhaps be of interest. It was in a thread about fonts for teaching children in the United Kingdom how to read and write. http://forum.high-logic.com/viewtopic.php?f=10&t=296 WO What happens in relation to a character such as g circumflex? Would one be able to access a glyph alternate for g circumflex? The variation selector can be followed by any diacritic, which then is applied to the base character. Yet what if one wants to use the precomposed g circumflex character? WO Could there be variants for lowercase e, ... I have found none, which of course is no proof of non-existence, WO for a horizontal line glyph design, and for an angled line, Not according to the principles outlined in my proposal, WO Venetian-style font, glyph design please? No. I was looking for a way to access a glyph alternate for typography, not for any cultural meaning. Maybe one might choose to use an e with an angled line in the words Venice and Venetian, for subtle effect in the typography. I find that adding alternate glyphs to fonts is a modern trend. There seems to be no current way to access them from plain text. WO Would it be possible to define U+FE0F VARIATION SELECTOR-16 to indicate an end of word alternate glyph for each lowercase Latin character? No. Even if you find a cultural context where such things are required, such things are positional variants which are to be handled by the proven mechanisms developed for scripts like Arabic. 
I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. William Overington 5 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote: However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Are you here talking about such things as alternate glyph styles? It depends what one means by need. Adding alternate glyphs to a font is a trend in modern font design. One approach is to use Private Use Area mappings, which can be used to produce stylish hardcopy printouts and stylish graphics for the web, yet there are the well-known problems of spell-checking and so on if Private Use Area mappings are used for much more than those application areas. The other approach is to use an alternate glyph model, where the underlying plain text is conserved. However, this, today, often means using expensive software packages with a proprietary file format in order to store the information about which glyph to use in each case. I remember those advertisements that CNN used to run promoting the concept of advertising. Advertising - your right to choose. One of the advertisements distinguished between what people need and what people want. So, maybe people do not need to use alternate glyphs in typography, yet maybe they want to do so, maybe they enjoy doing so. I feel that it is entirely reasonable that Unicode and ISO 10646 encode things that help people do what they want to do and what they enjoy doing as well as what they need to do. William Overington 5 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 8/5/2010 3:47 AM, William_J_G Overington wrote: On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote: However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Are you here talking about such things as alternate glyph styles? No, I am referring to the element of the proposal that proposes to have a variation sequence that selects the unspecified form for lower case a. It depends what one means by need. I've written a longer answer here: http://www.unicode.org/forum/viewtopic.php?f=9t=83start=0 A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. Why can't a poet find a poetic means of doing that, instead of depending on a standards organization to provide a standard means of doing so in plain text? Seems kind of anti-poetic to me. ;-) --Ken
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday 3 August 2010, Karl Pentzlin karl-pentz...@acssoft.de wrote: Any comments are welcome. Firstly, thank you for making the document available. I have made a few comments regarding matters that I noticed. Please know that, whilst I comment on various matters, I am enthusiastic for the general thrust of your suggestion regarding access to alternate glyphs for Latin characters using Variation Selectors. This could produce a renaissance for typography. In the document, on page 2, there is the following. quote But while the general mechanisms for doing so are standardized (i.e. OpenType features), the concrete selection of a specific glyph is not. end quote It is important that the Unicode specification does not regard any particular font technology as being the standard font technology. This issue was discussed in this mailing list some years ago. http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0106.html The last two paragraphs of the following post put that post in context. http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0095.html Why is it not possible specifically to request a one-storey form of lowercase letter a? It seems to me that being able to request either a one-storey form or a two-storey form of lowercase letter a would be better. In relation to lowercase g, would it be better to be able to request any one of open descender, closed loop descender and unclosed loop descender? For example, the lowercase letters g in the fonts Arial, Times New Roman and Trebuchet MS show the three types. What happens in relation to a character such as g circumflex? Would one be able to access a glyph alternate for g circumflex? Could there be variants for lowercase e, for a horizontal line glyph design and for an angled line, Venetian-style font, glyph design please? Would it be possible to define U+FE15 VARIATION SELECTOR-16 to indicate an end of word alternate glyph for each lowercase Latin character? 
Certainly, some usages would be more likely than others, with d, e, h, m, n, t, z being more likely to have an end of word alternate glyph than would some other letters, yet a general usage for all Latin characters would, in my opinion, be good. William Overington 4 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
On Tuesday, 3/8/10, Janusz S. Bień jsb...@mimuw.edu.pl wrote: I see no reason why, if I understand correctly, the long s variant is to be limited to Fraktur-like styles. Long s was used with ordinary Roman type in England for English text in at least part of the 17th and 18th centuries. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the two character sequence sh? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s h ligature available please use that instead. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the three character sequence ssi? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s long s i ligature available please use that instead. William Overington 4 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
On 4 August 2010 09:19, William_J_G Overington wjgo_10...@btinternet.com wrote: Answering the two questions below on the assumption that s-VS1 0073 FE00 were to be defined as a variation sequence for long s in all type styles, and without giving any opinion on the merits or otherwise of Karl's proposal in general, or specifically the merits of double-encoding long s as a variation sequence. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the two character sequence sh? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s h ligature available please use that instead. s-VS1-ZWJ-h Note that there must be no character between a variation selector and the base character it applies to, so the ZWJ must go after VS1. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the three character sequence ssi? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s long s i ligature available please use that instead. The use of long s versus short s and ligaturing of these letters varies widely geographically and historically and depending upon typeface. The following examples would all be valid *if* s-VS1 were to be defined as a variation sequence for long s (in all type styles): s-VS1-ZWJ-s-VS1-ZWJ-i -- for a ligatured ſſi as in miſſion (usual in 18th century English typography) s-VS1-s-i -- for a non-ligatured ſsi as in illuſtriſsimos (usual in 18th century Spanish typography) s-VS1-ZWJ-s-i -- for a ligatured ſs plus i as in bleſsings (usual for italics only in 16th and early 17th century English and French typography) s-s-VS1-ZWJ-i -- for s plus a ligatured ſi as in utilisſima (sometimes in 16th century Italian typography) Andrew
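Andrew's sequences can be spelled out at the code point level. A hedged Python sketch assuming the hypothetical s+VS1 sequence from this thread (it was never standardized), mainly to show the ordering rule he notes: the variation selector must directly follow its base character, with ZWJ after it:

```python
# Hypothetical: s+VS1 as "long s" was only a proposal, never standardized.
S, VS1, ZWJ = "\u0073", "\uFE00", "\u200D"

# "ligature preferred" ſh: VS1 follows its base s directly, then ZWJ, then h.
sh = S + VS1 + ZWJ + "h"

# Ligatured ſſi as in "miſſion" (usual in 18th-century English typography).
ssi = S + VS1 + ZWJ + S + VS1 + ZWJ + "i"

print([f"U+{ord(c):04X}" for c in sh])
# ['U+0073', 'U+FE00', 'U+200D', 'U+0068']
```

The other three ſ/s/i combinations Andrew lists follow the same pattern, varying only where VS1 and ZWJ are placed.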
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Am 03.08.2010 um 02:47 schrieb David Starner: Fraktur and Antiqua are different writing systems with slightly different orthographies No. Fraktur and Antiqua are two (of many) different renderings of the Latin writing system. Regards, A. Stötzner.
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
On Wed, Aug 4, 2010 at 05:19, William_J_G Overington Long s was used with ordinary Roman type in England for English text in at least part of the 17th and 18th centuries. More on that by babelstone: http://babelstone.blogspot.com/2006/06/rules-for-long-s.html (Sorry for the duplicate email William, my mistake.) -- Leonardo Boiko
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
In my opinion, adding the s+VS1 variation sequence is completely unneeded. If you really want a long s, use the code point assigned to the long s. Fonts or renderers should still provide a reasonable fallback to s if the glyph is missing. This means that all existing ligatures with long s will continue to be encoded with long s and ZWJ. The s+VS1 proposal is an attempt to disunify the long s, when it is NOT needed at all. The only convenient variation sequence would be to add S+VS1 for the capital (because long s has no capital), only to preserve the long s semantic when converting it to uppercase or titlecase, in which case the mapping of S+VS1 to lowercase will again give the standard long s.
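The case-conversion problem described here can be checked against the standard Unicode case mappings, for example via Python's built-in string methods: long s has no capital form, so an uppercase/lowercase round trip silently loses it.

```python
long_s = "\u017F"  # ſ LATIN SMALL LETTER LONG S

# Uppercasing maps ſ to plain S (Unicode defines no capital long s)...
assert long_s.upper() == "S"
# ...so lowercasing again yields ordinary s: the long-s semantic is lost.
assert "S".lower() == "s"
assert long_s.upper().lower() != long_s
```

This is the round-trip loss that the proposed (hypothetical, never standardized) S+VS1 sequence was meant to avoid.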
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Aug 4, 2010, at 8:20 AM, Andreas Stötzner wrote: On 3 August 2010 at 02:47, David Starner wrote: Fraktur and Antiqua are different writing systems with slightly different orthographies No. Fraktur and Antiqua are two (of many) different renderings of the Latin writing system. The two propositions are not mutually exclusive. And it /is/ true that, at least at some times, Fraktur and Antiqua have had different orthographies. -- John W Kennedy There are those who argue that everything breaks even in this old dump of a world of ours. I suppose these ginks who argue that way hold that because the rich man gets ice in the summer and the poor man gets it in the winter things are breaking even for both. Maybe so, but I'll swear I can't see it that way. -- The last words of Bat Masterson
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 8/2/2010 5:04 PM, Karl Pentzlin wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin This is an interesting proposal to deal with the glyph selection problem caused by the unification process inherent in character encoding. When Unicode was first contemplated, the web did not exist and the expectation was that it would nearly always be possible to specify the font to be used for a given text and that selecting a font would give the correct glyph. As the proposal noted, universal fonts and viewing documents on other platforms and systems across the web have made this solution unattractive for general texts. We are left then with these five scenarios: 1) Free variation 2) Orthographic variation of isolated characters (by language, e.g. different capitals) 3) Orthographic variation of entire texts (e.g. italic Cyrillic forms, by language) 4) Orthographic variation by type style (e.g. Fraktur conventions) 5) Notational conventions (e.g. IPA) For free variation of a glyph, the only possible solutions are either font selection or use of a variation sequence. I concur with Karl that in this case, where notable variations have been unified, adding variation selectors is a much more viable means of controlling authorial intent than font selection. If text is language tagged, then OpenType mechanisms exist in principle to handle scenarios 2 and 3. For full texts in a certain language, using variation selectors throughout is unappealing as a solution. 
However, it may be a viable solution for being able to embed correctly rendered citations in other text, given that language tagging can be separated from the document and that automatic language tagging may detect large chunks of text, but not short runs. The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense. Notational conventions are addressed in Unicode by duplicate encoding (IPA) or by variation sequences. The scheme has holes, in that it is not possible in a few cases to select one of the variants explicitly; instead, the ambiguous form has to be used, in the hope that a font is used that will have the proper variant in place for the ambiguous form. Adding a few variation sequences (like the one to allow the a at 0061 to be the two-storey one needed for IPA) would fill the gap for times when controlling the precise display font is not available. However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Overall a valuable starting point for a necessary discussion. A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
John W Kennedy wrote: On Aug 4, 2010, at 8:20 AM, Andreas Stötzner wrote: On 3 August 2010 at 02:47, David Starner wrote: Fraktur and Antiqua are different writing systems with slightly different orthographies No. Fraktur and Antiqua are two (of many) different renderings of the Latin writing system. The two propositions are not mutually exclusive. And it /is/ true that, at least at some times, Fraktur and Antiqua have had different orthographies. And it is probably the main reason for the inclusion of Latf in ISO 15924, not just because it is a script variant, but really because it defines a distinct orthography, which should be specifiable in BCP 47 language tags. I think you could apply the same rationale to Hans and Hant as well (not really different scripts for the UCS, but distinct orthographies). Really, Hans, Hant, Latf, Latg could have been avoided in ISO 15924, if orthographic variants of the same languages had been encoded in the IANA database for BCP 47, independently of the effective font style. But for now there's still no formal model for encoding language dialects, so BCP 47 language tags still need to use tags for ISO 3166-1 region codes and for the script variant, when it should just qualify the generic script code (or it could even drop this ISO 15924 code if there was a formal code for the dialect written in a specific orthography: we would also deprecate Jpan, Hrkt in ISO 15924). Orthographic variants would also include: - the various romanization systems (for example Pinyin) and phonetic transcriptions (IPA phonetic, simplified IPA phonology), - the simplified orthographies (e.g. orthographic reforms in French and German), - and some other minor variants (like the vertical presentation for East-Asian scripts, or Boustrophedon presentation for Ancient Greek, if this alters the orientation of characters that had to be encoded differently, and the default mirroring properties are not applicable to the encoded characters in the basic language). 
For now these dialectal/orthographic variants of written languages can be registered in the IANA database for BCP 47, using codes with at least 5 letters (or with at least 4 letters or digits if there's at least one digit), but ideally the dialectal variant should be encoded as a tag BEFORE the orthographic variant. The font style preferred for each orthographic variant is still left to the rendering system, which will apply stylesheets according to the language tag. It should not be invalid to use a fallback style that ignores the orthographic variants for which there's no font support, or no support in the font rendering system or page layout system. Philippe.
Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
verdy_p verdy underscore p at wanadoo dot fr wrote: Really, Hans, Hant, Latf, Latg could have been avoided in ISO 15924, if orthographic variants of the same languages had been encoded in the IANA database for BCP 47, independently of the effective font style. Actually it was the opposite; the ability to use standardized ISO 15924 code elements to express concepts like Simplified Han was one of the driving forces behind RFC 4646 and its shift in focus from whole tags to subtags. In any case, the bibliographers and others who use ISO 15924 but not BCP 47 might need to make these distinctions as well. But for now there's still no formal model for encoding language dialects, so BCP 47 language tags still need to use tags for ISO 3166-1 region codes and for the script variant, when it should just qualify the generic script code (or it could even drop this ISO 15924 code if there was a formal code for the dialect written in a specific orthography: we would also deprecate Jpan, Hrkt in ISO 15924). There is no formal model in the sense of a standard N-letter subtag for dialects, because the concept of a dialect is too open-ended and unsystematic. The word means different things to different people. What may be a dialect to one person might be a full-blown National Language to another, or just a funny accent to a third. BCP 47 tags never *need* to use either the region subtag or the script subtag, unless they are necessary to convey the intended meaning. A tag like ja-Jpan-JP is almost never needed, because almost all written Japanese is using the Japanese writing system ('Jpan') and as used in Japan ('JP'). I'm not sure what dialect is being posited here that would make the difference between having to specify a script subtag and not having to. Orthographic variants would include also: - the various romanization systems (for example Pinyin) and phonetic transcriptions (IPA phonetic, simplified IPA phonology), 'pinyin', 'fonipa' - the simplified orthographies (e.g. 
orthographic reforms in French and German), '1606nict', '1694acad', '1901', '1996' - and some other minor variants (like the vertical presentation for East-Asian scripts, or Boustrophedon presentation for Ancient Greek, if this alters the orientation of characters that had to be encoded differently, and the default mirroring properties are not applicable to the encoded characters in the basic language). For now these dialectal/orthographic variants of written languages can be registered in the IANA database for BCP 47, using codes with at least 5 letters (or with at least 4 letters or digits if there's at least one digit), A 4-character variant subtag must *begin* with a digit. but ideally the dialectal variant should be encoded as a tag BEFORE the orthographic variant. Why is this important? -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
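Doug's correction about 4-character variant subtags can be checked mechanically. Here is a minimal sketch (Python, following the RFC 5646 ABNF for variant subtags: `5*8alphanum / (DIGIT 3alphanum)`) showing why '1901' is a legal variant while a four-letter code without a leading digit is not:

```python
import re

# RFC 5646: variant = 5*8alphanum / (DIGIT 3alphanum)
# i.e. 5-8 alphanumeric characters, or exactly 4 beginning with a digit.
VARIANT_RE = re.compile(r"^(?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3})$")

def is_valid_variant(subtag: str) -> bool:
    """Check whether a string is syntactically a BCP 47 variant subtag."""
    return bool(VARIANT_RE.match(subtag))

# Registered variants mentioned in the thread all pass:
for v in ["pinyin", "fonipa", "1606nict", "1694acad", "1901", "1996"]:
    assert is_valid_variant(v)

# "1901" is valid (4 chars starting with a digit); a 4-letter code is not.
assert not is_valid_variant("abcd")
```

Note this checks only the syntax; whether a subtag is actually registered is a separate lookup against the IANA Language Subtag Registry.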
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wednesday, 4 August 2010 at 00:31, Christoph Päper wrote: CP ... than making sure every instance of a letter is CP accompanied by the appropriate VS? My proposal contains the idea of implicit application of variation sequences by higher-level protocols. I will make this clearer in my next version. CP How did you decide what to include in your proposal ... I will also make this clearer in my next version, which will contain a paragraph on characters vs. variants vs. glyphs. - Karl Pentzlin
Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Asmus Freytag wrote: The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense.

I don't think so. If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this case, the conversion to long s will be inappropriate. So use the Fraktur round s directly. If a text in Fraktur absolutely requires the long s, it's only because the original text was already using this long s. In that case, encode the long s: the text will render with a long s both in modern Latin font styles like Bodoni (with a possible fallback to modern round s if that font does not have a long s), and in classic Fraktur font styles (with, here also, a possible fallback to Fraktur round s if the Fraktur font omits the long s from its repertoire of supported glyphs).

In other words, you don't need any variation sequence: s+VS1 would be strictly encoding the same thing as the existing encoded long s. Adding this variation selector would just be pollution (an unjustified disunification). The two existing characters already clearly state their semantic differences, so we should continue to use them. This does not mean that fonts should not continue to be enhanced, or that font renderers and text-layout engines should not be corrected to support more fallbacks (in fact it will be simpler to implement these fallbacks within text renderers, instead of requiring a new font version).
You can apply the same policy to the French narrow no-break space NNBSP (aka « fine » in French) that fonts do not need to map, provided that the font renderers or text layout engines correctly infer its best fallback as THIN SPACE, before retrying with the FOUR-PER-EM SPACE or SIX-PER-EM SPACE characters, then with a standard SPACE with a reduced metric. That's because fonts never care about line-breaking properties, which are implemented only in text layout engines. The same should apply as well to NBSP, if a font does not map it (the text renderer just has to use the fallback to SPACE to find the glyph in the selected font), and to the NON-BREAKING HYPHEN (just infer the fallback to the standard HYPHEN, then to HYPHEN-MINUS).

In fact, it would be more elegant if Unicode provided a new property file suggesting the best fallbacks (ordered by preference) for each character (these fallbacks possibly having their own fallbacks, to be retried if all the suggested ordered fallbacks have failed). In most cases, only one fallback will be needed (in very few cases, several ordered fallbacks should be listed if the implied sub-fallbacks are not in the correct order of resolution). It would avoid selecting glyphs from other fallback fonts with very different metrics.

Some of these fallbacks are already listed in the main UCD file, but they are too generic (because the compatibility mappings must resolve ONLY to non-compatibility-decomposable characters). For example, NNBSP has a compatibility decomposition to 0020, just like many other whitespace characters, so it completely loses the width information. If we had standardized fallback resolution sequences implemented in text renderers, we would not need to update complex fonts, the job for font designers would be much simpler, and users of existing fonts could continue to use them, even if new characters are encoded.
I took the example of NNBSP because it is a character that has been encoded for a long time now, but vendors still forget to provide a glyph mapping for it (for example in core fonts of Windows 7 such as the new Segoe UI font, even though Microsoft included an explicit mapping for NNBSP in Times New Roman). It's one of the frequent cases where this can be solved very simply by the text renderer itself. The same should be done to provide a correct fallback to round s if ever any font does not map the long s.

I also suggest that the lists of standard character fallbacks be scanned within the first selected font, without trying other fallback fonts (including multiple font families specified in a stylesheet or generic CSS fonts), unless the list of fallback characters includes a specifier in the middle of the list that would indicate that all the characters (the original or the fallback characters already specified before) should be searched (this will be useful mostly for
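The fallback-property idea described above could be sketched roughly as follows (Python; the fallback table is purely illustrative, not any actual Unicode data file, and a real renderer would consult the font's cmap rather than a predicate function):

```python
# Hypothetical per-character fallback table, ordered by preference.
# Each fallback may itself have fallbacks, tried recursively.
FALLBACKS = {
    "\u202F": ["\u2009", "\u2005", "\u2006", "\u0020"],  # NNBSP -> THIN SPACE, FOUR-PER-EM, SIX-PER-EM, SPACE
    "\u00A0": ["\u0020"],                                # NBSP -> SPACE
    "\u2011": ["\u2010", "\u002D"],                      # NON-BREAKING HYPHEN -> HYPHEN, HYPHEN-MINUS
    "\u017F": ["\u0073"],                                # LONG S -> round s
}

def resolve_glyph(ch, font_has_glyph, seen=None):
    """Return the first character in the fallback chain that the font maps,
    trying each suggested fallback's own fallbacks recursively."""
    seen = seen if seen is not None else set()
    if ch in seen:                    # guard against cyclic fallback chains
        return None
    seen.add(ch)
    if font_has_glyph(ch):
        return ch
    for fb in FALLBACKS.get(ch, []):
        found = resolve_glyph(fb, font_has_glyph, seen)
        if found is not None:
            return found
    return None

# A font that lacks NNBSP and THIN SPACE but has plain SPACE and round s:
coverage = {"\u0020", "\u0073"}
assert resolve_glyph("\u202F", coverage.__contains__) == "\u0020"
assert resolve_glyph("\u017F", coverage.__contains__) == "\u0073"
```

The point of the ordered list is exactly what the message argues: the renderer degrades within the selected font first, preserving metrics, instead of jumping straight to a fallback font.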
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 02:47, David Starner wrote: DS ... I don't see why DS unspecific forms should be encoded; if you want a nonspecific a, 0061 DS is the character. This is because I take into account the implicit application of a variation sequence on a base character by a higher-level protocol, which must be overridable in some way. In the next version of my proposal, I hope to make this clearer; probably I will also put another name on the unspecific variants. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wednesday, 4 August 2010 at 08:52, William_J_G Overington wrote: WO Please know that, whilst I comment on various matters, I am WO enthusiastic about the general thrust of your suggestion regarding WO access to alternate glyphs for Latin characters using Variation WO Selectors. This could produce a renaissance for typography. Admittedly, I explicitly do not want to introduce glyph encoding into Unicode through the back door. In the next version of my proposal, you will find some words about what variation sequences are *not* intended for. WO But while the general mechanisms for doing so are standardized WO (i.e. OpenType features), the concrete selection of a specific glyph is not. WO It is important that the Unicode specification does not regard WO any particular font technology as being the standard font technology. This is correct. I mention OpenType only as an example. WO Why is it not possible specifically to request a one-storey form of lowercase letter a? I did not include this, as I do not know a cultural context where the two-storey form is to be suppressed to prevent an a from being mistaken for any letter too similar to a two-storey a. WO What happens in relation to a character such as g circumflex? WO Would one be able to access a glyph alternate for g circumflex? The variation selector can be followed by any diacritic, which then is applied to the base character. WO Could there be variants for lowercase e, ... I have found none, which of course is no proof of non-existence. WO for a horizontal line glyph design, and for an angled line, Not according to the principles outlined in my proposal. WO Venetian-style font, glyph design please? No. WO Would it be possible to define U+FE0F VARIATION SELECTOR-16 to WO indicate an end-of-word alternate glyph for each lowercase Latin WO character? No.
Even if you find a cultural context where such things are required, such things are positional variants which are to be handled by the proven mechanisms developed for scripts like Arabic. - Karl Pentzlin
Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Tuesday, 3 August 2010 at 19:11, Janusz S. Bień wrote: JSB I see no reason why, if I understand correctly, the long s variant is JSB to be limited to Fraktur-like styles. The *variant* is applicable to situations where the character is to be displayed long when Fraktur-like styles are in effect, while it is to be displayed round when modern styles are in effect. The plain *character* long s is intended to be displayed long in all circumstances. However, in my next version, I will replace the s variants by long s variants:

017F FE00 ... LONG S VARIANT-1 STANDARD FORM
· will be displayed long in any script variant

017F FE01 ... LONG S VARIANT-1 FLEXIBLE FORM (naming provisional)
· will be displayed long in Fraktur, Gaelic, and similar script variants
· will usually be displayed round when used with Roman type

This has the advantage that, especially when implicit application of variation sequences is possible, it can be applied to existing data without change. - Karl Pentzlin
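A renderer implementing the two sequences proposed here (hypothetical, since these sequences are not actually encoded) might decide display forms roughly like this sketch; the function name and style flag are illustrative:

```python
# Sketch of the proposed (not encoded) long-s variation sequences:
# U+017F + U+FE00 "standard form" stays long in every style;
# U+017F + U+FE01 "flexible form" is long only in Fraktur/Gaelic-like
# styles and rounds to "s" in Roman type.
LONG_S, VS1, VS2 = "\u017F", "\uFE00", "\uFE01"

def display_form(base, selector, fraktur_style):
    """Return the character form a renderer would display."""
    if base != LONG_S:
        return base
    if selector == VS2 and not fraktur_style:
        return "s"          # flexible form rounds in Roman type
    return LONG_S           # plain long s and standard form stay long

assert display_form(LONG_S, VS1, fraktur_style=False) == LONG_S
assert display_form(LONG_S, VS2, fraktur_style=False) == "s"
assert display_form(LONG_S, VS2, fraktur_style=True) == LONG_S
```

This captures the proposal's asymmetry: the variation selector only ever *permits* rounding; it never forces a long form that the plain character would not already have.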
re: Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Doug Ewell wrote: There is no formal model in the sense of a standard N-letter subtag for dialects, because the concept of a dialect is too open-ended and unsystematic. The word means different things to different people. What may be a dialect to one person might be a full-blown National Language to another, or just a funny accent to a third.

The formal model already exists in ISO 639, which has decided to unify all dialectal variants under the same language code. Yes, the concept is fuzzy, but as long as ISO 639 does not contain a formal model for how the various languages are grouped in families and subfamilies, it will be impossible to use dialectal variant specifiers with accurate fallbacks, without using subtags for the language variants.

One known problem is for example Norman, which ISO 639 still considers a dialect of French, even though it is just ANOTHER Oïl language (from which Standard French emerged by merging, modifying and extending several dialects). But Jersiais is now a language with official status in Jersey, and it is clearly part of the Norman family. And it still needs to be distinguished from French. Still, there's no ISO 639 code for Norman (as a family or as the residual language in continental Normandy in France), and no code for Jersiais either. And French is considered in ISO 639 an isolated language, not a macrolanguage, so it allows no further precision. If something is added, it can only be a variant for the dialectal difference, such as fr-norman for the Norman family, or fr-jersiais for Jersiais, unless Jersiais gets its own ISO 639-3 code as an isolated language (leaving continental Norman still as a dialectal variant of French). The formal definition of languages is the definition of ISO 639-3 isolated languages.
Everything below that is dialectal (and ISO 639 has clearly stated that it plans, for much later, a comprehensive encoding of dialectal differences, most probably by defining a standard list of variant codes, even if these dialects may qualify as languages for some users).

It's remarkable that for most linguists, Serbian, Croatian, and Bosnian are only one language, with only dialectal differences (in the spoken language, with some grammatical derivations and some minor lexical differences that are understood by all Serbo-Croatian speakers) and orthographic differences (mostly based on their default script, even if Serbian still uses the two scripts but defines a strict transliteration system that helps define a unified orthography for both scripts, orthographies that are simplified in Croatian and Bosnian).

So yes, the concept of dialect vs. language is fuzzy for linguists and users (and for nationals who prefer to see their dialect named after their country as a full language instead of a dialect), but ISO 639 defines a formal model by its technical encoding: if there's an authority defending the position of a distinct language and defining an official lexicon and orthography, it becomes a de facto language for ISO 639. Such a split of languages along their dialectal differences, promoted to isolated languages, has occurred and was endorsed by ISO 639, even if it was probably not in the interest of these countries to split their common language and to reduce its audience and cultural influence in other parts of the world (and many of their own citizens won't care much about these formal official differences, as long as they understand the language and can read and write it in a script they can decipher without difficulty, if only because they will constantly live near other peoples sharing the same language under a different name).
Serbian is still perceived and encoded as a single language, despite still using two scripts, depending on the region of use (but it is now rapidly converging to the Latin script). Maybe the linguistic and cultural authorities of the four concerned countries (or five, now with Kosovo, whose independence was recently validated by an international court?) will decide to reunite their cultural efforts, if they finally all use the same Latin script, by adopting a new neutral name (Dolmoslavic, Adriatic, Adrislavic? Or even Yugoslavic?) and increasing their mutual cultural exchanges instead of wasting them for old nationalist reasons (this will be even more important when they finally ALL join the European Union, with increased exchanges between them). Philippe.
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On 8/4/2010 1:30 PM, verdy_p wrote: Asmus Freytag wrote: The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense.

I don't think so. If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this case, the conversion to long s will be inappropriate. So use the Fraktur round s directly.

This statement makes clear that you don't understand the rules of typesetting text in Fraktur.

If a text in Fraktur absolutely requires the long s, it's only because the original text was already using this long s.

This statement is also incorrect. The rules for when to use long s in Fraktur and when to use round s depend on the position of the character within the word in complicated ways. The same word, typeset in Antiqua style, will not usually have the long s. For German, there exist a large number of texts that were typeset in both formats, so you can compare for yourself. Even in France, I suspect that research libraries would have editions of 19th-century German classics in both formats.

In that case, encode the long s: the text will render with a long s both in modern Latin font styles like Bodoni (with a possible fallback to modern round s if that font does not have a long s), and in classic Fraktur font styles (with, here also, a possible fallback to Fraktur round s if the Fraktur font omits the long s from its repertoire of supported glyphs).
I'm skipping the rest of your message because you've started from a wrong premise, and sorting out which bits still apply even after accounting for the wrong premise is not something I have the time, energy or inclination for. Sorry, A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wed, Aug 4, 2010 at 4:33 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: Am Dienstag, 3. August 2010 um 02:47 schrieb David Starner: DS ... I don't see why DS unspecific forms should be encoded; if you want a nonspecific a, 0061 DS is the character. This is because I take into account the implicit application of a variation sequence on a base character by a higher-level protocol, which must be overridable in some way. I don't see why it must be overridable. By not including a variation sequence, you've left it up to the system to pick a glyph. Whatever glyph it picks, you have no right to complain. There is no reason for the system to do anything with the unspecific form variation sequence. -- Kie ekzistas vivo, ekzistas espero.
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Asmus Freytag: If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this case, the conversion to long s will be inappropriate. So use the Fraktur round s directly. This statement makes clear that you don't understand the rules of typesetting text in Fraktur. If a text in Fraktur absolutely requires the long s, it's only because the original text was already using this long s. This statement is also incorrect. The rules for when to use long s in Fraktur and when to use round s depend on the position of the character within the word in complicated ways. The same word, typeset in Antiqua style, will not usually have the long s.

So you just demonstrated that IF such a rule exists and is enforceable, then you DON'T need the separate encoding. In that case you can safely use a round s everywhere, and let all the appropriate round s be converted automatically to long s according to this rule. Your false assumption is, in my opinion, that such a rule exists and is enforceable for typesetting into Fraktur. Everything demonstrates that this is NOT the case: just look into actual manuscripts and old books, and you'll very frequently find that the same book used the rules inconsistently, either because of a typo made by the printer (or its typists composing the pages), or because the printer wanted to respect the original orthography used in the manuscript by the author (the printer decides NOT to decide and maintains that orthography, even if it's inconsistent).

Now if you're faced with an original book that was initially typeset in Fraktur, and want to preserve its characters as they are, just use standard round s and standard long s. You don't need ANY variation selector. You'll only be interested in adding ZWJ for encoding the ligatures that you see in the original document. Render it with a Fraktur font and you've done the work correctly; nothing more is needed.
Now render it with a Bodoni font, and all the long s will be converted to a fallback round s, if you use a correct typesetting program that will not display squares for missing glyphs. Render it on the web in HTML, and the default text renderers of browsers will use any font they have (even if you specified one, there's no guarantee that it will be available, or that the user will not have applied a personal stylesheet for their own preferred fonts, so fallback fonts will still be used); in that case the browsers will make all the efforts they can to reproduce the original distinctions between long s and round s. Now if you want to render it as a high-quality Bodoni text, you'll use a font or renderer that will either display ALL the existing distinctions as they are encoded in the text (no need of any variation selector for that), or NONE of them (all long s will be rendered like round s).

For German, there exist a large number of texts that were typeset in both formats, so you can compare for yourself. Even in France, I suspect that research libraries would have editions of 19th century German classics in both formats.

Yes, but this is not relevant to the issue. You DON'T need any variation sequence to encode the differences WHERE THEY EXIST. If you want the correct long s in the Fraktur-rendered text, use the standard long s where it occurs and nothing else. The same text will still render with round s in a Bodoni-like font, and will display the Fraktur differences when using a modern font mapping the two characters to two distinct glyphs.

And then only one case remains useful: if you still want some long s in the original Fraktur text to convert to long s in a modern style, while others will still convert to round s, using the SAME font. Only for this case, what you'll need is NOT ⟨long s⟩ but REALLY ⟨long s, VS1⟩, so that the renderer will know (from the presence of VS1) that the ⟨long s, VS1⟩ is safely convertible to ⟨round s⟩ when using a modern font that has mappings for both characters.
In other words, the modern font will add a mapping of ⟨long s, VS1⟩ to the same glyph as ⟨round s⟩, instead of just to ⟨long s⟩ when ignoring the variation selector. This VS1 will encode those long s that are not absolutely long when rendering in styles other than the original Fraktur (such as Antiqua). For the reverse conversion (from modern texts to Fraktur), which you would use for fancy new creations, you won't need to encode anything other than ⟨round s⟩ (which will be converted automatically to ⟨long s⟩ where appropriate, applying the strict rules automatically and consistently), unless you still want to force some others (for fancy reasons) into the document rendered in a Fraktur-like style (but remember that the original was not using ⟨long s⟩, except where it was forced in the original). With this scheme you'll still be able to preserve the original modern non-Fraktur text. Philippe.
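The convention argued for here can be sketched as a one-line text transformation (a hypothetical convention, not an encoded Unicode semantic; the sample word is illustrative):

```python
# Sketch of the scheme: a plain U+017F LATIN SMALL LETTER LONG S stays
# long in every style, while U+017F followed by VS1 (U+FE00) is marked
# as convertible to round "s" when a modern (Antiqua) style is in effect.
LONG_S, VS1 = "\u017F", "\uFE00"

def to_antiqua(text: str) -> str:
    """Map the 'flexible' long s (long s + VS1) to round s; keep plain long s."""
    return text.replace(LONG_S + VS1, "s")

word = "Wach" + LONG_S + VS1 + "tube"      # long s marked as convertible
assert to_antiqua(word) == "Wachstube"
assert to_antiqua(LONG_S + "o") == LONG_S + "o"   # unmarked long s preserved
```

The Fraktur rendering path would simply ignore VS1, so both encodings display identically in a Fraktur font.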
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Philippe, Text typeset in Fraktur contains more information than text typeset in Antiqua. That means there are some places where there are some (mild) ambiguities of representation in the Antiqua version. Not enough to bother a human reader, who can use deep context to read the text correctly, but enough that a mere typesetting system cannot correctly render such a text in Fraktur. I'm not currently aware of anything that would prevent an automated system from converting a text encoded for Fraktur to one encoded for Antiqua, because you are merely throwing away information. So far we agree.

The question is whether it would be possible to make this process work by default in common, unmodified rendering engines, and whether that is desirable. (I don't treat either of these questions as settled one way or the other, so please don't attribute a position to me on that subject.) What I do know is that there are historic documents using Antiqua fonts that do use the long s. Therefore, in principle, you don't necessarily want to create fonts that map long s to round s. And, as an author, you can't rely on such a font being present on the reader's end; it might equally likely be one that does implement the long s. So, whatever automatic rendering of Fraktur-ready text with non-Fraktur general-purpose fonts you have in mind should not rely on this kind of non-standard glyph substitution. That would be a terrible hack, imperiling the ability of people to use the long s outside the context of the Fraktur tradition.

All I had argued for was that Karl should take the consideration of rendering text encoded for Fraktur out of his proposal and make it part of a separate document that addresses ALL issues of this type of rendering, making it a complete specification; that would be something that allows review on its own merits. A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 3 Aug 2010, at 01:04, Karl Pentzlin wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. I don't think it is a good idea. In particular the implications for Serbian orthography would be most unwelcome. Michael Everson * http://www.evertype.com/
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. Which kind of implications do you refer to? The proposed variation sequences simply provide more general access to typographic details which can now be accomplished only by more complicated means, like implementing locale-specific glyph selection within a font and relying on a higher-level protocol to supply the correct locale information. (Anyway, such means may stay in effect in parallel with the use of variation sequences.) One of the advantages of variation sequences is that the glyph selection is transparent to the user, instead of being implemented in each font in a non-standardized way. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Karl Pentzlin: The proposed variation sequences simply provide a more general access to typographic details, which now can be accomplished by more complicated means like implementing locale-specific glyph selection within a font, and relying on a higher-level protocol supplying the correct locale information. How is selecting and setting once a locale (vulgo language) more complicated than making sure every instance of a letter is accompanied by the appropriate VS? They don’t seem very handy for runs of text, but VS are probably the right tool for reference work, e.g. http://en.wikipedia.org/wiki/Cyrillic_alphabet#Letterforms_and_typography. So it makes sense to specify combinations. How did you decide what to include in your proposal, though? There are many more variants, even when not taking handwritten forms into account, e.g. ‘u’- or ‘v’-based ‘y’ and ‘w’ or uppercase letters with diacritics above rendered lower so they’re not using more vertical space than the base letters.
Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
0073 FE00/FE01 - must be LATIN SMALL LETTER S, not LETTER B. Leo On Mon, Aug 2, 2010 at 5:04 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Mon, Aug 2, 2010 at 8:04 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Two things jumped out at me on a quick glance. First, I don't see why unspecific forms should be encoded; if you want a nonspecific a, 0061 is the character. Secondly, Fraktur and Antiqua are different writing systems with slightly different orthographies; instead of messing around with variation sequences, just accept that. If they must be distinguished, surely the long-s variation sequence could be used in non-Fraktur fonts, like Blackletter and 18th century-style fonts. -- Kie ekzistas vivo, ekzistas espero.
Romanian and Cyrillic
I posted this message to the message boards of Distributed Proofreaders-Europe dp.rastko.net (a joint effort of Project Rastko www.rastko.net and Project Gutenberg www.gutenberg.net), and got this response from one of the site admins.

nikola wrote: Haha, Romanian used Cyrillic up to the 19th century, so sooner or later we WILL have Romanian books in Cyrillic here. Nikola, David refers to the Moldavian situation, which is a little bit different compared to the situation in the modern Romanian state since its formation. David, here are some preliminary thoughts:

Prosfilaes wrote: From the Unicode mailing list: Quote: Since we're talking about Romanian... Prior to 1991, the Soviet-controlled administration attempted to create a distinct linguistic identity, Moldovan, which as I understand it basically amounted to Romanian written in Cyrillic script. (They tried to introduce some archaic Romanian forms and Russian loans, but apparently none of it stuck.)

I expect a gradual influx of Romanian, Moldavian, Tzintzar and Vlach members after May 24. I'm in almost daily contact with our friends and collaborators from Bucharest and Timisoara these days, regarding our Romanian NGO which is under registration at the moment, and they'll also serve as the medium of our future local Moldavian network. Before their more detailed opinion, I can offer some analogies from similar cases. A bi- or tri-alphabet situation is not rare in SE European or Eurasian cultures. In previous centuries we find all combinations of parallel use of Cyrillic, Glagolitic, Latin, Greek or Arabic scripts among Serbs, Croats, Romanians, Albanians etc. Religious or ideological affiliations are to blame for the very recent and oppressive reduction to just one major script, but even now we have the Serbian case with Cyrillic as the only standard script, yet the Latin script widely used on a daily social level without prejudice even in the core of Serbian culture.
Project Rastko's general policy is more or less to OCR/publish the version in the original script, but also to provide transliterated versions into other commonly used scripts. Although we are proponents of having one official script, we publish Serbian works in an additional Latin version in order for them to be easily read also in Muslim or Croat areas of former Yugoslavia (which share a common language with Serbian culture). For Romanian and Moldavian books printed in Cyrillic, I suppose the only logical solution is to apply Rastko's rules: to process them in the original script but to publish in parallel a Latin-script version which modern Romanian readers could read.

Prosfilaes wrote: Quote: How relevant is Romanian in Cyrillic script at this point? For instance, what's the likelihood that someone might want to put Romanian-Cyrillic content on the web? Already being done? A reasonable possibility? Extremely unlikely?

It is a reasonable possibility. The phenomenon of script is supranational and for academic purposes should also be treated as supraconfessional or supraideological.

Prosfilaes wrote: I know DP-EU plans to do it sometime, but do we have stuff that could be uploaded tomorrow, or is there something in our plans, or is it something that we'll do if and when something clearable comes along (which will be hard, as this is strictly post-1945)?

Tomorrow? Yes, if it is desperately needed, it could be uploaded in less than 48 hours by the Bucharest guys. More realistically speaking, the end of the summer or the last quarter should be a more systematic phase for the Moldavian case. Copyright clearability is not an issue, since Rastko's material is mostly from modern authors who gave non-exclusive rights to publish their works on the Net for free.
David, please let us know anything new you learn about this subject, for it could be important for several publishing projects our network is preparing. [We have in our computers perhaps 100 eBooks about Romanian culture, processed in 2003, waiting to be posted this year.]
Re: Romanian and Cyrillic
On Tue, Apr 27, 2004 at 11:29:58PM -0700, Peter Constable wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]

Would you need to have the same web text [in HTML] displayed in Romanian as well as in Cyrillic script, according to the reader's wishes?

It could perhaps be put that way: yes, what I want to know is whether there is any potential need for Romanian-language content, such as web pages, to be provided (whether according to a reader's wish, or to reflect the form of a historic document) in Cyrillic script rather than Latin script.

I did download pages in _Moldavian_ some time ago. There is a singer called Sofia Rotaru; she was rather popular in the Soviet Union, and she used to sing in Russian, Ukrainian and Moldavian (and still does: I saw her recently performing on Russian TV, singing songs in all three languages, although I do not know what that last language is called now). Anyway, I was looking for the lyrics of some songs and got to a web page with the texts of some of her songs. The page itself was in Russian, but the lyrics were in the respective languages, including Moldavian. The page seemed to be rather recent, with regular updates, etc.

--
Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ | garabik @ melkor.dnp.fmph.uniba.sk
Antivirus alert: file .signature infected by signature virus. Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Re: Question on Unicode-prevalence (general and for Cyrillic)
Peter Kirk wrote: 2. A graduate student mentioned that it was her impression that most Cyrillic webpages (at least for Russian, her interest) are still not encoded in Unicode. (She is doing some research on the use of certain words in Russian and wanted to know how best to do the search.)

Google finds matches not just in Unicode-encoded pages, but also in pages in other Cyrillic encodings. On the other hand, if the student is willing to write some kind of spider herself, it is very likely she will have to deal with all those encodings herself, won't she?

Antoine
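[Editorially added sketch.] Antoine's point, that a self-written spider must cope with all the Cyrillic encodings, can be made concrete. The candidate list and the scoring heuristic below are illustrative assumptions, not anything prescribed in the thread. Note a real limitation: the legacy Cyrillic codepages (KOI8-R, CP1251, ISO 8859-5, CP866) all map most of the high byte range to letters, so a simple "share of Cyrillic letters" score often cannot tell them apart; production detectors use letter-frequency statistics on top of this.

```python
# Sketch: guess the encoding of a fetched page body whose charset is unknown.
# Candidate list and scoring are illustrative assumptions.

CANDIDATES = ["utf-8", "koi8-r", "cp1251", "iso8859-5", "cp866"]

def guess_decode(raw: bytes) -> tuple[str, str]:
    """Return (decoded_text, encoding_name) for common Cyrillic encodings."""
    # Strict UTF-8 rarely succeeds by accident on legacy 8-bit Cyrillic text,
    # so a clean strict decode is strong evidence of UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Fall back: score each single-byte codec by the share of letters that
    # land in the Cyrillic block U+0400..U+04FF after decoding.
    best = ("", "", -1.0)
    for enc in CANDIDATES[1:]:
        text = raw.decode(enc, errors="replace")
        letters = [c for c in text if c.isalpha()]
        if not letters:
            continue
        score = sum("\u0400" <= c <= "\u04FF" for c in letters) / len(letters)
        if score > best[2]:
            best = (text, enc, score)
    return best[0], best[1]
```

A spider built along these lines would decode each fetched body with `guess_decode` and then normalize everything to Unicode for searching, which is exactly what lets one index pages regardless of their original legacy encoding.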
Question on Unicode-prevalence (general and for Cyrillic)
Two questions:

1. Is there a way to determine the prevalence of Unicode in electronic documents (vs. documents not in Unicode)? At least for the Web, has anyone done a statistical sampling to determine the percentage of Unicode-encoded webpages?

2. A graduate student mentioned that it was her impression that most Cyrillic webpages (at least for Russian, her interest) are still not encoded in Unicode. (She is doing some research on the use of certain words in Russian and wanted to know how best to do the search.) Again: has anyone looked into the situation with Cyrillic in terms of the percentage of Web documents in Unicode?

With thanks,
Debbie Anderson

Deborah Anderson
Researcher, Dept. of Linguistics, UC Berkeley
Email: [EMAIL PROTECTED] or [EMAIL PROTECTED]
Script Encoding Initiative: www.linguistics.berkeley.edu/~dwanders
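[Editorially added sketch.] On question 1, assuming one already has a corpus of fetched page bodies (the sampling and fetching machinery is omitted here), one crude byte-level check is whether the raw content even forms valid UTF-8. This is only a sketch and it overcounts: pure-ASCII pages pass regardless of their declared charset, so a real survey would also inspect HTTP `Content-Type` headers and `<meta>` charset declarations.

```python
# Sketch: estimate the share of valid-UTF-8 pages in a sample of page bodies.
# Caveat: ASCII-only pages count as valid UTF-8 whatever their declared charset.

def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

def utf8_share(pages: list[bytes]) -> float:
    """Fraction of sampled page bodies that are valid UTF-8."""
    if not pages:
        return 0.0
    return sum(map(is_valid_utf8, pages)) / len(pages)
```

For non-ASCII Cyrillic text the check is quite discriminating, since legacy single-byte Cyrillic encodings almost never produce byte sequences that happen to be well-formed UTF-8.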