Re: Introducing the idea of a "ROMAN VARIANT SELECTOR" (was: Re: Proposing Fraktur)

2002-01-30 Thread Karl Pentzlin

On Wednesday, 30 January 2002 at 00:39, Philipp Reichmuth wrote:

PR> ... for example, in German hyphenation the consonant
PR> cluster "ck" gets hyphenated as "k-k" under some circumstances. This
PR> is a rule as well, but still it is a clear case where putting it into
PR> the encoding by means of a hypothetical "UNUSUAL HYPHENATION SELECTOR"
PR> would be a bit inappropriate.

This is a completely algorithmic decision. "Some circumstances" is
practically identical to "using the old (i.e. pre-1998) orthography" (at
least I don't know of a German compound word whose first part ends in -c
and whose second part starts with k-). The new orthography hyphenates before
the -ck. (Thus, the decision how to hyphenate "ck" applies to the whole
text, not to the individual position, and does not need to be marked there.)
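
To make the "whole text, not individual position" point concrete, here is a
minimal sketch in Python. The helper name break_at_ck and its parameters are
hypothetical (not from any hyphenation library); it only illustrates that one
text-wide switch selects the rule.

```python
def break_at_ck(word, index, old_orthography):
    """Return the two line fragments when `word` is broken at the "ck" at `index`.

    Hypothetical helper for illustration; `old_orthography` is a single
    text-wide setting, not something marked at each "ck".
    """
    if word[index:index + 2] != "ck":
        raise ValueError('no "ck" at the given index')
    if old_orthography:
        # Pre-1998 rule: "ck" becomes "k-k" at the break.
        return word[:index] + "k-", "k" + word[index + 2:]
    # Post-1998 rule: the break goes before the unchanged "ck".
    return word[:index] + "-", word[index:]


print(break_at_ck("Zucker", 2, old_orthography=True))   # ('Zuk-', 'ker')
print(break_at_ck("Zucker", 2, old_orthography=False))  # ('Zu-', 'cker')
```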

PR>  I think most of these cases, including
PR> the Fraktur problem, deal with _typesetting_ rules and should thus be
PR> left to _typesetting_ software, i.e. the now-famous "higher level
PR> protocol".

The question is, are typesetting rules "part of the script"?

(I mean rules in the sense of obligatory regulations, not guidelines).
If yes, (in my opinion) the plain text must carry the information that is
needed to follow them. If no, their execution can be left to higher level
protocols (which then have to decide whether a word is a foreign word
[to be set in Roman letters] or a name [to be set in Fraktur letters],
at least according to German typesetting rules).

PR> Would this mean much of an advantage over selecting a different font
PR> for the respective character by means of markup?

The advantage is that you can encode text to be displayed correctly
(i.e. according to the obligatory typesetting rules) in Fraktur as
plain text. You can even display this text correctly in Fraktur or
Roman without change (as you can encode a Serbocroatian plain text to
be displayed in Latin or Cyrillic correctly without change).

Fraktur and Roman are "script variants", not "font variants". Both
"script variants" have a lot of fonts, but they are not fonts themselves.

If you regard the typesetting rules as "part of the script", you can
look at Fraktur as a script variant which has four cases:
"upper/lower for foreign words" and "upper/lower for the rest".
The former accidentally happen to look like the two cases of the Roman
script variant; thus you can use a Roman font for these two cases and
another "real Fraktur letter" font for the other two.
Cases could be left to higher level protocols, but for good reasons
they are not.

--
Karl Pentzlin
AC&S Analysis Consulting & Software GmbH
München, Germany





Re: Unicode Search Engines

2002-01-30 Thread Stefan Probst

Hello Doug,

Judging from how well you understood the issue (including your case 5), 
one could think you were Vietnamese ;)

It is exactly the "dot below" which causes the most problems, since its 
combining class (220) is lower than that of some of the modifiers (230).
And unfortunately other tonal marks have the same combining class as the 
modifiers (230), and therefore the sorting seems not even to be specified!

To have the information together:
The modifiers, which change the base character to form a new character:
breve        U+0306  combining class: 230
circumflex   U+0302  combining class: 230
horn         U+031B  combining class: 216
The tonal marks, which have only a very loose connection with the character 
(i.e. in handwriting they are often even placed above two adjacent vowels):
grave        U+0300  combining class: 230
hook above   U+0309  combining class: 230
tilde        U+0303  combining class: 230
acute        U+0301  combining class: 230
dot below    U+0323  combining class: 220
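
The values in the list above can be checked with a short Python sketch. It
uses the standard unicodedata module; the MARKS dictionary and the function
name combining_classes are only illustrative.

```python
import unicodedata

# The eight marks listed above, keyed by the names used in this message.
MARKS = {
    "breve": "\u0306",
    "circumflex": "\u0302",
    "horn": "\u031b",
    "grave": "\u0300",
    "hook above": "\u0309",
    "tilde": "\u0303",
    "acute": "\u0301",
    "dot below": "\u0323",
}


def combining_classes(marks):
    """Map each mark name to its canonical combining class."""
    return {name: unicodedata.combining(ch) for name, ch in marks.items()}


for name, ccc in combining_classes(MARKS).items():
    print(f"{name:<12} U+{ord(MARKS[name]):04X}  combining class: {ccc}")
```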

I have already made test pages, e.g. the one at
http://www.isoc-vn.org/www/standard/normalizationtest13.html

The issue runs even a bit further:

(1) Sorting
It is said that, in sorting, all combining marks should be disregarded.
While in Vietnamese this is OK for the (combining) tone marks, it is 
absolutely not OK for the (combining) modifiers. In Vietnamese, e.g., an "a" 
with a "circumflex" is a completely different character from an "a" alone.
This is why some circles in Vietnam prefer what I call "VN-combined": base 
character and modifier pre-composed, tone mark combining.
(2) Converting
Inside of Vietnam, in the past, there were mainly two different encodings used:
- "TCVN-ABC": Fully pre-composed, but a separate font for some upper case 
characters
- "VNI": Mainly using combining characters
When converting old documents (office and web) to Unicode, the question 
will be whether the tools do any normalization (especially in the case of 
VNI), or simply re-map [combining] character by [combining] character.

And to make things worse, it seems that MS prefers the combining way, 
saying that their sorting, spell check, word wrap etc. work that way.

Vietnam plans to make Unicode compulsory for state offices by the middle of 
2002. I have been asked to advise, and volunteered to take care mainly of 
Internet issues.

Right now, in Vietnam they are still discussing whether they should 
require a specific normalization, and if so, which one of the four possible 
candidates.

According to W3C's draft at http://www.w3.org/TR/charmod/#sec-Normalization 
it seems that all Web Applications (and that might include search 
engines?) should reject (to be precise: MUST NOT handle) everything which 
is not NFC. This could mean that search engines MUST NOT index pages that 
are "not NFC" and must reject queries that are "not NFC". If they do: fine. 
If not: then we probably have quite some problems...
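
As an illustration only, here is what "normalize before indexing, normalize
before querying" could look like in Python. The class TinyIndex is
hypothetical and says nothing about how any real search engine works; it
simply converts both sides to NFC instead of rejecting non-NFC input.

```python
import unicodedata
from collections import defaultdict


class TinyIndex:
    """Toy word index that stores and looks up words in NFC."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Normalize the page text before splitting it into words.
        for word in unicodedata.normalize("NFC", text).split():
            self._postings[word].add(doc_id)

    def search(self, query):
        # Normalize the query the same way, so every spelling of a word matches.
        key = unicodedata.normalize("NFC", query)
        return set(self._postings.get(key, set()))


index = TinyIndex()
index.add("page1", "Vie\u0302\u0323t Nam")   # base, modifier, tone mark
print(index.search("Vi\u1ec7t"))             # fully precomposed query
print(index.search("Vi\u00ea\u0323t"))       # "VN-combined" query
```

Both queries should find page1, because all three spellings have the same
NFC form.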


And since we are already on Vietnamese (to round things off):
I am not sure how, e.g. in the introduction to dictionaries or Vietnamese 
language books, a tonal mark can be printed "alone". One solution might 
be to combine it with a "space", but at present this does not always work.
And only some of the tonal marks seem to have a "stand-alone version", e.g. 
U+02CB for the "grave".

Best Regards,
Stefan


At 01:29 30.01.2002 -0500, [EMAIL PROTECTED] wrote:
>In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
>[EMAIL PROTECTED] writes:
>
> > I would like to add:
> > How do they handle normalization?
> > In Vietnam, many characters can be represented in several different ways:
> > (1) fully precomposed (NFC)
> > (2) base character and modifier precomposed, tonal mark combining
> > (3) base character, then modifier, then tonal mark
> > (4) like (3), but modifier and tonal mark sorted (NFD)
> > Do the search engines do any normalization, before indexing a page?
> > Are queries normalized before running the search?
>
>I'm not sure what sort of normalization might be performed by search engines,
>but I want to examine the Vietnamese decomposition aspect for a moment.
>
>If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
>CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
>Unicode in at least three ways:
>
>(1) fully precomposed (NFC) -- that is, U+1EA4
>(2) base character and modifier precomposed, tonal mark combining -- that is,
>U+00C2 U+0301
>(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
>U+0301
>
>So far, so good.  But then we have:
>
> > (4) like (3), but modifier and tonal mark sorted (NFD)
>
>If "sorting" the diacritical marks in NFD results in rearranging the two
>diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
>Vietnamese orthography, the NFD form may not really be a legitimate way of
>representing the Vietnamese letter.
>
>For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT

RE: Introducing the idea of a "ROMAN VARIANT SELECTOR" (was: Re: Proposing Fraktur)

2002-01-30 Thread Marco Cimarosti

Karl Pentzlin wrote:
> [...] (as you can encode a Serbocroatian plain text to
> be displayed in Latin or Cyrillic correctly without change).

I guess you are talking about old Yugoslav character sets, as this would not
be possible in Unicode.

Another case of a single encoding which overlaps more than one script is
ISCII, the Indian standard encoding.

> Fraktur and Roman are "script variants", not "font variants". Both
> "script variants" have a lot of fonts, but they are not fonts 
> themselves.

In rich text, you don't necessarily have to set a different font for roman
words in Fraktur text: the higher level protocol could be designed to have a
"roman" or "loanword" tag which is independent of font choice.

In plain text, I think that plane 14 language tags could be used: imagine
defining a language "old Swedish" and a sub language "old Swedish/LOANWORD".
But I know that these language tags are not very popular, and perhaps I am
stretching their usage scope too much...

_ Marco




Re: Proposing Fraktur

2002-01-30 Thread Michael Bauer


> origin, while katakana and hiragana letters are very different and generally
> derive from completely different ideographs.
> Mark

Actually no. Of the 46 syllables, 31 have a shared root, only the derivation
is different (block writing for katakana and fast handwriting for hiragana)
... not quite what I'd call "generally" ; )

Mìcheal





RE: Proposing Fraktur

2002-01-30 Thread Marco Cimarosti

Michael Bauer wrote:
> > origin, while katakana and hiragana letters are very different and
> > generally derive from completely different ideographs.
> > Mark

Mark or Marco? Well, anyway, the root is shared. :-)

> Actually no. Of the 46 syllables, 31 have a shared root, only 
> the derivation is different (block writing for katakana and
> fast handwriting for hiragana)
> ... not quite what I'd call "generally" ; )

Oh, right! Although I count 48 syllables and 30 shared roots, that doesn't
change the basic fact that my "generally" is to be corrected to
"sometimes" or "often" at best...

> Mìcheal

Mìcheal or Michael? Well, anyway, the root is shared. :-)

_ Marco




fraktur numerals, etc.

2002-01-30 Thread ろ〇〇〇〇 ろ〇〇〇

I do not think that there are real Fraktur numerals. I wonder why?

I wrote to a Japanese guy who had a hard time trying to make a fraktur 
hiragana font. Some kana, like "ya" and "yu", adapt beautifully to fraktur. 
A few of them, like "me", are very difficult, but it is probably possible. 
Look at how a fraktur "k" looks almost nothing like a Roman "k"!
Rule 1: Do not use fraktur kana to tattoo your Japanese girlfriend's name 
on you!! Don't get any tattoo of any girlfriend's name; it is bad luck.
Rule 2: I can think of some Japanese businesswomen who would love fraktur 
kana to advertise their business. ^_^

Back to the numerals.
I think I've maybe seen some slightly fraktur-ish numerals in a font called 
Mariage, but I am not sure. I think Mariage is the most visually attractive 
Latin (alphabet) font of all time.

→　じゅういっちゃん　←
　だんせいらしさむよう




Re: Unicode Search Engines

2002-01-30 Thread Mark Davis

It is not a 'fatal flaw'. NFD makes to pretensions to represent the
most 'natural' ordering for any given language. Out of all the
possible canonically equivalent sequences, it is simply a specific,
well-defined, unique representation that is fully decomposed.

The issue of canonical equivalence itself is that the circumflex
and dot-below can come in any order and have precisely the same
appearance, *and* that we could not predict the 'natural' order for
any given language.
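
For readers who want to see where the NFD order comes from, here is a small
Python sketch of the reordering step: within each run of combining marks, a
stable sort by combining class. The function names are hypothetical; the
result is only compared against the standard unicodedata module, not offered
as a replacement for it.

```python
import unicodedata


def canonical_reorder(decomposed):
    """Stable-sort each run of combining marks by canonical combining class."""
    out, run = [], []
    for ch in decomposed:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


def code_points(text):
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Circumflex (230) typed before dot below (220): the marks swap.
print(code_points(canonical_reorder("A\u0302\u0323")))  # U+0041 U+0323 U+0302
# Circumflex and acute are both 230: a stable sort leaves them as typed.
print(code_points(canonical_reorder("A\u0302\u0301")))  # U+0041 U+0302 U+0301
# Same result as the library's NFD for U+1EAC.
print(canonical_reorder("A\u0302\u0323") == unicodedata.normalize("NFD", "\u1eac"))
```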

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

- Original Message -
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, January 29, 2002 22:51
Subject: Re: Unicode Search Engines


> In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> [EMAIL PROTECTED] writes:
>
> > I would like to add:
> > How do they handle normalization?
> > In Vietnam, many characters can be represented in several different ways:
> > (1) fully precomposed (NFC)
> > (2) base character and modifier precomposed, tonal mark combining
> > (3) base character, then modifier, then tonal mark
> > (4) like (3), but modifier and tonal mark sorted (NFD)
> > Do the search engines do any normalization, before indexing a page?
> > Are queries normalized before running the search?
>
> I'm not sure what sort of normalization might be performed by search engines,
> but I want to examine the Vietnamese decomposition aspect for a moment.
>
> If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
> CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
> Unicode in at least three ways:
>
> (1) fully precomposed (NFC) -- that is, U+1EA4
> (2) base character and modifier precomposed, tonal mark combining -- that is,
> U+00C2 U+0301
> (3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
> U+0301
>
> So far, so good.  But then we have:
>
> > (4) like (3), but modifier and tonal mark sorted (NFD)
>
> If "sorting" the diacritical marks in NFD results in rearranging the two
> diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
> Vietnamese orthography, the NFD form may not really be a legitimate way of
> representing the Vietnamese letter.
>
> For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is,
> in Vietnamese, a circumflexed A to which a tone mark (dot below) has been
> added.  It is not a dotted-below A to which a circumflex has been added.  Yet
> because of the canonical combining classes of the two diacriticals (230 for
> COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how
> the character will be decomposed.
>
> In theory, there is actually a case 5: base character and tonal mark
> precomposed, modifier combining.  In terms of Vietnamese orthography, this is
> just as illegitimate as case 4 (NFD), but most software that processes
> Vietnamese text will probably never encounter it.  But it will have to handle
> the NFD case.
>
> If I were on some other mailing lists I could think of, I would claim that
> this is a fatal flaw in the design of Unicode Normalization Form D.  It's
> not, but it is a sticky problem that needs to be dealt with when dealing with
> Vietnamese text.
>
> -Doug Ewell
>  Fullerton, California





Re: Unicode Search Engines

2002-01-30 Thread Misha . Wolf


On 30/01/2002 15:30:06 Mark Davis wrote:
> It is not a 'fatal flaw'. NFD makes to pretensions to represent the

I imagine that "to" -> "no".

Misha






Re: Unicode Search Engines

2002-01-30 Thread Mark Davis

yes, thanks.

marq
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

- Original Message -
From: <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, January 30, 2002 07:48
Subject: Re: Unicode Search Engines


>
> On 30/01/2002 15:30:06 Mark Davis wrote:
> > It is not a 'fatal flaw'. NFD makes to pretensions to represent the
>
> I imagine that "to" -> "no".
>
> Misha





Re: Unicode Search Engines

2002-01-30 Thread Mark Davis

> (1) Sorting
> It is said, that in sorting, all combining marks should be disregarded.
> While in Vietnamese this is OK for the (combining) tone marks, it is
> absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an "a"

That is not the position taken in Unicode. Combining marks should be
taken into account in sorting in a tailoring that is based upon how
they are handled in the language in question. For example, particular
ones may be treated as tones and sorted on a third level, while others
may be treated as letter modifiers and sorted on the first level.
Different combinations can also be sorted differently, according to
the requirements of the language. For more information, see the UCA
(http://www.unicode.org/reports/tr10/).

Also, the UCA specifically requires that canonical equivalence be
maintained (unless the source domain is limited to strings that do not
contain alternates), so conformant application of the UCA will sort
all of the following the same:

> >(1) fully precomposed (NFC) -- that is, U+1EA4
> >(2) base character and modifier precomposed, tonal mark combining -- that is,
> >U+00C2 U+0301
> >(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
> >U+0301
> > (4) like (3), but modifier and tonal mark sorted (NFD)
and also, in some cases,
(5) base character and tonal mark composed, modifier combining
Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

- Original Message -
From: "Stefan Probst" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: "Martin Duerst" <[EMAIL PROTECTED]>
Sent: Wednesday, January 30, 2002 01:31
Subject: Re: Unicode Search Engines



Re: Introducing the idea of a "ROMAN VARIANT SELECTOR" (was: Re: Proposing Fraktur)

2002-01-30 Thread David Starner

On Wed, Jan 30, 2002 at 09:42:08AM +0100, Karl Pentzlin wrote:
> The advantage is that you can encode text to be displayed correctly
> (i.e. according to the obligatory typesetting rules) in Fraktur as
> plain text. You even can display this text correctly in Fraktur or
> Roman without change (as you can encode a Serbocroatian plain text to
> be displayed in Latin or Cyrillic correctly without change).

What happens to the long s? That needs changing if you're talking about
Roman script since the 19th century.

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, "Peace and Love, Inc."




Re: Unicode Search Engines

2002-01-30 Thread John Cowan

Stefan Probst wrote:


> (1) Sorting
> It is said, that in sorting, all combining marks should be disregarded.


Discarded *as primary differences*.  In other words, all a's come before
all b's (not just initially, but at every point in the string), and the
distinction between various kinds of a's with marks
is considered only if all the other letters are the same.

But sorting, unlike decomposition, positively *requires* per-language
tailoring, and proper i18n sort routines always support it.
Again the Scandinavian example is relevant: the "accented letters"
are not only really primary letters (and so get a primary difference
from their non-accented counterparts), but also sort at the end
of the alphabet.


> While in Vietnamese this is OK for the (combining) tone marks, it is 
> absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an 
> "a" with "circumflex" is a completely different character than an "a" 
> alone.


Right enough.  So a-circ has a primary difference from a in
VN tailoring.
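
A rough Python sketch of such a tailoring, under loudly stated assumptions:
the alphabet order in VN_ALPHABET and the tone ranking in TONE_RANK are my
own illustrative choices, and vn_sort_key is a hypothetical function, not a
UCA implementation. Letters with modifiers get their own primary rank; tone
marks are compared only when all the letters are equal.

```python
import unicodedata

# Assumed primary order of Vietnamese letters (modifiers make distinct letters).
VN_ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
LETTER_RANK = {ch: i for i, ch in enumerate(VN_ALPHABET)}
# Assumed tone order: none, grave, hook above, tilde, acute, dot below.
TONE_RANK = {"\u0300": 1, "\u0309": 2, "\u0303": 3, "\u0301": 4, "\u0323": 5}


def vn_sort_key(word):
    """Return (primary letter ranks, tone ranks) for a two-level comparison."""
    letters = []  # each entry: [base letter plus modifiers, tone rank]
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch in TONE_RANK and letters:
            letters[-1][1] = TONE_RANK[ch]   # tone: lower-level difference
        elif unicodedata.combining(ch) and letters:
            letters[-1][0] += ch             # modifier: part of the letter
        else:
            letters.append([ch, 0])
    primary = tuple(
        LETTER_RANK.get(unicodedata.normalize("NFC", letter), len(VN_ALPHABET))
        for letter, _ in letters
    )
    tones = tuple(tone for _, tone in letters)
    return primary, tones


words = ["b\u00e0", "\u1ea5n", "an", "\u0103n", "\u00e1n"]
print(sorted(words, key=vn_sort_key))
```

Because the key is computed from the decomposed form, every canonically
equivalent spelling of a word gets the same key.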


> This is, why some circles in Vietnam prefer what I call "VN-combined": 


That is based on the naive notion, which does not even work for English,
that binary sorting can ever be culturally correct sorting.  It can't.


> And since we are already in Vietnamese (to round the things up):
> I am not sure, how e.g. in the introduction to dictionaries or 
> Vietnamese language books, the tonal mark can be printed "alone". One 
> solution might be to combine them with a "space", but at present, this 
> does not work always.


When does it not?  It is the standard Unicode thing to do.

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_





Typographical Distinctions

2002-01-30 Thread Patrick Rourke

> > And a similar difference is used in all modern European
> > languages: roman for normal text and italics for foreign words.
> The only case I've seen this in use is for some special phrases of
> French origin when used in English. Besides, this is no "rule" (i.e.
> you don't have to use italics), while this rule was applied to *all*
> occurrences of such words in old Swedish.

In formal typeset English, all foreign words and phrases (i.e., those words
which are not considered to have been naturalized) are written in italic in
a roman text, and in roman in an italic text.  Thus while "role" is set in
roman (a loanword), a word like *roman* (i.e., romance, novel) is set in
italics.  This is a hard and fast rule, and is codified as such in style
manuals (for instance, the Chicago Manual of Style).   While failure to do
so is common, it is bad typographic practice.







Re: Unicode Search Engines

2002-01-30 Thread DougEwell2

In a message dated 2002-01-30 7:28:36 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> It is not a 'fatal flaw'.

I didn't say it was.  I meant to say, but wasn't clear enough in doing so, 
that on other mailing lists the tendency is to blame Unicode for any problem 
or inconvenience in character handling.  (You should know which one I mean, 
Mark; you're on it. :)

-Doug Ewell
 Fullerton, California




Unicode support in IBM AS -400

2002-01-30 Thread Anil Joshi

Hi all,

Well I am struck by east bug. I am looking for information on Unicode
support provided in IBM AS-400. 

The exact questions I have are:
a. Does the AS-400 support Unicode, and if so, what kind of support is it? For
example, can I have file names in a local language, say Japanese?
b. Does QShell support multilingual scripts? That is, can I write a script
that contains Japanese file names?

Bye 
Anil






Questions about Unicode history

2002-01-30 Thread Marco Cimarosti

Hallo.

I am writing a short article about Unicode, and I realized that I don't know
or I am not sure of many Unicode-related facts and dates that I would like
to mention.

I apologize that this is a huge list of questions (and I hope that they are
not all in the FAQ). Anyway, if anybody is in the mood for trivia, I thank
you in advance:


- When did the Unicode project start, and who started it?

- Is it true that Han Unification was the core of Unicode, and the idea of a
universal encoding came afterwards?

- Who invented the name "Unicode", and when?

- When did the ISO 10646 project start?

- When did Unicode and ISO 10646 merge?

- What is the name of the GB and JIS standards that have the same repertoire
as Unicode?

- When did Unicode stop being "16 bits"? (I.e., when were surrogates added?)

- I can't remember the version when some scripts were added: Syriac, Thaana,
Sinhala, Tibetan, Myanmar, Ethiopic, Cherokee, Canadian Syllabics, Ogham,
Runes, Khmer, Mongolian, Yi, Etruscan, Gothic, Deseret, CJK ext. A, CJK ext.
B.

- Roughly, how many ideographs are in modern use in extensions A and B?

- Roughly, when will version 3.2 become official?

- Roughly, when will the version 4 book be published?


I also have a few non-Unicode questions:


- When was ASCII first published and by whom?

- What standard was current before ASCII? (BAUDOT, is it?) How many bits did
it use?

- Did the ASCII standard expire, and when?

- When was ISO 646 published?

- I think that ISO 646 expired. When?

- When was ISO 8859 published?

- When did the first double-byte encoding appear?

- Are OpenType fonts currently implemented in any platform other than
Windows?


Thanks again, in advance.

_ Marco




RE: Questions about Unicode history

2002-01-30 Thread Magda Danish (Unicode)

Hi Marco,

I am currently working on a few web pages that talk about the Unicode
history. They are not publicly accessible yet but I'm sure they hold the
answers to most of your questions. I will email you the temporary url in
a separate email.

Regards,
Magda.






Re: Questions about Unicode history

2002-01-30 Thread John H. Jenkins


On Wednesday, January 30, 2002, at 12:29 PM, Marco Cimarosti wrote:

>
> - Are OpenType fonts currently implemented in any platform other than
> Windows?
>
>

OpenType fonts work without modification on Mac OS X, in that the glyphs 
can be displayed.  Any Mac application can access the OT data in the font, 
parse it, and process it appropriately using public functions.  The one 
piece still missing is automatic support for OT layout data in the system.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/





Re: Questions about Unicode history

2002-01-30 Thread Eric Muller

Marco Cimarosti wrote:

> - Are OpenType fonts currently implemented in any platform other than
> Windows?

FreeType implements OpenType, including layout. By construction, FreeType only
requires an ANSI C implementation, and was written with embedded systems in
mind. Thus, the answer to your question could be "all".

Eric.






Re: Questions about Unicode history

2002-01-30 Thread John Hudson

At 09:29 1/30/2002, Marco Cimarosti wrote:

>- Are OpenType fonts currently implemented in any platform other than
>Windows?

'OpenType support' means a number of different things.

Support for the font file format and rasterisation of the TT or CFF 
outlines is widespread, including Windows, OSX (native), earlier Mac 
systems (CFF only, using ATM), and implementations of FreeType.

Support for individual OpenType Layout typographic features varies from 
application to application.

Script shaping features and character-level pre-formatting, e.g. for Indic 
scripts, are supported in Windows apps that use Uniscribe for text 
processing, and I believe the FreeType developers have also been working on 
Indic shaping, although I am not sure if this has been released yet.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]

... es ist ein unwiederbringliches Bild der Vergangenheit,
das mit jeder Gegenwart zu verschwinden droht, die sich
nicht in ihm gemeint erkannte.

... every image of the past that is not recognized by the
present as one of its own concerns threatens to disappear
irretrievably.
   Walter Benjamin





Re: Questions about Unicode history

2002-01-30 Thread Kenneth Whistler

Marco,

I'll answer as many of your questions as I can, and will
cc this to the unicode list (in part to forestall a gazillion
"Well, I think maybe X" responses).

--Ken

> - When did the Unicode project start, and who started it?

The detailed history for this will soon be available on the
Unicode website. The short answer is that Joe Becker (Xerox) and
Lee Collins (Apple) were highly instrumental in getting the
ball rolling on this, and the preliminary work they did,
primarily on Han unification, dated from 1987.

However, "the Unicode project" had many beginnings -- many points
where you could mark a milestone in its early development. And
the Unicode Consortium celebrated a number of 10-year
anniversaries, starting from 1998 and continuing through last year.

> 
> - Is it true that Han Unification was the core of Unicode, and the idea of a
> universal encoding came afterwards?

The effort by Xerox and Apple to do a Han unification was key to
the motivation that eventually led to a serious effort to actually
*do* Unicode and then to establish the Unicode Consortium to
standardize and promote it. However, the idea of a universal encoding
predated that considerably. In some respects the Xerox Character Code
Standard (XCCS) was a serious attempt at providing a universal
character encoding (although it did not include a unified Han
encoding, but only Japanese kanji). XCCS 2.0 (1980) contained, in
addition to Japanese kanji: Latin (with IPA), Hiragana, Bopomofo, Katakana,
Greek, Cyrillic, Runic, Gothic, Arabic, Hebrew, Georgian, Armenian,
Devanagari, Hangul jamo, and a wide variety of symbols. The early
Unicoders mined XCCS 2.0 heavily for the early drafts of Unicode 1.0,
and always regarded it as the prototype for a universal encoding.

Additionally, you have to consider that the beginning of the ISO project 
for a Multi-octet Universal Character Set (10646) predated the
formal establishment of Unicode. Part of the impetus for the serious
work to standardize Unicode was, of course, discontent with the
then architecture of the early drafts of 10646.

> 
> - Who invented the name "Unicode", and when?

This one has a definitive answer: Joe Becker coined the term,
for "unique, universal, and uniform character encoding", in 1987.
First documented use is in December, 1987.

> 
> - When did the ISO 10646 project start?

Unfortunately, the document register for early WG2 documents doesn't
have dates for all the early documents, and I don't have all the
early documents to check. But...

The 4th meeting of WG2 was held in London in February, 1986. The
first three meetings were in Geneva, Turin, and London, respectively.
That puts the likely timeframe for the Geneva meeting, and the
establishment of WG2 by SC2 at about 1984. The *only* project for WG2
was 10646.

Some of the older oldtimers on the list may have more exact information
about the early WG2 work.

> 
> - When did Unicode and ISO 10646 merge?

It wasn't a single date that can be pointed to, like the signing
of an armistice. In some respects, Unicode and ISO 10646 are *still*
merging, as modifications and amendments to deal with niggling little
architectural edge cases are worked out.

However the key dates were:

January 3, 1991. Incorporation of the Unicode Consortium, which
   signalled to SC2 that the Unicoders were serious in their
   intentions.

May, 1991. Meeting #19 of WG2 in San Francisco. An ad hoc meeting
   took place between WG2 members and some Unicoders, which paved
   the way for the later "merger" of the standards.

June, 1991. The 10646 DIS 1 was defeated in its balloting. This left
   the only reasonable way forward an architectural compromise with
   the Unicode Standard, which at that point was in copy edit and
   about to go to press.

June 3, 1991. The date of "10646M proposal draft to merge Unicode and
   10646", by Ed Hart. This was a key document in the resulting
   merger of features.

August, 1991. The Geneva WG2 meeting accepted Han unification, combining
   marks, dropped byte-by-byte restrictions on code values for UCS-2,
   and accepted Unicode repertoire additions. From that point forward,
   the overall aspect of what became ISO/IEC 10646-1:1993 was clear.

> 
> - What is the name of the GB and JIS standards that have the same repertoire
> as Unicode?

GB 13000 has the same repertoire as ISO/IEC 10646-1:1993.
JIS X 0221 has the same repertoire as ISO/IEC 10646-1:1993.

Those two were effectively national publications of 10646. You can
work out the correlations with Unicode from that.

GB 18030:2000 in principle has the same repertoire (but different
encoding) as ISO/IEC 10646-1:2000, i.e. the same as Unicode 3.0.
(But there were small problems in it.) However, the 4-byte form
of GB 18030 maps all Unicode code points, assigned or not, so
it will (in theory, at least) always have the same repertoire
as Unicode.
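
As a small illustration of the point about the 4-byte form, here is a
sketch using Python's built-in gb18030 codec. The sample characters and the
helper gb18030_length are chosen for this example only; the codec tracks
whichever edition of GB 18030 the Python release implements, which is not
necessarily GB 18030:2000.

```python
def gb18030_length(ch):
    """Hypothetical helper: number of bytes GB 18030 uses for one character."""
    encoded = ch.encode("gb18030")
    # Every sample must survive the round trip back to Unicode.
    assert encoded.decode("gb18030") == ch
    return len(encoded)

samples = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "CJK ideograph U+4E2D": "\u4e2d",
    "THAI CHARACTER KO KAI": "\u0e01",
    "CJK ext. B ideograph U+20000": "\U00020000",
}

for label, ch in samples.items():
    print(f"{label}: {gb18030_length(ch)} byte(s)")
```

ASCII stays in one byte, characters inherited from GBK take two, and
everything else, including code points beyond the BMP, falls into the
four-byte form.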

> 
> - When did Unicode stop being "16 bits"? (I.e., when were surrogates added?)

In terms of publication, with Unic

Re: Questions about Unicode history

2002-01-30 Thread Otto Stolz

Marco,

some of your questions are probably answered in Roman Czyborra's
WWW pages, particularly in
- ,
- ,
- ,
- ,
- .

> - When did Unicode and ISO 10646 merge?


The merger was initiated by an informal meeting of Unicode and WG2
members during the JTC1/SC2/WG2 meeting in San Francisco, California,
USA, in May 1991. At that time, ISO DIS 10646 (the 1st one)
was still in ballot, so no formal discussion, let alone an agreement,
was allowed by JTC1's rules.

By mid-July, DIS 10646 was formally voted down (P-members: 8 YES,
11 NO, 2 abstained; O-members: 1 YES, 3 NO, 0 abstained). 9 out
of 14 NO votes mentioned the merger ("only one universal code"),
in their national comments.

The merger and the basic architecture were agreed on at the
ISO/IEC JTC1/SC2/WG2 meeting in Geneva, Switzerland, August 19th
through 23rd, 1991.

In October 1991, the ISO SC2 plenary (in Rennes, France) unanimously
authorized WG2 to issue a new DIS 10646 in January 1992 for a
4-month (i.e. shortened) vote.

Best wishes,
   Otto Stolz





Re: Unicode Search Engines

2002-01-30 Thread David Hopwood

-BEGIN PGP SIGNED MESSAGE-

[EMAIL PROTECTED] wrote:
[snip]
> If "sorting" the diacritical marks in NFD results in rearranging the two
> diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
> Vietnamese orthography, the NFD form may not really be a legitimate way of
> representing the Vietnamese letter.
> 
> For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is,
> in Vietnamese, a circumflexed A to which a tone mark (dot below) has been
> added.  It is not a dotted-below A to which a circumflex has been added.

They are the same thing: an A with a circumflex above and a tone mark below.
The abstract value that a combining sequence represents is an unordered set
of sequences of marks, each sequence containing the marks from a given
combining class. So I don't see the problem - the ordering of marks from
different combining classes is just an encoding artefact, with no semantic
significance, and that is what NFD/NFC implement when considered as an
equivalence relation.
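
A minimal Python sketch of this, using the standard unicodedata module. The
variable names are made up for the example; the code points are the ones
under discussion.

```python
import unicodedata

# The same letter spelled three ways.
circumflex_first = "A\u0302\u0323"  # A + COMBINING CIRCUMFLEX ACCENT + COMBINING DOT BELOW
dot_first = "A\u0323\u0302"         # A + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT
precomposed = "\u1eac"              # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
spellings = (circumflex_first, dot_first, precomposed)

# The two marks belong to different combining classes, so their relative
# order carries no meaning and normalization fixes one canonical order.
for mark in ("\u0302", "\u0323"):
    print(unicodedata.name(mark), unicodedata.combining(mark))

nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}

# Two marks of the SAME combining class (acute and circumflex, both above)
# are not reordered: there the order is significant.
acute_then_circumflex = unicodedata.normalize("NFD", "A\u0301\u0302")
circumflex_then_acute = unicodedata.normalize("NFD", "A\u0302\u0301")

print(len(nfd_forms), len(nfc_forms))
print(acute_then_circumflex == circumflex_then_acute)
```

All three spellings collapse to one NFD string and one NFC string, while
the acute/circumflex pair stays distinct, which matches the "unordered set
of sequences" description above.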

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-BEGIN PGP SIGNATURE-
Version: 2.6.3i
Charset: noconv

iQEVAwUBPFZGzzkCAxeYt5gVAQHTEAf+NM3T6UFF3040DDcIiPq8Lki8mH/50hHH
nN2WeoWUGRgUHhiVI/fOG2jxqdkVIabWiqcRvhs/ZUzLeSl3DraDe9fHqS/Bw7Pq
StOAcNEMl2Pm8l0UdI0NFU9jH1TDeEXaBKOiDm6ndcDnenJcZPLye3DUU3zIs6i9
abc/77niF/MuG6SYYei6k01owH87yWJAlOIXtBYH+GuRgfxxLaTiljsE6ZYXeJoy
ZVUyK8HCks/dXL73/MymOZE9NSyUG4mp0RyS21twutXpajeO/v6nACusXd7E+WQj
TPdz2TKhTA9yVj1InCGXn+yBa/bFtfsJHLBzUNvUledW36YE69yvmg==
=ppFL
-END PGP SIGNATURE-




Re: Beta version

2002-01-30 Thread Kenneth Whistler


> At 20:04 +0100 2002-01-29, Stefan Persson wrote:
> >Concerning glyphs U+0364 and U+0366 at
> >http://www.unicode.org/charts/PDF/U32-0300.pdf: Aren't these the same as
> >U+0308 and U+030A? In old Swedish U+0364 was used for words written in
> >Fraktur (non-loan words), while U+0308 was used for words written in
> >"antikva" (loan words and most personal names).
> 
> They are required for Middle High German texts. 0366 is debatable, 
> but usage is up to the user.
> -- 
> Michael Everson *** Everson Typography *** http://www.evertype.com

Additionally, at this point, the period for feedback on the BETA
for Unicode 3.2 is closed. The UTC meets next week, among other things,
to digest and decide on various issues that have come up from all
the feedback.

However, it should be noted that in any case, the decisions on
which *characters* are to be encoded for Unicode 3.2 are long
past. Those cannot be revisited now -- they are already on the
last legs of publication both for the Unicode Standard and for
10646. What the BETA period has been for is verifying
the correctness of the various data files and the text of the
UAX that will be formally published in March as Unicode 3.2.

People who want to have an impact on what characters should or should
not be standardized in future versions of the Unicode Standard
would be well-advised to be considering the content of the
currently open amendments to 10646 (Amendment 2 to 10646-1 and
Amendment 1 to 10646-2). *Those* are the vehicles for characters
to go into Unicode 4.0, and *those* are where feedback still has
some chance of being taken into account and making a difference
in what gets standardized.

--Ken




RE: Beta version

2002-01-30 Thread Kent Karlsson


COMBINING RING ABOVE and COMBINING LATIN SMALL LETTER O look different
(small "true" ring vs. an o-shape (rarely a "true" circle) a bit larger
than the small ring). The latter is a historic precursor to the former.

COMBINING DIAERESIS and COMBINING LATIN SMALL LETTER E really look
different, though, again, the latter is a historic precursor to the former.

In both cases either of the here contrasted diacritics can be used with
fraktur as well as with "antiqua".  Maybe the use of the "modern" versions
with fraktur is a modernistic abuse, but still.  For "antiqua" both
versions *could* be used freely(!) mixed in a text up to about a hundred
years ago or so (I haven't tried to find out exactly when), when the
modern versions took over completely.  Mixing styles might be seen as
bad typography, but it did happen.

Of course, e.g., a, , and a should be ordered the same
at the primary level for the Nordic languages.

/kent k

PS (with regard to the related issue with Vietnamese)

Notice how the UCA (UTS 10) requires that 
be ordered at the primary level as an a (), when a is
tailored at the primary level (e.g. to be near the end of the alphabet),
but the dot is a secondary level difference.  (Unfortunately, 14651 does
not make the same requirement...)






RE: Questions about Unicode history

2002-01-30 Thread Alistair Vining

Otto Stolz wrote:
>
> some of your questions are probably answered in Roman Czyborra's
> WWW pages, particularly in
>
> [czyborra.com addresses snipped]

I just found:
http://www.cwi.nl/~dik/english/codes/stand.html
whose author (Dik Winter) notes that he 'stop[s] approximately where Roman Czyborra
starts'.  Thai EBCDIC, JISCII, 6-bit ISO codes, ASCII-1963 etc.  Looks very thorough
to me, but I wasn't there...

Al.





Re: Beta version

2002-01-30 Thread Stefan Persson

- Original Message -
From: "Kent Karlsson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: den 30 januari 2002 22:41
Subject: RE: Beta version


> Of course, e.g., a, , and a should be ordered the same
> at the primary level for the Nordic languages.

"ä", "æ", "a ¨-above", and "a e-above" should all be sorted the same in
Swedish, no matter whether they're written in capital or small letters. Of
course (?), the "e-above" should always be a small "e". "a e-above" should
not be sorted as "a", as you stated above.
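
A toy Python sketch of the equivalence I mean. The table AE_VARIANTS and
the function swedish_primary are invented for this example; it only shows
which spellings should compare equal at the first level, and it is not a
collation implementation (it produces no sort weights and knows nothing
about where the letter belongs in the alphabet).

```python
import unicodedata

# Hypothetical table: spellings that count as the same letter.
AE_VARIANTS = (
    "a\u0308",  # a + COMBINING DIAERESIS (also the NFD form of U+00E4)
    "a\u0364",  # a + COMBINING LATIN SMALL LETTER E
    "\u00e6",   # LATIN SMALL LETTER AE
)

def swedish_primary(text):
    """Toy first-level key: ignore case and merge the spellings listed above."""
    folded = unicodedata.normalize("NFD", text.casefold())
    for variant in AE_VARIANTS:
        folded = folded.replace(variant, "\u00e4")
    return unicodedata.normalize("NFC", folded)

for spelling in ("\u00e4", "\u00c4", "\u00e6", "a\u0308", "a\u0364", "a"):
    print(repr(spelling), "->", repr(swedish_primary(spelling)))
```

Plain "a" keeps its own key, so "a e-above" is not merged with "a".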

Stefan







Re: Introducing the idea of a "ROMAN VARIANT SELECTOR" (was: Re: Proposing Fraktur)

2002-01-30 Thread Stefan Persson

- Original Message -
From: "Karl Pentzlin" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: den 29 januari 2002 23:39
Subject: Introducing the idea of a "ROMAN VARIANT SELECTOR" (was: Re:
Proposing Fraktur)


> While in Swedish this is a *tradition* according to Stefan, in German
> it is even a *rule*.

Also in Swedish, this was a rule. But from the end of the 18th century,
people began publishing books in Fraktur *only*, or antiqua *only*. In some
books, the antiqua part was written in italics instead. NOTE: This italic
thing should be considered as a glyph variant.

> Maybe something like a "ROMAN VARIANT SELECTOR" would be appropriate:

In any case, it'd be better to have *two* selectors, one to turn on Fraktur,
and a different one to turn it off. Otherwise, you'd have to put the variant
selector after *every* letter you want to be in antiqua, which would require
quite a lot of space. However, Fraktur is already encoded in the
Mathematical whatever-it's-called block. This variant selector would mean
that lots of characters can be displayed in two *different* ways. I'd prefer
that Fraktur diacritics were added instead, and that the mathematical
letters were to be used for Fraktur texts.
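
For reference, a short Python sketch of what that block looks like from a
program's point of view, using the standard unicodedata module. The
variable names are made up for the example.

```python
import unicodedata

math_fraktur_a = "\U0001d504"  # MATHEMATICAL FRAKTUR CAPITAL A
black_letter_c = "\u212d"      # BLACK-LETTER CAPITAL C, in Letterlike Symbols
reserved_slot = "\U0001d506"   # where a Fraktur capital C would be expected

for ch in (math_fraktur_a, black_letter_c):
    # Compatibility normalization folds both back to the ordinary letter.
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", folded)

# The alphabet in the mathematical block has holes: some capitals were
# already encoded elsewhere, so their slots are left unassigned.
print(unicodedata.category(reserved_slot))
```

Two consequences for running text: the alphabet is split across two blocks,
and any process that applies compatibility normalization turns the letters
back into ordinary ones, so the Fraktur/antiqua distinction would not
survive it.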

NOTE: Sometimes part of a word is in Fraktur, and a different part in
antiqua. Example: the Swedish word "latin" is a Latin loan word, and should
thus be written in antiqua. However, if you add the Swedish ending "-sk,"
you'll get "latinsk" ("Latin-like"). The ending is Swedish and can, but
doesn't have to, be written in Fraktur. It's up to the author to decide
which.

> This selector could fulfill another important purpose:
>
> If this selector appears after a U+017F (long s), this character is
> only to be displayed as "long s" when it is (by means of a higher level
> protocol) to be displayed in Fraktur. Otherwise it is to be displayed
> as U+0073 (lower case "s").

"Long s" is displayed as "long s" in antiqua words used in Fraktur Swedish.
So this wouldn't work. Instead, one would have to write "s" in German texts.
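
A small Python sketch of how U+017F behaves as plain text today, using the
standard unicodedata module; the variable name is made up for the example.

```python
import unicodedata

long_s = "\u017f"

print(unicodedata.name(long_s))

# Canonical normalization keeps the long s as it is ...
kept = unicodedata.normalize("NFC", long_s)

# ... while compatibility normalization and case mapping merge it with s.
folded = unicodedata.normalize("NFKC", long_s)
upper = long_s.upper()

print(repr(kept), repr(folded), repr(upper))
```

So the character already records "long s" independently of the typeface;
whether it is shown that way in antiqua is not something the encoding
decides.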

A comma after a Fraktur word is displayed as *either* "," or "/" (glyph
difference), while a comma after an antiqua word is *always* displayed as
",". So I guess that a Fraktur comma would also have to be added…

Stefan







Re: Introducing the idea of a "ROMAN VARIANT SELECTOR" (was: Re: Proposing Fraktur)

2002-01-30 Thread Stefan Persson

- Original Message -
From: "Karl Pentzlin" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: den 30 januari 2002 09:42
Subject: Re: Introducing the idea of a "ROMAN VARIANT SELECTOR" (was: Re:
Proposing Fraktur)


> PR>  I think most of these cases, including
> PR> the Fraktur problem, deal with _typesetting_ rules and should thus be
> PR> left to _typesetting_ software, i.e. the now-famous "higher level
> PR> protocol".
>
> The question is, are typesetting rules "part of the script"?
>
> (I mean rules in the sense of obligatory regulations, not guidelines).
> If yes, (in my opinion) the plain text must carry the information that is
> needed to follow them. If no, their execution can be left to higher level
> protocols (which then have to decide whether a word is a foreign word
> [to be set in Roman letters] or a name [to be set in Fraktur letters],
> such at least according to German typesetting rules).

In this case:

* The program would have to know which language it's dealing with, and which
spelling rules are used in the text (in Swedish: free spelling (as
preferred), pre-1905, and post-1905).
* It would have to know every loan word and personal name.

Here's a difficult case:

* "Et:" Latin word. Used in Swedish in cases such as "et cetera." Written in
antiqua
* "Et:" old spelling for "ett" (a, one). Written in Fraktur.

How would the program know which of them I'm referring to?

Stefan







Re: Proposing Fraktur

2002-01-30 Thread Kenneth Whistler

Stefan Persson wrote:

> AFAIK, the criteria for adding any character to the Standard is that there
> should be a difference between the character and all the other characters
> already supported by the Standard. Here we have a such difference, doesn't
> this mean that Fraktur ought to be added to the Standard.

Asmus pretty thoroughly laid out the issues for kana and Fraktur. I won't
say anything further about that.

But stepping back a little further, I would like to point out that the
assertion that:

  "the criteria for adding any character to the Standard is that there
   should be a difference between the character and all the other characters
   already supported by the Standard"  ipsissima verba <== irony warning

begs the questions which arise about the identity of the "character" in
the first place.

Every marking on paper (or papyrus, or clay, or stone, for that matter)
is not necessarily a "character" deserving of encoding as a character
in the universal character encoding, even if I can show systematic differences
between it and existing characters in the standard.

On the one hand, one must show that the differences don't fall within the
range of acceptable variation for an already existing encoded character.
And one must show that the entity in question has some verifiable
existence as an "abstract character", or that some processing requirement
forces consideration of its encoding as a character.

Merely being a distinct glyph is not enough.

> And so what? I thought the meaning of Unicode was that all languages should
> be fully supported in plain text, using one single font to display all of
> the characters. With old Swedish, this isn't possible.

I think this misconstrues the mission of Unicode as an encoding. The goal
is to encode sufficient characters to enable the correct and legible
representation of *plain* text in any script (modern or historic).

The goal is not and has never been to enable the plain text representation
of *all* extant and future texts of any form. For that, markup, high-level
layout, and font selection have always been required.

> Again: one language, one font.

No. One font is sufficient for monofont display of a language, tautologously. 
But there is no presumption that any and all text in a language need be
displayed in a single font, or that such a goal would even be desirable.

--Ken

> 
> Stefan




Re: Beta version

2002-01-30 Thread David Starner

On Wed, Jan 30, 2002 at 01:35:53PM -0800, Kenneth Whistler wrote:
> People who want to have an impact on what characters should or should
> not be standardized in future versions of the Unicode Standard
> would be well-advised to be considering the content of the
> currently open amendments to 10646 (Amendment 2 to 10646-1 and
> Amendment 1 to 10646-2). 

Is there someplace where we, the unwashed masses, have access to these
documents?

-- 
David Starner - [EMAIL PROTECTED], dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, "Peace and Love, Inc."




Re: Beta version

2002-01-30 Thread DougEwell2

In a message dated 2002-01-30 21:48:47 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Is there someplace where we, the unwashed masses, have access to these
> documents?

Yeah.  Good question.  I've found some of them myself, in particular the code 
charts, by poking around the WG2 site at dkuug.dk and in other places.  If 
they're on the public Internet, I have every right to see them and download 
them, but they clearly weren't put there for that purpose.

-Doug Ewell
 Fullerton, California