Question about U+170D, which I hope will become TAGALOG LETTER RA

2019-06-11 Thread Fred Brennan via Unicode
Greetings,

I write this letter with questions regarding a proposal I hope to make for the 
encoding of TAGALOG LETTER RA, which we locally know as the baybayin letter 
"ra", at U+170D. Many fonts already use this unassigned code point for 
TAGALOG LETTER RA, in breach of the standard. TAGALOG LETTER RA looks like 
TAGALOG LETTER DA, U+1707, with an extra stroke. For examples, see Norman de 
los Santos' Unicode baybayin fonts.[2] Paul Morrow's fonts, which are used on 
the Philippine peso, also include "ra", except for the ones meant to be exact 
digitizations of the first baybayin fonts.[4]

I had previously assumed that this code point had been left open in 
anticipation of the future encoding of TAGALOG LETTER RA, and that this simply 
hadn't happened due to apathy; however, I've since been informed that it was 
left open as an oversight of sorts, given that four Philippine scripts were 
encoded at once as a result of WG2 proposal N1933.[1]

I hope to make this request because the Google Noto developers will not follow 
the de facto standard unless it is given the Consortium's approval.[3]

My questions are:

• How old do I need to prove the letter is? Baybayin "ra" is not used in 
writing Old Tagalog and is not used in the earliest Tagalog texts. However, it 
certainly has existed since at least 1985,[4; under heading Bikol Mintz] and 
perhaps decades earlier.
• May I use signs and fonts as evidence? What types of documents may I use?
• Would anyone volunteer to help me write this proposal, or check it over 
before I send it?

Thank you.

 [1]: https://www.unicode.org/L2/L1999/n1933.pdf
 [2]: http://nordenx.blogspot.com/p/downloads.html
 [3]: https://github.com/googlefonts/noto-fonts/issues/1185
 [4]: http://paulmorrow.ca/fonts.htm 
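
A quick way to see the gap is to list the block with Python's unicodedata
module (a sketch; what U+170D reports depends on the Unicode version your
Python build ships):

    import unicodedata

    # On any Unicode version predating an assignment of TAGALOG LETTER RA,
    # U+170D has no name while its neighbours do.
    for cp in range(0x1700, 0x1715):
        name = unicodedata.name(chr(cp), "<unassigned>")
        print(f"U+{cp:04X}  {name}")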






Re: Update to the second question summary (was: A sign/abbreviation for "magister")

2018-12-02 Thread Hans Åberg via Unicode


> On 2 Dec 2018, at 20:29, Janusz S. Bień via Unicode wrote:
> 
> On Sun, Dec 02 2018 at 10:33 +0100, Hans Åberg via Unicode wrote:
>> 
>> It was common in the 1800s to singly and doubly underline superscript
>> abbreviations in handwriting according to [1-2], and [2] also mentions
>> the abbreviation discussed in this thread.
> 
> Thank you very much for this reference to the very abbreviation! I
> had looked it up on Wikipedia but hadn't read it carefully enough :-(

Quite a coincidence, as I was looking at the article topic, and it happened 
to have this remark embedded!

>> 1. https://en.wikipedia.org/wiki/Ordinal_indicator
>> 2. https://en.wikipedia.org/wiki/Ordinal_indicator#cite_note-1





Update to the second question summary (was: A sign/abbreviation for "magister")

2018-12-02 Thread Janusz S. Bień via Unicode
On Sun, Dec 02 2018 at 10:33 +0100, Hans Åberg via Unicode wrote:
>> On 30 Oct 2018, at 22:50, Ken Whistler via Unicode wrote:
>> 
>> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
>>> but we can't seem to agree on how to encode its abbreviation. 
>> 
>> For what it's worth, "mgr" seems to be the usual abbreviation in Polish for 
>> it.
>
> It was common in the 1800s to singly and doubly underline superscript
> abbreviations in handwriting according to [1-2], and [2] also mentions
> the abbreviation discussed in this thread.

Thank you very much for this reference to the very abbreviation! I
had looked it up on Wikipedia but hadn't read it carefully enough :-(

>
> 1. https://en.wikipedia.org/wiki/Ordinal_indicator
> 2. https://en.wikipedia.org/wiki/Ordinal_indicator#cite_note-1

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien



Preformatted superscript in ordinary text, paleography and phonetics using Latin script (was: Re: A sign/abbreviation for "magister" - third question summary)

2018-11-07 Thread Marcel Schneider via Unicode

On 06/11/2018 12:04, Janusz S. Bień via Unicode wrote:

> On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:
>> Hi!
>>
>> On the over 100 years old postcard
>>
>> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>>
>> you can see 2 occurrences of a symbol which is explicitly explained (in
>> Polish) as meaning "Magister".
>
> [...]
>
>> The third and the last question is: how to encode this symbol in
>> Unicode?
>
> A constructive answer to my question was provided quickly by James Kass:
>
> On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
>> Mr͇ / M=ͬ
>
> I answered:
>
> On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bień via Unicode wrote:
>
> [...]
>
>> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
>> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
>> the base character. However, in the absence of a better solution I can
>> live with it :-)
>>
>> An alternative would be to use SMALL EQUALS SIGN, but it looks like
>> fonts supporting it are rather rare.
>
> and Philippe Verdy commented:
>
> On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote:
>
> [...]
>
>> There's a third alternative, that uses the superscript letter r,
>> followed by the combining double underline, instead of the normal
>> letter r followed by the same combining double underline.
>
> Some comments were made also by Michael Everson:
>
> On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:
>
> [...]
>
>> I would encode this as Mʳ if you wanted to make sure your data
>> contained the abbreviation mark. It would not make sense to encode it
>> as M=ͬ or anything else like that, because the “r” is not modifying a
>> dot or a squiggle or an equals sign.  The dot or squiggle or equals
>> sign has no meaning at all. And I would not encode it as Mr͇, firstly
>> because it would never render properly and you might as well encode it
>> as Mr. or M:r, and second because in the IPA at least that character
>> indicates an alveolar realization in disordered speech. (Of course it
>> could be used for anything.)
>
> FYI, I decided to use the encoding proposed by Philippe Verdy (if I
> understand him correctly):
>
> Mʳ̳
>
> i.e.
>
> 'LATIN CAPITAL LETTER M' (U+004D)
> 'MODIFIER LETTER SMALL R' (U+02B3)
> 'COMBINING DOUBLE LOW LINE' (U+0333)
>
> for purely pragmatic reasons: it is rendered quite well in my
> Emacs. According to the 'fc-search-codepoint' script, the sequence is
> supported on my computer by almost 150 fonts, so I hope to find in due
> time a way to render it correctly also in XeTeX. I'm also going to add
> it to my private named sequences list
> (https://bitbucket.org/jsbien/unicode4polish).
>
> The same post contained a statement which I don't accept:
>
> On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:
>
> [...]
>
>> The squiggle in your sample, Janusz, does not indicate anything; it is
>> only a decoration, and the abbreviation is the same without it.
>
> One of the reasons I disagree was described by me in the separate thread
> "use vs mention":
>
> https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html
>
> There were also some other statements which I find unacceptable:
>
> On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:
>
> [...]
>
>> The abbreviation in the postcard, rendered in plain text, is "Mr".
>
> He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at
> 9:38 GMT (and earlier in a private mail).
>
> I understand that both of them by "plain text" mean Unicode.
>
> On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:
>
>> You could use the various hacks you've discussed, with modifier
>> letters; but that is not "encoding", that is "abusing Unicode to do
>> markup". At least, that's the view I take!
>
> and was supported by Asmus Freytag on Wed, Oct 31 2018 at 3:12 -0700.
>
> The latter elaborated his view later and I answered:
>
> On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bień via Unicode wrote:
>> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote:
>
> [...]
>
>>> All else is just applying visual hacks
>>
>> I don't mind hacks if they are useful and serve the intended purpose,
>> even if they are visual :-)
>
> [...]
>
>>> at the possible cost of obscuring the contents.
>>
>> It's for the users of the transcription to decide what is obscuring the
>> text and what, to the contrary, makes the transcription more readable
>> and useful.
>
> Please note that it's me who makes the transcription, it's me who has a
> vision of the future use and users, and in consequence it's me who makes
> the decision which aspects of text to encode. Accusing me of "abusing
> Unicode" will not stop me from doing it my way.
>
> I hope that at least James Kass understands my attitude:
>
> On Mon, Oct 29 2018 at 7:57 GMT, James Kass via Unicode wrote:
>
> [...]
>
>> If I were entering plain text data from an old post card, I'd try to
>> keep the data as close to the source as possible. Because that would
>> be my purpose. Others might have different purposes.
>
> There were also presented some ideas which I would call "futuristic": in

A sign/abbreviation for "magister" - third question summary

2018-11-06 Thread Janusz S. Bień via Unicode


On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:
> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".
>

[...]

> The third and the last question is: how to encode this symbol in
> Unicode?


A constructive answer to my question was provided quickly by James Kass:

On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
> Mr͇ / M=ͬ

I answered:

On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bień via Unicode wrote:

[...]

> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
> the base character. However, in the absence of a better solution I can
> live with it :-)
>
> An alternative would be to use SMALL EQUALS SIGN, but it looks like
> fonts supporting it are rather rare. 

and Philippe Verdy commented:

On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote:

[...]

>
> There's a third alternative, that uses the superscript letter r,
> followed by the combining double underline, instead of the normal
> letter r followed by the same combining double underline.  

Some comments were made also by Michael Everson:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]

> I would encode this as Mʳ if you wanted to make sure your data
> contained the abbreviation mark. It would not make sense to encode it
> as M=ͬ or anything else like that, because the “r” is not modifying a
> dot or a squiggle or an equals sign.  The dot or squiggle or equals
> sign has no meaning at all. And I would not encode it as Mr͇, firstly
> because it would never render properly and you might as well encode it
> as Mr. or M:r, and second because in the IPA at least that character
> indicates an alveolar realization in disordered speech. (Of course it
> could be used for anything.)

FYI, I decided to use the encoding proposed by Philippe Verdy (if I
understand him correctly):

Mʳ̳

i.e.

'LATIN CAPITAL LETTER M' (U+004D)
'MODIFIER LETTER SMALL R' (U+02B3)
'COMBINING DOUBLE LOW LINE' (U+0333)

for purely pragmatic reasons: it is rendered quite well in my
Emacs. According to the 'fc-search-codepoint' script, the sequence is
supported on my computer by almost 150 fonts, so I hope to find in due
time a way to render it correctly also in XeTeX. I'm also going to add
it to my private named sequences list
(https://bitbucket.org/jsbien/unicode4polish).
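
For the record, the sequence can be spelled out programmatically; a minimal
sketch in Python (assuming nothing beyond the three code points listed above):

    import unicodedata

    seq = "\u004D\u02B3\u0333"  # M + modifier small r + combining double low line
    for ch in seq:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    print(seq)  # displays as Mʳ̳ where font support allows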

The same post contained a statement which I don't accept:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]

> The squiggle in your sample, Janusz, does not indicate anything; it is
> only a decoration, and the abbreviation is the same without it.

One of the reasons I disagree was described by me in the separate thread
"use vs mention":

https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html

There were also some other statements which I find unacceptable:

On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[...]

> The abbreviation in the postcard, rendered in plain text, is "Mr".

He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at
9:38 GMT (and earlier in a private mail).

I understand that both of them by "plain text" mean Unicode.


On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:

>  You could use the various hacks you've discussed, with modifier
> letters; but that is not "encoding", that is "abusing Unicode to do
> markup". At least, that's the view I take!

and was supported by Asmus Freytag on Wed, Oct 31 2018 at  3:12
-0700.

The latter elaborated his view later and I answered:

On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bień via Unicode wrote:
> On Fri, Nov 02 2018 at  5:09 -0700, Asmus Freytag via Unicode wrote:

[...]

>> All else is just applying visual hacks
>
> I don't mind hacks if they are useful and serve the intended purpose,
> even if they are visual :-)

[...]

>> at the possible cost of obscuring the contents.
>
> It's for the users of the transcription to decide what is obscuring the
> text and what, to the contrary, makes the transcription more readable
> and useful.

Please note that it's me who makes the transcription, it's me who has a
vision of the future use and users, and in consequence it's me who makes
the decision which aspects of text to encode. Accusing me of "abusing
Unicode" will not stop me from doing it my way.

I hope that at least James Kass understands my attitude:

On Mon, Oct 29 2018 at  7:57 GMT, James Kass via Unicode wrote:

[...]

> If I were entering plain text data from an old post card, I'd try to
> keep the data as close to the source as possible. Because that would
> be my purpose. Others might have different purposes.

A sign/abbreviation for "magister" - second question summary

2018-11-06 Thread Janusz S. Bień via Unicode


On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:
> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".

[...]

> The second question is: are you familiar with such or a similar symbol?
> Have you ever seen it in print?

Later I provided some additional information:

On Sat, Oct 27 2018 at 16:09 +0200, Janusz S. Bień via Unicode wrote:
>
> The postcard is from the front of the First World War, written by an
> Austro-Hungarian soldier. He explains the meaning of the abbreviation
> to his wife, so it looks like the abbreviation was used but not very
> popular.

On Sat, Oct 27 2018 at 20:25 +0200, Janusz S. Bień via Unicode wrote:

[...]

> In the meantime I looked up some other postcards written by the same
> person; I found several other abbreviations, including № 'NUMERO SIGN'
> (U+2116) written in the same way, i.e. with a double instead of a single
> line.

The similarity to № 'NUMERO SIGN' was mentioned quite often in the
thread; there seems to be no need to quote all these mentions here.

A more general observation was formulated by Richard Wordingham:

On Sun, Oct 28 2018 at  8:13 GMT, Richard Wordingham via Unicode wrote:

[...]

> The notation is a quite widespread format for abbreviations.  The
> first letter is normal sized, and the subsequent letter is written in
> some variety of superscript with a squiggle underneath so that it
> doesn't get overlooked.  

Various examples of such abbreviations were also mentioned several times
in the thread, but again there seems to be no need to quote all these
mentions here.

Nobody, however, reported any other occurrence of the symbol in question.

Best regards

Janusz


-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien



A sign/abbreviation for "magister" - first question summary

2018-11-06 Thread Janusz S. Bień via Unicode
On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:
> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".
>
> The first question is: how do you interpret the symbol? For me it is
> definitely the capital M followed by the superscript "r" (written in an
> old style no longer used in Poland), but there is something below the
> superscript. It looks like a small "z", but such an interpretation
> doesn't make sense to me.

I got two complementary answers almost immediately:

On Sat, Oct 27 2018 at 9:11 -0400, Robert Wheelock wrote:

> It is constructed much like the symbol for numero, only with a capital
> M accompanied by a superscript small r having an underbar (or
> double underbar).


On Sat, Oct 27 2018 at  6:58 -0700, Asmus Freytag via Unicode wrote:

[...]

> My suspicion would be that the small "z" is rather a "=" that
> acquired a connecting stroke as part of quick handwriting.  A./

and on the same day this interpretation was supported by Philippe Verdy:

On Sat, Oct 27 2018 at 20:35 +0200, Philippe Verdy via Unicode wrote:

[...]

> I have the same kind of reading: the zigzagging stroke is a
> handwritten emphasis of the superscript r above it (explicitly noting
> that it terminates the abbreviation), just like the small underline that
> sometimes appears below the superscript o in the abbreviation of
> "numero" (and sometimes there was not just one but two small
> underlines, including in some prints).
>
> This sample is a perfect example of fast cursive handwriting (due to
> the high variability of all the other letter shapes, sizes and joinings,
> where even the capital M is written as two unconnected strokes), and it's
> not abnormal in such conditions to see this cursive joining between the
> two underlining strokes so that it looks like a single zigzag.

Later it was summarized by James Kass:

On Fri, Nov 02 2018 at  2:59 GMT, James Kass via Unicode wrote:
> Alphabetic script users write things the way they are spelled and
> spell things the way they are written.  The abbreviation in question
> as written consists of three recognizable symbols.  An "M", a
> superscript "r", and an equal sign (= two lines).  It can be printed,
> handwritten, or in fraktur; it will still consist of those same three
> recognizable symbols.
>
> We're supposed to be preserving the past, not editing it or revising
> it.

It was commented on by Julian Bradfield:

On Fri, Nov 02 2018 at  8:54 GMT, Julian Bradfield via Unicode wrote:

[...]

> That's not true. The squiggle under the r is a squiggle - it is a
> matter of interpretation (on which there was some discussion a hundred
> messages up-thread or so :) whether it was intended to be = .
> Just as it is a matter of interpretation whether the superscript and
> squiggle were deeply meaningful to the writer, or whether they were
> just a stylistic flourish for Mr.

The abbreviation in question definitely consists of three symbols: an
"M", a superscript "r", and a third one, which I think was best
described by Robert Wheelock as a double (under)bar, with the connecting
stroke mentioned first by Asmus Freytag.

This third element was referred to, also by myself, as a squiggle, but
after looking up the definition of the word in a dictionary

  a short line that has been written or drawn and that curves and
  twists in a way that is not regular

I think this is a misnomer. Unfortunately I have no better proposal.

Best regards

Janusz


-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien



Re: Shortcuts question

2018-09-17 Thread Philippe Verdy via Unicode
Note: CLDR concentrates on keyboard layouts for text input. Layouts for
other functions (such as copy-pasting or gaming controls) are completely
different (and not necessarily bound directly to layouts for text, as they
may also have their own dedicated physical keys, or users can reprogram
their keyboard for this; for gaming, software should all have a way to
customize the layout according to users' needs, and should provide
reasonable defaults for at least the 3 base layouts: QWERTY, AZERTY and
QWERTZ, but I've never seen any game whose UI was tuned for Dvorak)

On Mon, 17 Sep 2018 at 16:42, Marcel Schneider wrote:

> On 17/09/18 05:38 Martin J. Dürst wrote:
> [quote]
> >
> > From my personal experience: A few years ago, installing a Dvorak
> > keyboard (which is what I use every day for typing) didn't remap the
> > control keys, so that Ctrl-C was still on the bottom row of the left
> > hand, and so on. For me, it was really terrible.
> >
> > It may not be the same for everybody, but my experience suggests that it
> > may be similar for some others, and that therefore such a mapping should
> > only be voluntary, not default.
>
> Got it, thanks!
>
> Regards,
>
> Marcel
>


Re: Shortcuts question

2018-09-17 Thread Marcel Schneider via Unicode
On 17/09/18 05:38 Martin J. Dürst wrote:
[quote]
> 
> From my personal experience: A few years ago, installing a Dvorak 
> keyboard (which is what I use every day for typing) didn't remap the 
> control keys, so that Ctrl-C was still on the bottom row of the left 
> hand, and so on. For me, it was really terrible.
> 
> It may not be the same for everybody, but my experience suggests that it 
> may be similar for some others, and that therefore such a mapping should 
> only be voluntary, not default.

Got it, thanks!

Regards,

Marcel



Re: Shortcuts question

2018-09-16 Thread Martin J. Dürst via Unicode

On 2018/09/16 21:08, Marcel Schneider via Unicode wrote:


> An additional level of complexity is induced by ergonomics, so that most
> non-Latin layouts may wish to stick with QWERTY, and even ergonomic
> layouts in the footsteps of August Dvorak rather than Shai Coleman are
> likely to offer variants with legacy Virtual Key mapping instead of
> staying congruent with graphics optimized for text input.


From my personal experience: A few years ago, installing a Dvorak 
keyboard (which is what I use every day for typing) didn't remap the 
control keys, so that Ctrl-C was still on the bottom row of the left 
hand, and so on. For me, it was really terrible.


It may not be the same for everybody, but my experience suggests that it 
may be similar for some others, and that therefore such a mapping should 
only be voluntary, not default.


Regards,   Martin.



Re: Shortcuts question

2018-09-16 Thread Philippe Verdy via Unicode
For games, the mnemonic meaning of keys is unlikely to be used, because
gamers prefer an ergonomic placement of their fingers according to the
physical position of essential commands.
But this won't apply to modifier-key combinations: these commands should be
single keystrokes, and pressing two keys instead of one would be impractical
and a disadvantage when playing.

That's why the four most common direction keys, A/D/S/W on a QWERTY layout,
become Q/D/S/Z on a French AZERTY layout. Games that use logical key
layouts based on QWERTY are almost unplayable if there's no interface to
customize these 4 keys. So games preferably use the virtual keys instead
for these commands, or will include builtin layouts adapted for AZERTY- and
QWERTZ-based layouts and still display the correct keycaps in the UI: games
normally don't force the switch to another US layout, so they still need to
use the logical layout, simply because they also need to allow users to
input real text and not just gaming commands (for messaging, or for
inputting custom players/objects created in the game itself, or to fill in
user profiles, or to enter a registration email or perform online logon
with the correct password), in which case they will also need to support
characters entered with control keys (AltGr, Shift, Control...), or with a
standard tactile panel on screen which will still display the common
localized layouts.

There are difficulties in games when some of their commands are mapped to
something other than basic Latin letters, including decimal digits: on
a French AZERTY keyboard, the digits are composed by pressing Shift, or in
ShiftLock mode (there's no CapsLock mode, as this ShiftLock is also released
when pressing Shift; just like on old French mechanical typewriters,
pressing ShiftLock again did not release it, and this ShiftLock applied to
all keys on the keyboard, including punctuation keys). On PC keyboards,
ShiftLock does not apply to the numeric pad, which has its separate NumLock,
now largely redundant, and which most users would like to disable completely
whenever there's a numeric pad separate from the directional pad. On these
extended keyboards, NumLock is just a nuisance, notably on the OS logon
screen, where Windows turns it off by default unless the BIOS locks it at
boot time, and a lot of BIOSes don't do that or don't have the option to set
it permanently.



On Sun, 16 Sep 2018 at 14:18, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

> On 15/09/18 15:36, Philippe Verdy wrote:
> […]
> > So yes, all control keys are potentially localisable to work best with
> > the base layout and remain mnemonic;
> > but the physical key position may be very different.
>
> An additional level of complexity is induced by ergonomics, so that most
> non-Latin layouts may wish to stick
> with QWERTY, and even ergonomic layouts in the footsteps of August Dvorak
> rather than Shai Coleman are
> likely to offer variants with legacy Virtual Key mapping instead of
> staying congruent with graphics optimized
> for text input. But again that is easier on Windows, where VKs are
> remapped separately, than on Linux, which
> appears to use graphics throughout to process application shortcuts, where
> only modifiers can be "preserved" for
> further processing, with no underlying letter map, which AFAIU does not
> exist on Linux.
>
> However, about keyboarding, that may be technically too detailed for this
> List, so I'll step out of this thread
> here. Please follow up in the parallel thread on CLDR-users instead.
>
> https://unicode.org/pipermail/cldr-users/2018-September/000837.html
>
> Thanks,
>
> Marcel
>
>
>


Re: Shortcuts question

2018-09-16 Thread Marcel Schneider via Unicode
On 15/09/18 15:36, Philippe Verdy wrote:
[…]
> So yes, all control keys are potentially localisable to work best with the 
> base layout and remain mnemonic;
> but the physical key position may be very different.

An additional level of complexity is induced by ergonomics, so that most 
non-Latin layouts may wish to stick 
with QWERTY, and even ergonomic layouts in the footsteps of August Dvorak 
rather than Shai Coleman are 
likely to offer variants with legacy Virtual Key mapping instead of staying 
congruent with graphics optimized 
for text input. But again that is easier on Windows, where VKs are remapped 
separately, than on Linux, which 
appears to use graphics throughout to process application shortcuts, where only 
modifiers can be "preserved" for
further processing, with no underlying letter map, which AFAIU does not exist 
on Linux.

However, about keyboarding, that may be technically too detailed for this List, 
so I'll step out of this thread 
here. Please follow up in the parallel thread on CLDR-users instead.

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

Thanks,

Marcel




Re: Shortcuts question

2018-09-15 Thread Philippe Verdy via Unicode
On Fri, 7 Sep 2018 at 05:43, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

> On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
> >
> > Hello. This may be slightly OT for this list but I'm asking it here as
> it concerns computer usage with multiple scripts and i18n:
>
> It actually belongs on CLDR-users list. But coming from you, it shall
> remain here while I’m posting a quick answer below.
>
> > 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> "tout" io Ctrl+A for "all"?
>
> No, Ctrl+A remains Ctrl+A on a French keyboard.
>

Yes, but the location on the keyboard maps to the same as CTRL+Q on a QWERTY
layout: CTRL+ASCII-letter shortcuts are mapped according to the layout of the
letter (without pressing CTRL) on the localized keyboard. Some keyboard layouts
don't have all the basic Latin letters, because their language doesn't need
them (e.g. it may only have one of Q or K, but no C, or it may have no W, or
some letters may be holding combined diacritics or could be ligatures), but
usually the basic Latin letter is still accessible by pressing another
control key or by switching the layout mode.

On non-Latin keyboard layouts there's much more freedom, and CTRL+A may be
localized according to the main base letter assigned to the key (the
position of the Latin letter is not always visible).

On tactile layouts you cannot guess where CTRL+Latin letter is located;
actually it may be accessible very differently on a separate layout for
controls, where they will be translated: the CTRL key is not necessarily
present, usually replaced by a single key for input mode selection (which
may switch languages, or to emojis, or to
symbols/punctuation/digits)...

The problematic control keys are those like "CTRL+[" (assuming ASCII as the
base layout) where "[" is not present or is mapped very differently. As well,
"CTRL+1"..."CTRL+0" may conflict with the assignment of ASCII controls like
"CTRL+[".

So yes, all control keys are potentially localisable to work best with the
base layout and remain mnemonic; but the physical key position may be
very different.
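
The "CTRL+[" case is special because terminals derive control codes from the
letter's ASCII value; here is a one-line sketch of that classic mapping (my
own illustration, not from the thread):

    # Ctrl+X maps to ASCII code (X & 0x1F): Ctrl+A -> 0x01, Ctrl+[ -> 0x1B (ESC)
    def ctrl(ch: str) -> str:
        return chr(ord(ch.upper()) & 0x1F)

    assert ctrl("A") == "\x01"
    assert ctrl("[") == "\x1b"  # why Ctrl+[ doubles as Escape on terminals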


Re: Shortcuts question (is: Thread transfer info)

2018-09-07 Thread Marcel Schneider via Unicode
Hello,

I’ve followed up on CLDR-users:

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

As a sidenote — it might be hard to get a selection of discussions 
to actually happen on CLDR-users instead of the Unicode Public mail list, 
as long as subscribers of this list don’t necessarily subscribe to 
the other list too, which still has far fewer subscribers than Unicode Public.

Regards,

Marcel



Re: Shortcuts question

2018-09-07 Thread Christoph Päper via Unicode
Shriramana Sharma:
> 
> 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> "tout" io Ctrl+A for "all"?

Some are, many are not. For instance, some text editors use a modifier key with 
F and K instead of B and I for bold ("fett") and italic ("kursiv").

> 2) How about when the shortcuts are the Alt+ combinations referring to
> underlined letters in actual user visible strings?

Those are much more language dependent than Ctrl/Cmd shortcuts.

> 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt
> the other XCV shortcuts) Z key or the Y key which is in the physical
> position of the QWERTY Z key (and close to the other XCV shortcuts)?

For some shortcuts the key position is more important (e.g. the key to the left 
of the 1 key), for others it's the initial / conventional letter of the command. 
Most QWERTZ users are not used to having the undo shortcut (Z) next to the keys 
for cut (X), copy (C) and paste (V). By the way, the accompanying redo is 
notoriously inconsistent, sometimes Y, sometimes Shift+Z.

More serious problems arise with non-letter keys. For instance, square brackets 
[ and ] are readily available on the US / English keyboard layout, but require 
modifier keys like Shift or Alt on many other keyboard layouts, which may be 
the same ones as for the curly braces { and }. This means some seemingly 
simple and intuitive shortcuts on an English keyboard become cumbersome on 
international ones.
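
The position-versus-letter trade-off in question 3 can be made concrete with a
toy model (hypothetical two-key layouts, not a real keyboard API):

    # Physical key position -> character produced, for two layouts.
    QWERTY = {"KeyZ": "z", "KeyY": "y"}
    QWERTZ = {"KeyZ": "y", "KeyY": "z"}

    def undo_key(layout, match_by_letter=True):
        """Return the physical key a user must press for Ctrl+Z (undo)."""
        if match_by_letter:
            # Match the character: whichever key types "z" on this layout.
            return next(pos for pos, ch in layout.items() if ch == "z")
        return "KeyZ"  # match the physical position, QWERTY-style

    print(undo_key(QWERTY))         # KeyZ - both strategies agree
    print(undo_key(QWERTZ))         # KeyY - undo moves away from X/C/V
    print(undo_key(QWERTZ, False))  # KeyZ - stays near X/C/V but types "y"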


Re: Shortcuts question

2018-09-06 Thread Marcel Schneider via Unicode
On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
> 
> Hello. This may be slightly OT for this list but I'm asking it here as it 
> concerns computer usage with multiple scripts and i18n:

It actually belongs on CLDR-users list. But coming from you, it shall remain 
here while I’m posting a quick answer below.

> 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for "tout" 
> io Ctrl+A for "all"?

No, Ctrl+A remains Ctrl+A on a French keyboard.

> 2) How about when the shortcuts are the Alt+ combinations referring to 
> underlined letters in actual user visible strings?

I don’t know, but the accelerator shortcuts usually process text input, so it 
would be up to the vendor to keep them in sync.

> 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt the 
> other XCV shortcuts) Z key or the Y key
> which is in the physical position of the QWERTY Z key (and close to the other 
> XCV shortcuts)?

On Windows, which this question refers to, virtual keys move around with 
graphics on Latin keyboards. While Ctrl+Z on QWERTZ is 
not handy, I can tell that on AZERTY it is Ctrl+Z, with the key that has Z on 
it and types "z". The latter is most relevant on Linux,
where graphics are used even to process the Ctrl+ shortcuts.

> 4) How are shortcuts handled in the case of non Latin keyboards like Cyrillic 
> or Japanese?

On Windows, as they depend on Virtual Keys, they may be laid out on an 
underlying QWERTY basis. The same may apply on macOS, 
where distinct levels are present in the XML keylayout (and likewise in 
system-shipped layouts) to map the letters associated with
shortcuts, regardless of the script. On Linux, shortcuts are reported not to 
work on some non-Latin keyboard layouts (because key names
are based on ISO key positions, and XKB doesn't appear to use a "Group0" level 
to map the shortcut letters; this needs to be investigated).

> 4a) I mean how are they displayed on screen? 

My short answer is: I’ve got no experience; maybe using Latin letters and 
locale labels.

> 4b) Like #1 above, are they changed per language?

Non-Latin scripts typically use QWERTY for ASCII input, so shortcuts may not be 
changed per language.

> 4c) Like #2 above, how about for user visible shortcuts?

Again I’m leaving this over to non-Latin script experts.

> (In India since English is an associate official language, most computer 
> users are at least conversant with basic English
> so we use the English/QWERTY shortcuts even if the keyboard physically shows 
> an Indic script.)

The same applies to virtually any non-Latin locale. Michael Kaplan reported 
that VKs move around only on Latin keyboards.

> Thanks!

You are welcome.

Marcel



Shortcuts question

2018-09-06 Thread Shriramana Sharma via Unicode
Hello. This may be slightly OT for this list but I'm asking it here as it
concerns computer usage with multiple scripts and i18n:

1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
"tout" io Ctrl+A for "all"?

2) How about when the shortcuts are the Alt+ combinations referring to
underlined letters in actual user visible strings?

3) In a QWERTZ layout for Undo should one still press the (dislocated wrt
the other XCV shortcuts) Z key or the Y key which is in the physical
position of the QWERTY Z key (and close to the other XCV shortcuts)?

4) How are shortcuts handled in the case of non Latin keyboards like
Cyrillic or Japanese?

4a) I mean how are they displayed on screen?

4b) Like #1 above, are they changed per language?

4c) Like #2 above, how about for user visible shortcuts?

(In India since English is an associate official language, most computer
users are at least conversant with basic English so we use the
English/QWERTY shortcuts even if the keyboard physically shows an Indic
script.)

Thanks!


Re: Question about Karabakh Characters

2017-10-05 Thread Michael Everson via Unicode
It is legitimate to add characters for Armenian dialectology, and if you can 
provide additional evidence of usage in lexicography and (if possible) in other 
literature, we can see if a proposal can be made. 

We may do this offline so as to save the list from too many files. I look 
forward to hearing from you. Nothing will happen, though, without further 
information. 

Michael

> On 5 Oct 2017, at 06:09, via Unicode <unicode@unicode.org> wrote:
> 
> Thank you for your reply.
> I am currently handling technical support to publish in multi-language.
> 
> This was found when we were handling a project on the Karabakh language.
> I was informed that Karabakh has a dictionary, produced in 2013, containing 
> over 40,000 words, which employs the three characters.
> I personally have not seen this dictionary, but it seems that there are 
> people who need these characters.
> So I decided to make a post.
> 
> Kazunari Tsuboi
> 
> -Original Message-
> From: Michael Everson [mailto:ever...@evertype.com] 
> Sent: Wednesday, October 4, 2017 11:31 PM
> To: Tsuboi, Kazunari
> Cc: unicode Unicode Discussion
> Subject: Re: Question about Karabakh Characters
> 
> They are not encoded, but that example is not sufficient. If you’d like to 
> contact me offline we can discuss this further.
> 
> Michael Everson
> 
>> On 4 Oct 2017, at 08:39, via Unicode <unicode@unicode.org> wrote:
>> 
>> Hi there,
>> 
>> The Karabakh language uses Armenian characters, but the following 
>> characters have no Unicode code points assigned. (image1.JPG attached) They 
>> are pronounced “Yi”, “Ini” and “Eh” and are used in several 
>> combinations. (Image2.JPG attached)
>> 
>> Is there any reason these characters are not supported by Unicode?
>> I would appreciate any related information.
>> 
>> Thank you!
>> 
>> Kazunari Tsuboi
>> 
> 
> 




RE: Question about Karabakh Characters

2017-10-04 Thread via Unicode
Thank you for your reply.
I am currently handling technical support to publish in multi-language.

This was found when we were handling a project on the Karabakh language.
I was informed that Karabakh has a dictionary, produced in 2013, containing 
over 40,000 words, which employs the three characters.
I personally have not seen this dictionary, but it seems that there are 
people who need these characters.
So I decided to make a post.

Kazunari Tsuboi

-Original Message-
From: Michael Everson [mailto:ever...@evertype.com] 
Sent: Wednesday, October 4, 2017 11:31 PM
To: Tsuboi, Kazunari
Cc: unicode Unicode Discussion
Subject: Re: Question about Karabakh Characters

They are not encoded, but that example is not sufficient. If you’d like to 
contact me offline we can discuss this further.

Michael Everson

> On 4 Oct 2017, at 08:39, via Unicode <unicode@unicode.org> wrote:
> 
> Hi there,
>  
> The Karabakh language uses Armenian characters, but the following 
> characters have no Unicode code points assigned. (image1.JPG attached) They 
> are pronounced “Yi”, “Ini” and “Eh” and are used in several 
> combinations. (Image2.JPG attached)
>  
> Is there any reason these characters are not supported by Unicode?
> I would appreciate any related information.
>  
> Thank you!
>  
> Kazunari Tsuboi
> 




Re: Question about Karabakh Characters

2017-10-04 Thread Michael Everson via Unicode
They are not encoded, but that example is not sufficient. If you’d like to 
contact me offline we can discuss this further.

Michael Everson

> On 4 Oct 2017, at 08:39, via Unicode  wrote:
> 
> Hi there,
>  
> The Karabakh language uses Armenian characters, but the following characters 
> have no Unicode code points assigned. (image1.JPG attached)
> They are pronounced “Yi”, “Ini” and “Eh” and are used in several combinations. 
> (Image2.JPG attached)
>  
> Is there any reason these characters are not supported by Unicode?
> I would appreciate any related information.
>  
> Thank you!
>  
> Kazunari Tsuboi
> 




Question about Karabakh Characters

2017-10-04 Thread via Unicode
Hi there,

The Karabakh language uses Armenian characters, but the following characters 
have no Unicode code points assigned. (image1.JPG attached)
They are pronounced "Yi", "Ini" and "Eh" and are used in several combinations. 
(Image2.JPG attached)

Is there any reason these characters are not supported by Unicode?
I would appreciate any related information.

Thank you!

Kazunari Tsuboi


Re: XCCS (was: Historical question about 'universal signs')

2016-10-24 Thread seth erickson
See pp. 57-63 of this:

Xerox. (1985). Xerox System Network Architecture: General Information
Manual (No. XNSG 068504). Retrieved from
http://archive.org/details/bitsavers_xeroxxnsXNNetworkArchitectureGeneralInformationMan_10024221

SE

On Sun, Oct 23, 2016 at 10:01 AM, Doug Ewell  wrote:

> seth erickson wrote:
>
>> XCCS is fairly well documented
>
> That hasn't been my experience. I'd be interested in any links you can
> forward that go beyond "Unicode built on" or "drew ideas from" or "was
> influenced by" XCCS.
>
> Thanks,
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org


XCCS (was: Historical question about 'universal signs')

2016-10-23 Thread Doug Ewell

seth erickson wrote:

> XCCS is fairly well documented


That hasn't been my experience. I'd be interested in any links you can 
forward that go beyond "Unicode built on" or "drew ideas from" or "was 
influenced by" XCCS.


Thanks,

--
Doug Ewell | Thornton, CO, US | ewellic.org 



Historical question about 'universal signs'

2016-10-21 Thread seth erickson
Greetings Unicoders,

I'm trying to find information (for research purposes) about a character
set mentioned in Joseph Becker's 1988 draft proposal [1]:

"In 1978, the initial proposal for a set of 'Universal Signs' was made by
Bob Belleville at Xerox PARC. Many persons contributed ideas to the
development of a new encoding design. Beginning in 1980, these efforts
evolved into the Xerox Character Code Standard (XCCS) [...]"

XCCS is fairly well documented but I'm having trouble finding anything
about the proposal by Bob Belleville. Any pointers would be appreciated.

Thanks,

Seth Erickson
PhD student
Department of Information Studies
University of California, Los Angeles


[1] http://unicode.org/history/unicode88.pdf


Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Karl Williamson

On 11/06/2015 01:32 PM, Richard Wordingham wrote:

> On Thu, 05 Nov 2015 13:41:42 -0700
> "Doug Ewell" wrote:
>
>> Richard Wordingham wrote:
>>
>>> No-one's claiming it is for a Unicode Transformation Format (UTF).
>>
>> Then they ought not to call it "UTF-8" or "extended" or "modified"
>> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
>> based on UTF-8.
>>
>> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
>> -- not arbitrary integers -- onto byte sequences. [D92]
>
> If it extends the mapping of Unicode scalar values *into* byte
> sequences, then it's an extension.  A non-trivial extension of a
> mapping of scalar values has to have a larger domain.
>
> I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.
>
> Richard.



I have no idea how my original message ended up being marked to send to 
this list.  I'm sorry.  It was meant to be a personal message for 
someone who I believe was involved in the original design.


Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Otto Stolz

On 05.11.2015 at 23:11, Ilya Zakharevich wrote:

> First of all, “reserved” means that they have no meaning.  Right?


Almost.

“Reserved” means that they have currently no meaning
but may be assigned a meaning later; hence you ought
not to use them, lest your programs, or data, be invalidated
by later amendments of the pertinent specification.

In contrast, “invalid”, or “ill-formed” (the Unicode term),
means that the particular bit pattern may never be used
in a sequence that purports to represent Unicode characters.
In practice, that means that no program is allowed to
send those ill-formed patterns in Unicode-based data exchange,
and every program should refuse to accept those ill-formed
patterns in Unicode-based data exchange.

What a program does internally is at the discretion (or should
I say: “whim”?) of its author, of course – as long as the
overall effect of the program complies with the standard.

Best wishes,
  Otto Stolz







Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Richard Wordingham
On Thu, 05 Nov 2015 13:41:42 -0700
"Doug Ewell"  wrote:

> Richard Wordingham wrote:
> 
> > No-one's claiming it is for a Unicode Transformation Format (UTF).
> 
> Then they ought not to call it "UTF-8" or "extended" or "modified"
> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
> based on UTF-8.

> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
> -- not arbitrary integers -- onto byte sequences. [D92]

If it extends the mapping of Unicode scalar values *into* byte
sequences, then it's an extension.  A non-trivial extension of a
mapping of scalar values has to have a larger domain.

I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.

Richard.


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Philippe Verdy
It won't represent any valid Unicode code point (no standard scalar value
defined), so if you use those leading bytes, don't pretend it is for
"UTF-8" (not even "modified UTF-8", which is the variant created in Java for
its internal serialization of unrestricted 16-bit strings, including for
lone surrogates, and modified also in its representation of U+0000 as
<0xC0,0x80> instead of <0x00> in standard UTF-8). You'll have to create
your own charset identifier (e.g. "perl5-UTF-8-extended" or some name
derived from your Perl5 library) and say it is not for use for interchange
of standard text.

The extra code points you'll get are then necessarily for private use (but
still not part of the standard PUA set), and have absolutely no defined
properties from the standard. They should not be used to represent any
Unicode character or character sequence. In any API taking some text input,
those code points will never be decoded and will behave on input like
encoding errors.

But these extra code points could be used to represent something else, such
as unique object identifiers for internal use in your application, or
virtual object pointers, or shared memory block handles,
file/pipe/stream I/O handles, service/API handles, user ids, security
tokens, 64-bit content hashes plus some binary flags,
placeholders/references for members in an external unencoded collection or
for URIs, or internal glyph ids when converting text for rendering with one
or more fonts, or some internal serialization of geometric
shapes/colors/styles/visual effects...

In the standard UTF-8 those extra byte values are not "reserved" but
permanently assigned to be "invalid", and there are no valid encoded
sequences as long as 12 or 13 bytes (0xFF was reserved only in the old RFC
version of UTF-8, when it allowed code points up to 31 bits, but even that
RFC is obsolete, should no longer be used, and was never approved by
Unicode).


2015-11-05 16:57 GMT+01:00 Karl Williamson wrote:

> Hi,
>
> Several of us are wondering about the reason for reserving bits for the
> extended UTF-8 in perl5.  I'm asking you because you are the apparent
> author of the commits that did this.
>
> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the
> length of the sequence of bytes that comprise a single character to be 13
> bytes.  This allows code points up to 2**72 - 1 to be represented. If the
> length had been instead 12 bytes, code points up to 2**66 - 1 could be
> represented, which is enough to represent any code point possible in a
> 64-bit word.
>
> The comments indicate that these extra bits are "reserved".  So we're
> wondering what potential use you had thought of for these bits.
>
> Thanks
>
> Karl Williamson
>


Question about Perl5 extended UTF-8 design

2015-11-05 Thread Karl Williamson

Hi,

Several of us are wondering about the reason for reserving bits for the 
extended UTF-8 in perl5.  I'm asking you because you are the apparent 
author of the commits that did this.


To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the 
length of the sequence of bytes that comprise a single character to be 
13 bytes.  This allows code points up to 2**72 - 1 to be represented. 
If the length had been instead 12 bytes, code points up to 2**66 - 1 
could be represented, which is enough to represent any code point 
possible in a 64-bit word.
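
The arithmetic checks out if each byte after the 0xFF start byte carries 6
payload bits; a quick sketch (my reconstruction of the byte-count math, not
Perl's actual decoder):

    # The 0xFF start byte carries no payload; each following byte adds 6 bits.
    def max_code_point(total_bytes: int) -> int:
        return 2 ** (6 * (total_bytes - 1)) - 1

    assert max_code_point(13) == 2**72 - 1  # the 13-byte sequences described
    assert max_code_point(12) == 2**66 - 1  # enough for any 64-bit code point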


The comments indicate that these extra bits are "reserved".  So we're 
wondering what potential use you had thought of for these bits.


Thanks

Karl Williamson


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Markus Scherer
On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy  wrote:

> (0xFF was reserved only in the old RFC version of UTF-8 when it allowed
> code points up to 31 bits, but even this RFC is obsolete and should no
> longer be used and it has never been approved by Unicode).
>

No, even in the original UTF-8 definition, "The octet values FE and FF
never appear." https://tools.ietf.org/html/rfc2279
The highest lead byte was 0xFD.

(For the "really original" version see
http://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf)

In the current definition, "The octet values C0, C1, F5 to FF never
appear." https://tools.ietf.org/html/rfc3629 =
https://tools.ietf.org/html/std63
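
Summarized as lead-byte predicates (my own condensation of the two documents):

    # RFC 3629 (current): "The octet values C0, C1, F5 to FF never appear."
    def utf8_lead_byte(b: int) -> bool:
        return b <= 0x7F or 0xC2 <= b <= 0xF4

    # RFC 2279 (obsolete, 31-bit range): lead bytes ran up to 0xFD;
    # FE and FF never appeared.
    def rfc2279_lead_byte(b: int) -> bool:
        return b <= 0x7F or 0xC0 <= b <= 0xFD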

markus


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Richard Wordingham
On Thu, 5 Nov 2015 18:25:05 +0100
Philippe Verdy  wrote:

> But these extra code points could be used to represent someting else
> such as unique object identifier for internal use in your
> application, or virtual object pointers, or or shared memory block
> handles, file/pipe/stream I/O handles, service/API handles, user ids,
> security tokens, 64-bit content hashes plus some binary flags,
> placeholders/references for members in an external unencoded
> collection or for URIs, or internal glyph ids when converting text
> for rendering with one or more fonts, or some internal serialization
> of geometric shapes/colors/styles/visual effects...)

No-one's claiming it is for a Unicode Transformation Format (UTF).  A
possibly relevant example of a something else is a non-precomposed
grapheme cluster, as in Perl6's NFG.  (This isn't a PUA encoding, as
the precomposed characters are created on the fly.)

Richard.


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Doug Ewell
Richard Wordingham wrote:

> No-one's claiming it is for a Unicode Transformation Format (UTF).

Then they ought not to call it "UTF-8" or "extended" or "modified"
UTF-8, or anything of the sort, even if the bit-shifting algorithm is
based on UTF-8.

"UTF-8 encoding form" is defined as a mapping of Unicode scalar values
-- not arbitrary integers -- onto byte sequences. [D92]

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Ilya Zakharevich
On Thu, Nov 05, 2015 at 08:57:16AM -0700, Karl Williamson wrote:
> Several of us are wondering about the reason for reserving bits for
> the extended UTF-8 in perl5.  I'm asking you because you are the
> apparent author of the commits that did this.

To start, the INTERNAL REPRESENTATION of Perl’s strings is the «utf8»
format (not «UTF-8», «extended» or not).  [I see that this misprint
caused a lot of stir here!]

However, outside of a few contexts, this internal representation
should not be visible.  (However, some of these contexts are close to
the default, like read/write in Unicode mode, with -C switch.)

Perl’s string is just a sequence of Perl’s unsigned integers.
[Depending on the build, this may be, currently, 32-bit or 64-bit.]
By convention, the “meaning” of small integers coincides with what
Unicode says.

> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes
> the length of the sequence of bytes that comprise a single character
> to be 13 bytes.  This allows code points up to 2**72 - 1 to be
> represented. If the length had been instead 12 bytes, code points up
> to 2**66 - 1 could be represented, which is enough to represent any
> code point possible in a 64-bit word.
> 
> The comments indicate that these extra bits are "reserved".  So
> we're wondering what potential use you had thought of for these
> bits.

First of all, “reserved” means that they have no meaning.  Right?

Second, there are 2 ways in which one may need this INTERNAL format to
be extended:
  • 128-bit architectures may be at hand (sooner or later).
  • One may need to allow “objects” to be embedded into Perl strings.

With embedded objects, one must know how to kill them when the string
(or its part) is removed.  So, while a pointer can fit into a Perl
integer, one needs to specify what to do: call DESTROY, or free(), or
a user-defined function.

This gives 5 possibilities (3 extra bits) which may be needed with
“slots” in Perl strings.
  • Integer (≤64 bits)
  • Integer (≥65 bits) 
  • Pointer to a Perl object
  • Pointer to a malloc()ed memory
  • Pointer to a struct which knows how to destroy itself:
  struct self_destroy { void *content; void (*destroy)(struct self_destroy *); };

Why one may need objects embedded into strings?  I explained it in
   http://ilyaz.org/interview
(look for «Emacs» near the middle).

Hope this helps,
Ilya


Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Philippe Verdy
2015-11-05 23:11 GMT+01:00 Ilya Zakharevich wrote:
>
>   • 128-bit architectures may be at hand (sooner or later).

This is speculation about something that is still not envisioned: a global
worldwide working space where users and applications would interoperate
transparently in a giant virtualized environment. However, this virtualized
environment will be supported by 64-bit OSes that will never need native
support of more than 64-bit pointers. Those 128-bit entities needed for
addressing will not be used to work on units of data but to address some
small selection of remote entities.

Software that required parsing complete chunks of memory data
larger than 64 bits would be extremely inefficient; instead this data will
be internally structured/paged, and only virtually mapped to some 128-bit
global reference (such as GUIDs/UUIDs) only to select smaller chunks within
the structure (and in most cases those chunks will remain in a 32-bit space;
even in today's 64-bit OSes, the largest pages are 20-bit wide, but
typically 10-bit wide (512-byte sectors) to 12-bit wide (standard VMM and
I/O page sizes, networking MTUs), or about 16-bit wide (such as the
transmission window for TCP)). This will not evolve significantly before a
major evolution in the worldwide Internet backbones requires more than
about 1 Gigabit/s (a speed not even needed for 4K HD video, but needed only
in massive computing grids, still built with a complex mesh of much slower
data links).

With 64-bit we already reach the physical limits of networking links, and
higher speeds using large buses are only for extremely local links whose
lengths are largely below a few millimeters, within chips themselves.

128 bits, however, are possible not for the working spaces (or document
sizes); it is very unlikely that the ANSI C/C++ "size_t" type will ever be
more than 64-bit (except for a few experiments which will fail to be more
efficient).

What is more realistic is that internal buses and caches will be 128 bits or
even larger (this is already true for GPU memory), only to support more
parallelism or massive parallelism (and typically by using vectored
instructions working on sets of smaller values).

And some data need 128-bit values for their numerical ranges (ALUs in
CPUs/GPUs/APUs are already 128-bit, as are common floating point types)
where extra precision is necessary.

I doubt we'll ever see any true native 128-bit architecture in our remaining
lifetime. We are still very far from the limit of the 64-bit
architecture, and it won't happen before the next century (if the current
sequential binary model for computing is still used at that time; maybe
computing will use predictive technologies returning only heuristic results
with a very high probability of giving a good solution to the problems
we'll need to solve extremely rapidly, and those solutions will then be
validated using today's binary logic with 64-bit computing).

Even in the case where a global 128-bit networking space would appear,
users will never be exposed to all of it; most of this content will be
inaccessible to them (restricted by security or privacy concerns) and simply
unmanageable by them: no one on earth is able to have any idea of what
2^64 bits of global data represents; no one will ever need it in their
whole life. That amount of data will only be partly implemented by large
organisations trying to build a giant cloud and wishing to interoperate by
coordinating their addressing spaces (for that we now have IPv6).

So your "sooner or later" is very optimistic.

IMHO we'll stay with 64-bit architectures for very long, up to the time
when our sequential computing model is deprecated and the concept of native
integer sizes is obsoleted and replaced by other kinds of computing
"units" (notably parallel vectors, distributed computing, and heuristic
computing, or maybe optical computing based on Fourier transforms on analog
signals, or quantum computing, where our simple notion of "integers" or
even "bits" will not even be placeable into individual physically placed
units; their persistence will not even be localized, and there will be
redundant/fault-tolerant placements).

In fact our computing limits will no longer be in terms of storage space,
but in terms of access time, distance and predictability of results.

The next technologies for faster computing will certainly be
predictive/probabilistic rather than affirmative (as with today's Turing/Von
Neumann machines). "Algorithms" for working with them will be completely
different. Fuzzy logic will be everywhere, and we'll need binary logic even
less, except for small problems. We'll have to live with the possibility of
errors, but we already have to live with them even with our binary logic
(due to human bugs, hardware faults, accidents, and so on...). In most
problems we don't even need to have 100% proven solutions

(e.g. viewing a high-quality video, we already accept the 

Re: Question about the Sentence_Break property

2015-02-21 Thread Karl Williamson

On 02/20/2015 04:56 PM, Philippe Verdy wrote:

2015-02-20 6:14 GMT+01:00 Richard Wordingham richard.wording...@ntlworld.com:

TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
One thing that is missing is mention of the convention that a single
newline character (or CRLF pair) is a line break whereas a doubled
newline character denotes a paragraph break.


In that case CR or LF characters alone are not paragraph separators by
themselves unless they are grouped together. Like NEL, they should just
be considered as line separators, and the terminology used in UAX 29 rule
SB4 is effectively incorrect if what matters here is just the linebreak
property. And also in that case, the SB4 rule should effectively include
NEL (from the C1 subset).

But as SB4 is only related to sentence breaking, it would be a problem
because simple linebreaks are used extremely frequently in the middle of
sentences.

What the Sentence break algorithm should say is that there should first
be a preprocessing step separating line breaks and paragraph breaks,
creating custom entities (similar to collation elements, but encoded
internally with a code point out of the standard space) that the rule
SB4 would use instead of Sep | CR | LF. That custom entity should be
Sep but without the rule defining it, as there are various ways to
represent paragraph breaks.
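
A minimal sketch in C of such a preprocessing step, assuming UTF-32 text in
memory and two sentinel values chosen just above the Unicode code space (the
names LINE_SENTINEL and PARA_SENTINEL are illustrative, not part of UAX #29):

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SENTINEL 0x110000u /* hypothetical, outside the code space */
    #define PARA_SENTINEL 0x110001u /* hypothetical, outside the code space */

    /* Rewrite a UTF-32 buffer in place: a single CR, LF or CRLF becomes
     * LINE_SENTINEL; two or more in a row become one PARA_SENTINEL.  A
     * tailored SB4 would then break only after PARA_SENTINEL (and Sep)
     * instead of after every Sep | CR | LF.  Returns the new length. */
    static size_t map_newlines(uint32_t *s, size_t n)
    {
        size_t in = 0, out = 0;
        while (in < n) {
            if (s[in] != 0x000D && s[in] != 0x000A) {
                s[out++] = s[in++];
                continue;
            }
            int breaks = 0;
            while (in < n && (s[in] == 0x000D || s[in] == 0x000A)) {
                if (s[in] == 0x000D && in + 1 < n && s[in + 1] == 0x000A)
                    in++;                /* count a CRLF pair as one */
                in++;
                breaks++;
            }
            s[out++] = (breaks >= 2) ? PARA_SENTINEL : LINE_SENTINEL;
        }
        return out;
    }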



But isn't SB4 contradictory to this from TUS Section 5.8?

R2c In parsing, choose the safest interpretation.

For example, in recommendation R2c an implementer dealing with sentence
break heuristics would reason in the following way that it is safer to
interpret any NLF as LS:

• Suppose an NLF were interpreted as LS, when it was meant to be PS.
Because most paragraphs are terminated with punctuation anyway, this would
cause misidentification of sentence boundaries in only a few cases.

• Suppose an NLF were interpreted as PS, when it was meant to be LS. In
this case, line breaks would cause sentence breaks, which would result in
significant problems with the sentence break heuristics.
It seems to me SB4 is choosing the non-safer way.  What am I missing?

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about the Sentence_Break property

2015-02-19 Thread Richard Wordingham
On Thu, 19 Feb 2015 19:55:20 -0700
Karl Williamson pub...@khwilliamson.com wrote:

 UAX 29 says this:
 
 Break after paragraph separators.
 SB4.  Sep | CR | LF  ÷
 
 Why are CR and LF considered to be paragraph separators?  NEL and
 Line Break are as well.
 
 My mental model of plain text has it containing embedded characters, 
 which I'll call \n, to allow it to be displayed in a terminal window
 of a given width.  Not all text is like that, of course, but there is
 an awful lot that is.  This rule makes no sense to me.

There are two types of plain text - that which requires explicit
line-breaking, and that which does not.  This is a case where a
non-linguistic tailoring is required.

TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
One thing that is missing is mention of the convention that a single
newline character (or CRLF pair) is a line break whereas a doubled
newline character denotes a paragraph break.

Richard.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Question about the Sentence_Break property

2015-02-19 Thread Karl Williamson

UAX 29 says this:

Break after paragraph separators.
SB4.  Sep | CR | LF  ÷

Why are CR and LF considered to be paragraph separators?  NEL and Line 
Break are as well.


My mental model of plain text has it containing embedded characters, 
which I'll call \n, to allow it to be displayed in a terminal window of 
a given width.  Not all text is like that, of course, but there is an 
awful lot that is.  This rule makes no sense to me.


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Steffen Nurpmeso
Philippe Verdy verd...@wanadoo.fr wrote:
 |glibc is no more broken than any other C library implementing toupper and
 |tolower from the legacy ctype standard library. These are old APIs that
 |are just widely used and still have valid contexts where they are simple and
 |safe to use. But they are not meant to convert text.

Hah!  Legacy is good..  I'd wish a usable successor were already
standardized by ISO C.

--steffen
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about Uppercase in DerivedCoreProperties.txt

2014-11-10 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 glibc is no more broken than any other C library implementing toupper
 and tolower from the legacy ctype standard library. These are old
 APIs that are just widely used and still have valid contexts where they
 are simple and safe to use. But they are not meant to convert text.

Well, of course they are *meant* to convert text. They're just not very
good at it.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Philippe Verdy
Successors that convert strings instead of just isolated characters
(sorry, they are NOT what we need to handle texts; they are not even
equivalent to Unicode characters, they are just code units, most often
8-bit with char or 16-bit only with wchar_t!) already exist in all C
libraries (including glibc), under different names unfortunately (this is
the main reason why there are complex header files trying to find the
appropriate name, and providing a default basic implementation that just
scans individual characters to filter them with tolower and toupper: this
is a bad practice).

Good libraries should all contain a safe implementation of case conversion
of strings, and software should use them (and not reinvent this old bad
trick, just because it works with basic English).
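
A toy C program illustrating the trap, assuming an ISO 8859-1 single-byte
locale (so that 0xDF is ß): per-character toupper() cannot grow the string,
so "straße" can never become "STRASSE" this way:

    #include <ctype.h>
    #include <stdio.h>

    /* Per-character conversion: exactly one output character per input
     * character, so the 1-to-2 mapping ß -> SS is out of reach. */
    static void upper_per_char(char *s)
    {
        for (; *s; s++)
            *s = (char)toupper((unsigned char)*s);
    }

    int main(void)
    {
        char word[] = "stra\xDF" "e";   /* "straße" in ISO 8859-1 */
        upper_per_char(word);
        printf("%s\n", word);           /* "STRAßE" at best, never "STRASSE" */
        return 0;
    }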


2014-11-10 13:41 GMT+01:00 Steffen Nurpmeso sdao...@yandex.com:

 Philippe Verdy verd...@wanadoo.fr wrote:
  |glibc is no more broken than any other C library implementing toupper and
  |tolower from the legacy ctype standard library. These are old APIs that
  |are just widely used and still have valid contexts where they are simple
  |and safe to use. But they are not meant to convert text.

 Hah!  Legacy is good..  I'd wish a usable successor were already
 standardized by ISO C.

 --steffen
 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Philippe Verdy
The equivalent of strtolower() and strtoupper() is implemented in all C
libraries I know (yes, including glibc) and have worked with on various
OSes (and for a very long time!), even if their names change (because of
the unfortunate lack of standardization of their interaction with C
locales).

The standardisation of these two functions should have been done long ago,
even if the locales support could be limited to the legacy basic "C"
locale with limited functionality, where these functions would just scan
characters through strings to convert them with toupper() and tolower().
But then glibc and other libraries would have implemented this standard.
For now, we still need complex config scripts to detect the correct
headers to include, or to provide a basic implementation via various macros.

The standard C++ string package could then have used this standard
internally in the methods exposed in its API. I cannot understand why this
simple effort was never made on such basic functionality, needed and used
in almost all software and OSes.

2014-11-10 19:55 GMT+01:00 Steffen Nurpmeso sdao...@yandex.com:

 Philippe Verdy verd...@wanadoo.fr wrote:
  |Successors that convert strings instead of just isolated characters (sorry,
  |they are NOT what we need to handle texts, they are not even equivalent
  |to Unicode characters, they are just code units, most often 8-bit with
  |char or 16-bit only with wchar_t!) already exist in all C libraries
  |(including glibc), under different names unfortunately (this is the main
  |reason why there are complex header files trying to find the appropriate
  |name, and providing a default basic implementation that just scans
  |individual characters to filter them with tolower and toupper: this is a
  |bad practice).

 glibc is the _only_ standard C library i know of that supports its
 own homebrew functionality regarding the issue (and in a way that
 i personally don't want to and will never work with).
 Even the newest ISO C doesn't give just any hand, so that no ISO C
 programmer can expect to use any standard facility before 2020, if
 that is the time, and then operating systems have to adhere to
 that standard, and then programmers have to be convinced to use
 those functions.
 Until then different solutions will have to be used.

 --steffen

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Steffen Nurpmeso
Philippe Verdy verd...@wanadoo.fr wrote:
 |Successors that convert strings instead of just isolated characters (sorry,
 |they are NOT what we need to handle texts, they are not even equivalent
 |to Unicode characters, they are just code units, most often 8-bit with
 |char or 16-bit only with wchar_t!) already exist in all C libraries
 |(including glibc), under different names unfortunately (this is the main
 |reason why there are complex header files trying to find the appropriate
 |name, and providing a default basic implementation that just scans
 |individual characters to filter them with tolower and toupper: this is a
 |bad practice).

glibc is the _only_ standard C library i know of that supports its
own homebrew functionality regarding the issue (and in a way that
i personally don't want to and will never work with).
Even the newest ISO C doesn't give just any hand, so that no ISO C
programmer can expect to use any standard facility before 2020, if
that is the time, and then operating systems have to adhere to
that standard, and then programmers have to be convinced to use
those functions.
Until then different solutions will have to be used.

--steffen
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Steffen Nurpmeso
Philippe Verdy verd...@wanadoo.fr wrote:
 |The standard C++ string package could then have used this standard
 |internally in the methods exposed in its API. I cannot understand why this
 |simple effort was never made on such basic functionality, needed and used
 |in almost all software and OSes.

There are plenty of other things one can bang his head on as
necessary, _that_ is for sure.  Even overwhelmingly, the
pessimistic may say.

--steffen
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-08 Thread Mike FABIAN
Philippe Verdy verd...@wanadoo.fr さんはかきました:

 note that tolower() and toupper() can only work at the 1-character level;
 it is not recommended to use them for changing the case of plain text.

 For correct handling of locales, tolower and toupper should be replaced by
 strtolower and strtoupper (or their aliases), which will be able to process
 character clusters and contextual casing rules needed for a language or
 orthographic style

Yes, thank you for explaining this.

But these details of upper and lower casing cannot be expressed in the
“i18n” file of glibc:

https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n

For toupper and tolower, this file just has character → character
mapping tables; for example, the “tolower” table contains only

(<U03A3>,<U03C3>)

(i.e. mapping Σ U+03A3 → σ U+03C3, never to the final sigma ς
U+03C2).

More correct, detailed information about upper and lower case must come
from elsewhere, not from this “i18n” file in glibc.  Using only the
information from this “i18n” file, not even the Greek sigma can be
handled correctly.

Pravin and me want to update this “i18n” file to the latest
data from Unicode 7.0.0, doing it as correct as possible within
the limitations caused by this file and the ISO C standard.
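
For illustration, a sketch (not glibc code) of the context rule that such a
flat table cannot hold, using the wide-character API; it assumes a locale
whose iswalpha() knows Greek, and it ignores the part of the full Unicode
rule that also looks at what precedes the sigma:

    #include <stddef.h>
    #include <wctype.h>

    #define CAPITAL_SIGMA 0x03A3  /* Σ */
    #define SMALL_SIGMA   0x03C3  /* σ */
    #define FINAL_SIGMA   0x03C2  /* ς */

    /* Lowercase a wide string with the one Greek context rule a plain
     * character-to-character table cannot express: capital sigma becomes
     * final sigma when no letter follows, and ordinary sigma otherwise. */
    static void greek_tolower(wchar_t *s)
    {
        for (size_t i = 0; s[i]; i++) {
            if (s[i] == CAPITAL_SIGMA)
                s[i] = iswalpha((wint_t)s[i + 1]) ? SMALL_SIGMA : FINAL_SIGMA;
            else
                s[i] = (wchar_t)towlower((wint_t)s[i]);
        }
    }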

-- 
Mike FABIAN mfab...@redhat.com
☏ Office: +49-69-365051027, internal 8875027
睡眠不足はいい仕事の敵だ。
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-08 Thread Philippe Verdy
Do not try to get consistent results with only a character-to-character
mapping; it does not work with all letters, because sometimes you need
1-to-2 or 2-to-1 mappings (not all composable characters exist in
precombined forms, and sometimes the combination must be split into its
canonical decomposed equivalent prior to mapping the base character) or
other mappings. toupper() and tolower() should not be used for anything
other than mapping number-like sequences (e.g. to convert hexadecimal
numbers).

Use strupper() and strlower() (or equivalent functions not allocating
memory but writing to a given buffer or stream, and similar functions in
other languages than C/C++) to perform mappings on full strings so that the
string length can safely change.
- This is needed for example to convert city names or people's names to
capitals in a postal address, or to style a book title or chapter heading.
- It is needed as well to perform case-insensitive searches (using case
folding, which is different from converting to lowercase or to uppercase)
to match input, or to implement some input-completion UI to locate possible
matches within a known dictionary or input history.
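
As a toy illustration of the difference, a two-entry full case fold in C:
with it, "Straße" and "STRASSE" fold to the same string, which no
per-character tolower() or toupper() can achieve.  The sketch assumes UTF-32
input and a platform whose wint_t holds full code points (as with glibc);
real folding is driven by the complete CaseFolding.txt data:

    #include <stddef.h>
    #include <stdint.h>
    #include <wctype.h>

    /* Toy full case fold into 'out' (caller supplies room for 2*n + 1
     * values): U+00DF (ß) folds to "ss", a 1-to-2 mapping; everything
     * else goes through towlower().  Two folded strings can then be
     * compared directly for a case-insensitive match. */
    static size_t fold(const uint32_t *s, uint32_t *out)
    {
        size_t n = 0;
        for (; *s; s++) {
            if (*s == 0x00DF) {
                out[n++] = 's';
                out[n++] = 's';
            } else {
                out[n++] = (uint32_t)towlower((wint_t)*s);
            }
        }
        out[n] = 0;
        return n;
    }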


2014-11-08 10:22 GMT+01:00 Mike FABIAN mfab...@redhat.com:

 Philippe Verdy verd...@wanadoo.fr さんはかきました:

  note that tolower() and toupper() can only work at the 1-character level;
  it is not recommended to use them for changing the case of plain text.
 
  For correct handling of locales, tolower and toupper should be replaced by
  strtolower and strtoupper (or their aliases), which will be able to process
  character clusters and contextual casing rules needed for a language or
  orthographic style

 Yes, thank you for explaining this.

 But these details of upper and lower casing cannot be expressed in the
 “i18n” file of glibc:

 https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n

  For toupper and tolower, this file just has character → character
  mapping tables; for example, the “tolower” table contains only
 
  (<U03A3>,<U03C3>)
 
  (i.e. mapping Σ U+03A3 → σ U+03C3, never to the final sigma ς
  U+03C2).

 More correct, detailed information about upper and lower case must come
 from elsewhere, not from this “i18n” file in glibc.  Using only the
 information from this “i18n” file, not even the Greek sigma can be
 handled correctly.

 Pravin and me want to update this “i18n” file to the latest
 data from Unicode 7.0.0, doing it as correct as possible within
 the limitations caused by this file and the ISO C standard.

 --
 Mike FABIAN mfab...@redhat.com
 ☏ Office: +49-69-365051027, internal 8875027
 睡眠不足はいい仕事の敵だ。

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-08 Thread Christopher Vance
So glibc is broken. This doesn't make it a Unicode problem.

On Sat, Nov 8, 2014 at 8:22 PM, Mike FABIAN mfab...@redhat.com wrote:

 Philippe Verdy verd...@wanadoo.fr さんはかきました:

  note that tolower() and toupper() can only work at the 1-character level;
  it is not recommended to use them for changing the case of plain text.
 
  For correct handling of locales, tolower and toupper should be replaced by
  strtolower and strtoupper (or their aliases), which will be able to process
  character clusters and contextual casing rules needed for a language or
  orthographic style

 Yes, thank you for explaining this.

 But these details of upper and lower casing cannot be expressed in the
 “i18n” file of glibc:

 https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n

  For toupper and tolower, this file just has character → character
  mapping tables; for example, the “tolower” table contains only
 
  (<U03A3>,<U03C3>)
 
  (i.e. mapping Σ U+03A3 → σ U+03C3, never to the final sigma ς
  U+03C2).

 More correct, detailed information about upper and lower case must come
 from elsewhere, not from this “i18n” file in glibc.  Using only the
 information from this “i18n” file, not even the Greek sigma can be
 handled correctly.

 Pravin and me want to update this “i18n” file to the latest
 data from Unicode 7.0.0, doing it as correct as possible within
 the limitations caused by this file and the ISO C standard.

 --
 Mike FABIAN mfab...@redhat.com
 ☏ Office: +49-69-365051027, internal 8875027
 睡眠不足はいい仕事の敵だ。
 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode




-- 
Christopher Vance
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-08 Thread Philippe Verdy
glibc is no more broken than any other C library implementing toupper and
tolower from the legacy ctype standard library. These are old APIs that
are just widely used and still have valid contexts where they are simple and
safe to use. But they are not meant to convert text.

The i18n data just shows the mappings used for tolower, toupper (and
totitle), but it is clearly not enough to implement strtolower and
strtoupper, which require more rules (notably 1-to-2 or 2-to-1 mappings,
plus support for normalisation/composition/decomposition and recognizing
canonical equivalents, in all possible reorderings, and more data for
contextual rules such as the final form of sigma). Such data may not be
easily expressible in such a tabular format, and could be implemented by
locale-specific code, for example to handle some dictionary lookups (as
required with some Asian scripts for word breaking, and implicitly needed
for the Korean script, whose normalisation is not handled by table lookups
but algorithmically, by code only, within the normalizer).

I don't see anything wrong with the existing glibc i18n data. Glibc would
be wrong however if it *only* used tolower/toupper to implement
strtolower/strtoupper (but this was still what was done in the past since
the creation of the standard C library on Unix, and even later on DOS,
MacOS, Windows and most other systems... before the creation of Unicode
and its development to support more languages, scripts, and orthographic
systems).

Modern i18n libraries (for various programming languages) contain more
advanced support APIs for correct case mappings on full strings (including
M-to-N mappings, contextual rules and support of canonical equivalences),
and these APIs no longer assume that the output string will be the same
length as the input and that only 1:1 mappings will be performed over each
character (even if this is still what is done when using the "C" root
locale, working only for a few languages and only with simple texts using
restricted alphabets without all the possible Unicode extensions, which are
now needed to support not only the native language but also many proper
names and foreign toponyms, or texts containing small citations in another
language, or any multilingual document).

2014-11-09 1:45 GMT+01:00 Christopher Vance cjsva...@gmail.com:

 So glibc is broken. This doesn't make it a Unicode problem.


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-07 Thread Philippe Verdy
note that tolower() and toupper() can only work at the 1-character level;
it is not recommended to use them for changing the case of plain text.
Their purpose should be limited to use cases where letters can be safely
isolated from their context, for example when handling letters as numbers
(e.g. section numbering).

For correct handling of locales, tolower and toupper should be replaced by
strtolower and strtoupper (or their aliases), which will be able to process
character clusters and contextual casing rules needed for a language or
orthographic style (such as monotonic and polytonic Greek, or for specific
locales intended for medieval texts or old classic scriptures).
strupper and strlower can then perform MORE mappings, which tolower and
toupper cannot perform using only simple mappings. So precombined Greek
letters with iota subscripts can only be converted by preserving the iota
subscript (for which islower() and isupper() are BOTH false when it is
encoded separately and not precombined).

When a Greek letter precombined with an iota subscript is found, the letter
case of this iota subscript should be ignored, and only the lettercase of
the base letter will be considered; this means that it will only be
possible for tolower() and toupper() to map one orthographic style: the
style that preserves the subscript, but not the classic Greek or modern
monotonic style that doesn't know anything about this medieval extension of
the Greek alphabet, which was still in use in the beginning of the 1970's
(handling polytonic Greek with tolower() and toupper(), or with islower()
and isupper(), will not produce the correct result). For modern Greek,
there's no use of this iota subscript, so we are in the same situation as
classic Greek (before the Christian era), except that modern Greek still
uses a few accents (notably the "tonos", equivalent in Unicode to the
acute accent, even if its placement over Greek capitals is preferably
before the letter rather than above it as could be suggested by its
assigned combining class).

2014-11-07 12:32 GMT+01:00 Mike FABIAN mfab...@redhat.com:

 Philippe Verdy verd...@wanadoo.fr さんはかきました:

  this is a feature of the Greek alphabet that the lowercase iota
 subscript
  can be capitalized in two different ways : either as a subscript below
 the
  uppercase main letter, or as a standard iota capitalized. The subscript
  form is a combining character, but not the non-subscript form.

 Laurentiu All of the characters you enumerated are titlecase letters
 Laurentiu (gc=Lt) rather than uppercase letters (gc=Lu),

 U+1F80 ᾀ is something like ἀι and could be capitalized as ἈΙ or as ᾈ.
 ᾈ is something like Ἀι so I understand now that ᾈ can be considered as
 titlecase (gc=Lt).


Note that for modern Greek there's still a difficulty with the special
final form of lowercase sigma: it is effectively lowercase (islower should
return true), not titlecase, and toupper will map it to a standard capital
Sigma. But the reverse conversion will only be able to convert the
uppercase sigma to a standard lowercase sigma, ignoring the final form. To
handle the final form correctly, don't use tolower() character per
character, but use strtolower() and use a decent library that supports
contextual rules. (The same will be true for the German ess-tsett, which
was capitalized as two S's but not reversibly, even if recently an
uppercase variant of ess-tsett was added in Unicode; it is still extremely
rarely used. It is extremely difficult to determine how to convert a double
capital S, and most libraries will only convert it to a double lowercase s,
and some locales deliberately decide not to alter the lowercase ess-tsett
with toupper or strtoupper; this is still correct if those libraries have
not been updated to use the capital ess-tsett now supported in more recent
versions of Unicode, but not found in any other legacy encodings.)

We still have a difficulty with the ampersand "&" because it has been
encoded only as a symbol, assuming that for most locales it is just used in
isolation as an abbreviated form of a word. But in some locales it was
still considered a letter and used everywhere "et" could be used,
including in abbreviations like "etc." == "&c.", or in the middle of
words like "caret" == "car&" or "comm&tre" == "commettre". But the
modern use of the ampersand implies there's a word break before and after
the symbol, and we should have a separate encoding for "&" as a lowercase
ligature, and we should even have an uppercase variant like the German
ess-tsett, as there are glyphic variants of the ligature for uppercased
titles where the modern "&" ampersand does not fit very well, or where it
should be mapped to a non-ligatured "ET" letter pair, distinct from the
mapping (with spaces around) to " ET " in French or to " AND " in
English, as implied by the modern meaning of the current symbol as a
separate word by itself. With a distinct encoding of the ligature, the
common abbreviation "etc." ligatured as "&c." would correctly map to
uppercase "&C." with 

Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-07 Thread Mike FABIAN
Philippe Verdy verd...@wanadoo.fr さんはかきました:

 this is a feature of the Greek alphabet that the lowercase iota subscript
 can be capitalized in two different ways : either as a subscript below the
 uppercase main letter, or as a standard iota capitalized. The subscript
 form is a combining character, but not the non-subscript form.

Now I understand why these are titlecase letters, as Laurentiu
explained:

Laurentiu All of the characters you enumerated are titlecase letters
Laurentiu (gc=Lt) rather than uppercase letters (gc=Lu),

U+1F80 ᾀ is something like ἀι and could be capitalized as ἈΙ or as ᾈ.
ᾈ is something like Ἀι so I understand now that ᾈ can be considered as
titlecase (gc=Lt).

Thank you very much, Phillipe and Laurentiu for explaining!

I stumbled on this question because I am trying to update the character
class data for glibc for Unicode 7.0.0.

glibc has character classes “upper” and “lower” but not “title”.

Bruno Haible’s program to generate the character class data from
UnicodeData.txt tries to enforce that every character which has
a “toupper” mapping *must* be in either “upper” or “lower”.

https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/gen-unicode-ctype.c;h=0c001b299d4601a375a1e814fd2ab06b0536b337;hb=HEAD#l660

I think Bruno’s program does this because

ISO C 99 (ISO/IEC 9899 - Programming languages - C)
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

contains:

 7.4.2.2 The toupper function
 
 [...]
 
 If the argument is a character for which islower is true and there are
 one or more corresponding characters, as specified by the current
 locale, for which isupper is true, the toupper function returns one of
 the corresponding characters (always the same one for any given locale);
 otherwise, the argument is returned unchanged.

which seems to require that toupper should only do something for
characters where islower is true.

Therefore, Bruno’s program puts title case characters like U+1F88 ᾈ
or U+01C5 Dž into *both*, “upper” and “lower”. Which does not
look so unreasonable, given the limitations of C99.

So it looks like we have to continue using this approach because ISO C 99
requires it; we cannot use the “Uppercase” property from
DerivedCoreProperties.txt for this.

But the “Alphabetic” property from DerivedCoreProperties.txt can
probably be used to generate the “alpha” character class for glibc.

I hope this is correct.
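
A small check of that reading, using the analogous wide-character functions
(a sketch only; the exact results depend on the libc and on the named locale
actually being installed):

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void)
    {
        setlocale(LC_ALL, "en_US.UTF-8"); /* assumed to exist on the system */
        wint_t dz = 0x01C5;               /* Dž, a titlecase (gc=Lt) letter */

        /* For Dž to keep both of its mappings (towupper -> U+01C4 DŽ and
         * towlower -> U+01C6 dž) under the wording quoted above, a libc
         * built from flat tables ends up classifying it as both upper
         * and lower. */
        printf("iswupper=%d iswlower=%d\n",
               iswupper(dz) != 0, iswlower(dz) != 0);
        printf("towupper=U+%04X towlower=U+%04X\n",
               (unsigned)towupper(dz), (unsigned)towlower(dz));
        return 0;
    }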

-- 
Mike FABIAN mfab...@redhat.com
☏ Office: +49-69-365051027, internal 8875027
睡眠不足はいい仕事の敵だ。
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Question about Uppercase in DerivedCoreProperties.txt

2014-11-06 Thread Mike FABIAN

I have a question about “Uppercase” in DerivedCoreProperties.txt:

U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
is listed as “Lowercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt :

   1F80..1F87; Lowercase # L&   [8] GREEK SMALL LETTER ALPHA WITH PSILI 
AND YPOGEGRAMMENI..GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND 
YPOGEGRAMMENI

But

“U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
is *not* listed as “Uppercase” in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

Although U+1F88 seems to be Uppercase according to
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
because it has a tolower mapping to U+1F80:

1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 
0345;;;;N;;;1F88;;1F88
1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 
0345;;;;N;;;;1F80;

Is the information in DerivedCoreProperties.txt correct or
could this be a bug in DerivedCoreProperties.txt?

The above is not only the case for U+1F88, but for several more characters.

All the characters listed below have a tolower mapping in 
http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
but are not listed in DerivedCoreProperties.txt as “Uppercase”:

U+1F88 ᾈ has a tolower mapping to U+1F80 ᾀ
U+1F89 ᾉ has a tolower mapping to U+1F81 ᾁ
U+1F8A ᾊ has a tolower mapping to U+1F82 ᾂ
U+1F8B ᾋ has a tolower mapping to U+1F83 ᾃ
U+1F8C ᾌ has a tolower mapping to U+1F84 ᾄ
U+1F8D ᾍ has a tolower mapping to U+1F85 ᾅ
U+1F8E ᾎ has a tolower mapping to U+1F86 ᾆ
U+1F8F ᾏ has a tolower mapping to U+1F87 ᾇ
U+1F98 ᾘ has a tolower mapping to U+1F90 ᾐ
U+1F99 ᾙ has a tolower mapping to U+1F91 ᾑ
U+1F9A ᾚ has a tolower mapping to U+1F92 ᾒ
U+1F9B ᾛ has a tolower mapping to U+1F93 ᾓ
U+1F9C ᾜ has a tolower mapping to U+1F94 ᾔ
U+1F9D ᾝ has a tolower mapping to U+1F95 ᾕ
U+1F9E ᾞ has a tolower mapping to U+1F96 ᾖ
U+1F9F ᾟ has a tolower mapping to U+1F97 ᾗ
U+1FA8 ᾨ has a tolower mapping to U+1FA0 ᾠ
U+1FA9 ᾩ has a tolower mapping to U+1FA1 ᾡ
U+1FAA ᾪ has a tolower mapping to U+1FA2 ᾢ
U+1FAB ᾫ has a tolower mapping to U+1FA3 ᾣ
U+1FAC ᾬ has a tolower mapping to U+1FA4 ᾤ
U+1FAD ᾭ has a tolower mapping to U+1FA5 ᾥ
U+1FAE ᾮ has a tolower mapping to U+1FA6 ᾦ
U+1FAF ᾯ has a tolower mapping to U+1FA7 ᾧ
U+1FBC ᾼ has a tolower mapping to U+1FB3 ᾳ
U+1FCC ῌ has a tolower mapping to U+1FC3 ῃ
U+1FFC ῼ has a tolower mapping to U+1FF3 ῳ

Is that correct or a bug?


-- 
Mike FABIAN mfab...@redhat.com
☏ Office: +49-69-365051027, internal 8875027
睡眠不足はいい仕事の敵だ。
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-06 Thread Philippe Verdy
this is a feature of the Greek alphabet that the lowercase iota subscript
can be capitalized in two different ways: either as a subscript below the
uppercase main letter, or as a standard iota capitalized. The subscript
form is a combining character, but not the non-subscript form. There should
exist a special contextual rule for language-specific casings; there's one
already for the final sigma, but not for the iota. It is not evident to
handle, and in fact the choice of case mapping is not specifically a
linguistic rule but a rendering style rule: for carved inscriptions, which
generally use only capitals, the combining forms are generally avoided and
a reduced alphabet is used. For handwritten and cursive styles, the
extended alphabet is used and this enables contextual forms including the
small iota subscript and final small sigma and many combining signs (this
also allows other placement rules for accents). For printing or display
purposes there's no rule; the document author enables or disables the
extended alphabet (disabled generally for rendering at small resolutions).
The simple case mappings however should preserve the distinctions present
in the extended alphabet, but simple uppercasing of text should not convert
lowercase to all uppercase with an appended uppercase iota, even if this
maps a lowercase letter to a titlecase one (it would be lossy, and simple
casing rules should be lossless).
Case mappings in the main UCD however ignore the contextual rules and
language-specific and style-specific rules. But even if they are wrong this
cannot be changed. The simple mappings in the main UCD file should not be
assumed to be lossless. Actual case mappers do not use just these basic
rules, which are just the most frequent mappings assumed (anyway, any kind
of case conversion introduces a loss; the degree of loss is variable when
mappings are not concerned by just a single pair of simple letters, see
also the old difficulties about the German ess-tsett or sharp sign, and
about many ligatures that became plain letters in some contexts, including
the ampersand '&' sign which originates from the 'et' ligature, or the
German umlaut which inherits some old behavior of the superscripted small
latin letter e, behaving like the Greek iota subscript in Fraktur font
styles)

2014-11-06 16:55 GMT+01:00 Mike FABIAN maiku.fab...@gmail.com:


 I have a question about “Uppercase” in DerivedCoreProperties.txt:

 U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
 is listed as “Lowercase” in
 http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt :

1F80..1F87; Lowercase # L&   [8] GREEK SMALL LETTER ALPHA WITH
 PSILI AND YPOGEGRAMMENI..GREEK SMALL LETTER ALPHA WITH DASIA AND
 PERISPOMENI AND YPOGEGRAMMENI

 But

 “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI”
 is *not* listed as “Uppercase” in
 http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt .

 Although U+1F88 seems to be Uppercase according to
 http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
 because it has a tolower mapping to U+1F80:

 1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00
 0345;;;;N;;;1F88;;1F88
 1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND
 PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80;

 Is the information in DerivedCoreProperties.txt correct or
 could this be a bug in DerivedCoreProperties.txt?

 The above is not only the case for U+1F88, but for several more characters.

 All the characters listed below have a tolower mapping in
 http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
 but are not listed in DerivedCoreProperties.txt as “Uppercase”:

 U+1F88 ᾈ has a tolower mapping to U+1F80 ᾀ
 U+1F89 ᾉ has a tolower mapping to U+1F81 ᾁ
 U+1F8A ᾊ has a tolower mapping to U+1F82 ᾂ
 U+1F8B ᾋ has a tolower mapping to U+1F83 ᾃ
 U+1F8C ᾌ has a tolower mapping to U+1F84 ᾄ
 U+1F8D ᾍ has a tolower mapping to U+1F85 ᾅ
 U+1F8E ᾎ has a tolower mapping to U+1F86 ᾆ
 U+1F8F ᾏ has a tolower mapping to U+1F87 ᾇ
 U+1F98 ᾘ has a tolower mapping to U+1F90 ᾐ
 U+1F99 ᾙ has a tolower mapping to U+1F91 ᾑ
 U+1F9A ᾚ has a tolower mapping to U+1F92 ᾒ
 U+1F9B ᾛ has a tolower mapping to U+1F93 ᾓ
 U+1F9C ᾜ has a tolower mapping to U+1F94 ᾔ
 U+1F9D ᾝ has a tolower mapping to U+1F95 ᾕ
 U+1F9E ᾞ has a tolower mapping to U+1F96 ᾖ
 U+1F9F ᾟ has a tolower mapping to U+1F97 ᾗ
 U+1FA8 ᾨ has a tolower mapping to U+1FA0 ᾠ
 U+1FA9 ᾩ has a tolower mapping to U+1FA1 ᾡ
 U+1FAA ᾪ has a tolower mapping to U+1FA2 ᾢ
 U+1FAB ᾫ has a tolower mapping to U+1FA3 ᾣ
 U+1FAC ᾬ has a tolower mapping to U+1FA4 ᾤ
 U+1FAD ᾭ has a tolower mapping to U+1FA5 ᾥ
 U+1FAE ᾮ has a tolower mapping to U+1FA6 ᾦ
 U+1FAF ᾯ has a tolower mapping to U+1FA7 ᾧ
 U+1FBC ᾼ has a tolower mapping to U+1FB3 ᾳ
 U+1FCC ῌ has a tolower mapping to U+1FC3 ῃ
 U+1FFC ῼ has a tolower mapping to U+1FF3 ῳ

 Is that correct or a bug?

RE: Question about Uppercase in DerivedCoreProperties.txt

2014-11-06 Thread Laurentiu Iancu
Hello,

The property Uppercase is a binary, informative property derived from 
General_Category (gc=Lu) and Other_Uppercase (OUpper=Y), as documented in 
Section 5.3 of UAX #44, at http://www.unicode.org/reports/tr44/#Uppercase.

All of the characters you enumerated are titlecase letters (gc=Lt) rather 
than uppercase letters (gc=Lu), and they are not specifically assigned 
Other_Uppercase (which would otherwise contradict their General_Category).  
Following the derivation, they do not have the Uppercase binary property.

For a visualization of the set of characters assigned the binary property 
Uppercase in relation to the set of Uppercase_Letter characters (gc=Lu), you 
can use the UnicodeSet comparison tool at 
http://www.unicode.org/cldr/utility/unicodeset.jsp.  Enter “[:gc=Lu:]” in one 
input field and “[:Uppercase:]” in the other field, then click on Compare.

Regards,

L.
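
In code, the derivation is just a disjunction; a minimal sketch with
hypothetical property accessors standing in for a real UCD lookup:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical accessors -- stand-ins for a real UCD property lookup. */
    enum gc { GC_LU, GC_LL, GC_LT, GC_OTHER };
    extern enum gc general_category(uint32_t cp);
    extern bool other_uppercase(uint32_t cp);

    /* UAX #44, Section 5.3: Uppercase = gc=Lu plus Other_Uppercase.
     * U+1F88 is gc=Lt and not Other_Uppercase, so it comes out false. */
    static bool is_uppercase(uint32_t cp)
    {
        return general_category(cp) == GC_LU || other_uppercase(cp);
    }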



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Question about a Normalization test

2014-10-23 Thread Aaron Cannon
Hi all, from the latest version of the standard, on line 16977 of the
normalization tests, I am a bit confused by the NFC form.  It appears
incorrect to me.  Here's the line, sans comment:

0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE
0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300
0315 0062;

Just looking at column 2, which according to the comments at the top
is the NFC form:

0061 05AE 0305 0300 0315 0062:

This, however, does not appear to be in NFC form.

The first character, and the second or third characters do not
compose.  However, the first and fourth (0061  and 0300) do, composing
to 00E0.

Since there are no further compositions, the normalized form should be
00E0 05AE 0305 0315 0062

What am I missing?

Thanks in advance for your help!

Aaron
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about a Normalization test

2014-10-23 Thread Mark Davis ☕️
On Thu, Oct 23, 2014 at 6:54 PM, Aaron Cannon 
cann...@fireantproductions.com wrote:

 0061 05AE 0305 0300 0315 0062


http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cu0061+%5Cu05AE+%5Cu0305+%5Cu0300+%5Cu0315+%5Cu0062g=ccc

0305 and 0300 have the same ccc, so the first one blocks the second.

http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G49576

The older spec is shorter, although not as precise:
http://www.unicode.org/reports/tr15/tr15-29.html#Specification

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Question about a Normalization test

2014-10-23 Thread Whistler, Ken
Aaron Cannon asked:

 Hi all, from the latest version of the standard, on line 16977 of the
 normalization tests, I am a bit confused by the NFC form.  It appears
 incorrect to me.  Here's the line, sans comment:

 0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE
 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300
 0315 0062;

 Just looking at column 2, which according to the comments at the top
 is the NFC form:

 0061 05AE 0305 0300 0315 0062:

 This, however, does not appear to be in NFC form.

 The first character, and the second or third characters do not
 compose.  However, the first and fourth (0061 and 0300) do, composing
 to 00E0.

 Since there are no further compositions, the normalized form should be
 00E0 05AE 0305 0315 0062

 What am I missing?

Input is:

Code points:  0061  0305  0315  0300  05AE  0062
Ccc:             0   230   232   230   228     0

Output of canonical reordering is:

Code points:  0061  05AE  0305  0300  0315  0062
Ccc:             0   228   230   230   232     0

Next step is to start from 0061 and test each successive combining
mark, looking for composition candidates.

0061 does not compose with 05AE.
0061 does not compose with 0305.
0061 *could* compose with 0300 (00E0 = 0061 + 0300), *but*
0300 is *blocked* from 0061 by the intervening combining
mark 0305 with the *same* ccc value as 0300. So the
composition does not occur.
0061 does not compose with 0315.
The next character is 0062, ccc=0, a starter, so we are done.

For the relevant definitions, see:

http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G50628

and scroll down a couple pages to D115 on p. 139.

Test cases like this are included in NormalizationTest.txt precisely
to ensure that implementations are correctly detecting these
sequences where composition is blocked.
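
A sketch of that blocking test in C, assuming the sequence has already been
canonically reordered, that index 0 holds the starter, and that ccc() is a
real Canonical_Combining_Class lookup (the function names here are
illustrative, not from any particular library):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    extern int ccc(uint32_t cp);  /* stand-in for a ccc property lookup */

    /* D115, paraphrased: the character at index i is blocked from the
     * starter at index 0 if some character strictly between them is a
     * starter or has a combining class >= the combining class at i.
     * Above, 0300 (ccc 230) is blocked by the preceding 0305 (ccc 230),
     * so 0061 + 0300 must not compose to 00E0. */
    static bool blocked_from_starter(const uint32_t *s, size_t i)
    {
        for (size_t j = 1; j < i; j++)
            if (ccc(s[j]) == 0 || ccc(s[j]) >= ccc(s[i]))
                return true;
        return false;
    }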



--Ken


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about a Normalization test

2014-10-23 Thread Aaron Cannon
On 10/23/14, Whistler, Ken ken.whist...@sap.com wrote:
 Test cases like this are included in NormalizationTest.txt precisely
 to ensure that implementations are correctly detecting these
 sequences where composition is blocked.

And I am indeed glad that they are, as I completely missed this small
but critical detail.

Thanks so much all!

Aaron
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Question about WordBreak property rules

2014-07-24 Thread Karl Williamson

http://www.unicode.org/draft/reports/tr29/tr29.html#WB6
indicates that there should be no break between the first two letters in 
the sequence

Hebrew_Letter Single_Quote Hebrew_Letter.

However, rule 7a just below indicates that there should be no break 
between a Hebrew_Letter and a Single_Quote even if what follows is not a 
Hebrew_Letter.


This is not contradictory, but it is suspicious.  It makes me wonder if 
there is an error in the specification.  Assuming there is not, then 
rule 7a ought to be before current rule 6, which itself should be 
divided so that there isn't redundant specification of the Hebrew_Letter 
rules.
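
A sketch of the interplay in C, with hypothetical word-break property codes
(rules are tried in order and the first match decides; rule 7a is checked
first here, per the reordering suggested above):

    #include <stdbool.h>

    enum wb { WB_ALETTER, WB_HEBREW_LETTER, WB_MIDLETTER,
              WB_SINGLE_QUOTE, WB_OTHER };

    /* Decide "do not break" between prev and cur, with one code point of
     * lookahead.  Rule 7a needs no lookahead, so once it is applied, the
     * Hebrew_Letter-before-Single_Quote case inside rule 6 is redundant. */
    static bool no_break(enum wb prev, enum wb cur, enum wb next)
    {
        /* WB7a: Hebrew_Letter x Single_Quote */
        if (prev == WB_HEBREW_LETTER && cur == WB_SINGLE_QUOTE)
            return true;
        /* WB6 (simplified): AHLetter x (MidLetter | MidNumLetQ) AHLetter */
        if ((prev == WB_ALETTER || prev == WB_HEBREW_LETTER) &&
            (cur == WB_MIDLETTER || cur == WB_SINGLE_QUOTE) &&
            (next == WB_ALETTER || next == WB_HEBREW_LETTER))
            return true;
        return false;
    }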

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about WordBreak property rules

2014-07-24 Thread Karl Williamson

On 07/24/2014 01:38 PM, Karl Williamson wrote:

http://www.unicode.org/draft/reports/tr29/tr29.html#WB6
indicates that there should be no break between the first two letters in
the sequence
Hebrew_Letter Single_Quote Hebrew_Letter.

However, rule 7a just below indicates that there should be no break
between a Hebrew_Letter and a Single_Quote even if what follows is not a
Hebrew_Letter.

This is not contradictory, but it is suspicious.  It makes me wonder if
there is an error in the specification.  Assuming there is not, then
rule 7a ought to be before current rule 6, which itself should be
divided so that there isn't redundant specification of the Hebrew_Letter
rules.


In reading this after I sent it, I'm not sure I was clear enough.
Rule 6 implies that you need additional context to decide whether to 
break between a Hebrew_Letter and a following Single_Quote.


Yet Rule 7a says that you don't need any additional context; you never 
break.


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


question to Akkadian

2014-05-19 Thread Werner LEMBERG

Folks,


I'm trying to find an encoding of the following Akkadian cuneiform:

 ___  ___  ___
 \ /  \ /  \ /
  |||
  | /| | /| |
  | \| | \| |
  |||
   |\___
   |/


My knowledge of cuneiforms is zero, but I can read Unicode tables :-)
However, I haven't found it in the Akkadian cuneiforms block.  Either
I've missed it, or it gets represented as a ligature, or ...

In case it is a ligature: Where should I look to find well drawn
glyphs?  Or to formulate it more generally: If I have a cuneiform
text, where can I find glyph images to identify them?


Werner
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: question to Akkadian

2014-05-19 Thread Tom Gewecke

On May 19, 2014, at 8:40 AM, Werner LEMBERG wrote:

  If I have a cuneiform
 text, where can I find glyph images to identify them?

You might want to specify what you mean by text.  A photo of an inscription?  
Something from a printed book?

Because of the considerable variation in glyphs over the long time period when 
this script was used, you may need to consult a reference that tries to cover 
that, like Labat's Manuel d'Épigraphie Akkadienne.
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: question to Akkadian

2014-05-19 Thread Werner LEMBERG

  If I have a cuneiform text, where can I find glyph images to
 identify them?
 
 You might want to specify what you mean by text.  A photo of an
 inscription?  Something from a printed book?

I'm interested in representing one of the so-called Hurrian songs
(tablet H.6, containing musical notation) with Unicode, cf.

  https://en.wikipedia.org/wiki/Hurrian_songs

A much better drawing of the tablet can be found here on page 503:

  http://digital.library.stonybrook.edu/cdm/ref/collection/amar/id/7250

The character in question is the first one on the left after the
double line.

A nice article on this song can be found here:

  
http://individual.utoronto.ca/seadogdriftwood/Hurrian/Website_article_on_Hurrian_Hymn_No._6.html


Werner
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: question to Akkadian

2014-05-19 Thread Tom Gewecke

On May 19, 2014, at 9:21 AM, Werner LEMBERG wrote:

 
 I'm interested in representing one of the so-called Hurrian songs
 (tablet H.6, containing musical notation) with Unicode, cf.
 
  https://en.wikipedia.org/wiki/Hurrian_songs

That says it represents qáb, which seems to be a version of Labat 88, which is  
U+1218F KAB.

Unfortunately none of my fonts give the version shown in that drawing, but 
there may be one.

Photo from Labat attached.

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: question to Akkadian

2014-05-19 Thread Werner LEMBERG

 I'm interested in representing one of the so-called Hurrian songs
 (tablet H.6, containing musical notation) with Unicode, cf.
 
  https://en.wikipedia.org/wiki/Hurrian_songs
 
 That says it represents qáb, which seems to be a version of Labat
 88, which is U+1218F KAB.
 
 Unfortunately none of my fonts give the version shown in that
 drawing, but there may be one.

Thanks a lot!  Will try to get the book you've mentioned...

BTW, it seems to me that cuneiforms would benefit enormously by
introducing variant selectors, collecting all cuneiform variants in a
database similar to the CJK stuff.


Werner

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Fwd: Terminology question re ASCII

2013-10-29 Thread Christopher Vance
Sorry, should have cc:d the list. Assume original mail was from a list
member.

-- Forwarded message --
From: Christopher Vance cjsva...@gmail.com
Date: 29 October 2013 16:58
Subject: Re: Terminology question re ASCII
To: Mark Davis ☕ m...@macchiato.com


Of course, once you have 8-bit characters in the upper range from 0x80 up,
you can only know intrinsically that it's not actually ASCII, and that
anybody who says it is, is probably lying.

You can only determine the actual character set used from extrinsic
information. Is the 8th bit just parity? Is it a Microsoft set with those
graphical things? Is it one of the Latin-N sets (which one)? EBCDIC?
Something else?


On 29 October 2013 16:38, Mark Davis ☕ m...@macchiato.com wrote:

 Normally the term ASCII just refers to the 7-bit form. What is sometimes
 called 8-bit ASCII is the same as ISO Latin 1. If you want to be
 completely clear, you can say 7-bit ASCII.


 Mark https://plus.google.com/114199149796022210033
 *— Il meglio è l’inimico del bene —*


 On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote:

 Quick question on terminology use concerning a legacy encoding:

 If one refers to plain ASCII, or plain ASCII text or ...
 characters, should this be taken strictly as referring to the 7-bit basic
 characters, or might it encompass characters that might appear in an 8-bit
 character set (per the so-called extended ASCII)?

 I've always used the term ASCII in the 7-bit, 128 character sense, and
 modifying it with plain seems to reinforce that sense. (Although plain
 text in my understanding actually refers to lack of formatting.)

 Reason for asking is encountering a reference to plain ASCII describing
 text that clearly (by presence of accented characters) would be 8-bit.

 The context is one of many situations where in attaching a document to an
 email, it is advisable to include an unformatted text version of the
 document in the body of the email. Never mind that the latter is probably
 in UTF-8 anyway(?) - the issue here is the terminology.

 TIA for any feedback.

 Don Osborn

 Sent via BlackBerry by ATT






-- 
Christopher Vance



-- 
Christopher Vance


Re: Terminology question re ASCII

2013-10-29 Thread Jukka K. Korpela

2013-10-29 6:12, d...@bisharat.net wrote:


If one refers to plain ASCII, or plain ASCII text or ...
characters, should this be taken strictly as referring to the 7-bit
basic characters, or might it encompass characters that might appear
in an 8-bit character set (per the so-called extended ASCII)?


In correct usage, “ASCII” refers to a specific standard, namely “American 
National Standard for Information Systems - Coded Character Sets - 7-Bit 
American National Standard Code for Information Interchange (7-Bit 
ASCII)”, ANSI X3.4-1986, except in historical presentations, where it 
might refer to predecessors of that standard (earlier versions of ASCII).


In common usage, “ASCII” is also used to denote a) text data in general, 
b) some 8-bit encoding that has ASCII characters as its 7-bit subset, 
and c) other things. This can be very confusing, and that’s why the 
standard has the parenthetic note “7-Bit ASCII” and why people often use 
“US-ASCII” as the name of the ASCII encoding. The clarifying prefixes 
are, however, also misleading in the sense that they suggest the 
existence of other ASCIIs.



I've always used the term ASCII in the 7-bit, 128 character sense,
and modifying it with plain seems to reinforce that sense.
(Although plain text in my understanding actually refers to lack of
formatting.)


The attribute “plain” probably refers to plain text in the contexts 
given. Once people make the mistake of writing “ASCII” when they mean 
“text”, further confusion will be caused by attributes like “plain”, 
which are indeed ambiguous.



Reason for asking is encountering a reference to plain ASCII
describing text that clearly (by presence of accented characters)
would be 8-bit.


It probably means “plain text”. But it could also mean “text in an 8-bit 
encoding”, if the author thinks of encodings like ISO 8859-1, 
windows-1252, ISO 8859-2, cp-850, Mac Roman, etc., as “extended ASCII” 
and even drops the attribute “extended”. It is conceivable that “plain 
ASCII” is even used to emphasize that the text is not in a Unicode encoding.



The context is one of many situations where in attaching a document
to an email, it is advisable to include an unformatted text version
of the document in the body of the email. Never mind that the latter
is probably in UTF-8 anyway(?) - the issue here is the terminology.


The proper term for plain text is “plain text”. The word “unformatted” 
is often used, and might be seen as intuitively descriptive 
(unformatted, as opposite to text that contains formatting like bolding, 
colors, and different fonts), but it is risky. For one thing, plain text 
is often displayed “as is” with respect to line breaks and indentation, 
i.e. as “preformatted” (as in pre elements in HTML). Moreover, text 
that is not plain text need not be formatted. It could be e.g. an XML 
file where XML tags are used to mark up structural parts of the text, 
without causing or implying any specific formatting in rendering.


Yucca






Re: Terminology question re ASCII

2013-10-29 Thread David Starner
On Mon, Oct 28, 2013 at 10:38 PM, Mark Davis ☕ m...@macchiato.com wrote:
 Normally the term ASCII just refers to the 7-bit form. What is sometimes
 called 8-bit ASCII is the same as ISO Latin 1. If you want to be
 completely clear, you can say 7-bit ASCII.

One of the first hits for "8-bit ASCII" on Google Books is "When the
Mac came out, it supported 8-bit ASCII.", courtesy of Introduction to
Digital Publishing, by David Bergsland. (He also seems to be under
the delusion that MS-DOS used 7-bit ASCII.) I don't think you can
assume anything about "8-bit ASCII" besides the lower bits (hopefully)
being compatible with ASCII.

-- 
Kie ekzistas vivo, ekzistas espero.




Re: Terminology question re ASCII

2013-10-29 Thread Philippe Verdy
8-bit ASCII is not so clear!

The reason for that is the historic documentation of much software,
notably for the BASIC language, or similar tools like Excel, or even more
recent languages like PHP, offering functions like CHR$(number) and
ASC(string) to convert a string to the numeric 8-bit ASCII code of its
first character, or the reverse. The effective encoding of strings was in
fact not specified at all and could be any 8-bit encoding used on the
platform.

Only in more recent versions of the implementations of these languages do
they specify that the encoding of their strings is now based on Unicode (most
often UTF-16, so that 8-bit values now produce the same result as
ISO-8859-1), but this is not enforced if a compatibility working mode was
kept (e.g. in PHP, which still uses unspecified 8-bit encodings for its
strings in most of its API, or in Python, which distinguishes types for 8-bit
encoded strings and Unicode-encoded strings).
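A minimal Python sketch (using only standard codecs; an illustration added to
this archive) of why an unqualified "8-bit ASCII" label is ambiguous -- the
same byte maps to different characters under different legacy code pages:

    # The same byte sequence under three legacy "extended ASCII" code pages.
    data = bytes([0x41, 0xE9])        # 'A' followed by the byte 0xE9

    print(data.decode("latin-1"))     # 'Aé'  (ISO-8859-1)
    print(data.decode("cp437"))       # 'AΘ'  (original IBM PC code page)
    print(data.decode("mac-roman"))   # 'AÈ'  (classic Mac OS Roman)

    # Strict 7-bit ASCII rejects the byte outright:
    try:
        data.decode("ascii")
    except UnicodeDecodeError as err:
        print(err)                    # 0xE9 is not ASCII at all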



2013/10/29 Mark Davis ☕ m...@macchiato.com

 Normally the term ASCII just refers to the 7-bit form. What is sometimes
 called 8-bit ASCII is the same as ISO Latin 1. If you want to be
 completely clear, you can say 7-bit ASCII.


 Mark https://plus.google.com/114199149796022210033
 — Il meglio è l’inimico del bene —


 On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote:

 Quick question on terminology use concerning a legacy encoding:

 If one refers to "plain ASCII", or "plain ASCII text" or "... characters",
 should this be taken strictly as referring to the 7-bit basic
 characters, or might it encompass characters that might appear in an 8-bit
 character set (per the so-called "extended ASCII")?

 I've always used the term ASCII in the 7-bit, 128-character sense, and
 modifying it with "plain" seems to reinforce that sense. (Although "plain
 text" in my understanding actually refers to lack of formatting.)

 Reason for asking is encountering a reference to "plain ASCII" describing
 text that clearly (by presence of accented characters) would be 8-bit.

 The context is one of many situations where in attaching a document to an
 email, it is advisable to include an unformatted text version of the
 document in the body of the email. Never mind that the latter is probably
 in UTF-8 anyway(?) - the issue here is the terminology.

 TIA for any feedback.

 Don Osborn

 Sent via BlackBerry by AT&T






RE: Terminology question re ASCII

2013-10-29 Thread Shawn Steele
I would concur.  When I hear “8 bit ASCII” the context is usually confusing the 
term with any of what we call “ANSI Code Pages” in Windows.  (or similar ideas 
on other systems).

It’s also usually the prelude to a conversation asking the requestor to back up 
5 or 6 steps and explain what they’re really trying to do because something’s 
probably a bit confused.

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Philippe Verdy
Sent: Tuesday, October 29, 2013 7:49 AM
To: Mark Davis ☕
Cc: Donald Z. Osborn; unicode
Subject: Re: Terminology question re ASCII

8-bit ASCII is not so clear!

The reason for that is the historic documentation of much software, notably 
for the BASIC language, or similar tools like Excel, or even more recent 
languages like PHP, offering functions like CHR$(number) and ASC(string) to 
convert a string to the numeric 8-bit ASCII code of its first character, or 
the reverse. The effective encoding of strings was in fact not specified at 
all and could be any 8-bit encoding used on the platform.

Only in more recent versions of the implementations of these languages do 
they specify that the encoding of their strings is now based on Unicode (most 
often UTF-16, so that 8-bit values now produce the same result as ISO-8859-1), 
but this is not enforced if a compatibility working mode was kept (e.g. in PHP, 
which still uses unspecified 8-bit encodings for its strings in most of its 
API, or in Python, which distinguishes types for 8-bit encoded strings and 
Unicode-encoded strings).


2013/10/29 Mark Davis ☕ m...@macchiato.com
Normally the term ASCII just refers to the 7-bit form. What is sometimes called 
8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, 
you can say 7-bit ASCII.


Mark https://plus.google.com/114199149796022210033

— Il meglio è l’inimico del bene —

On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote:
Quick question on terminology use concerning a legacy encoding:

If one refers to "plain ASCII", or "plain ASCII text" or "... characters", 
should this be taken strictly as referring to the 7-bit basic characters, or 
might it encompass characters that might appear in an 8-bit character set (per 
the so-called "extended ASCII")?

I've always used the term ASCII in the 7-bit, 128-character sense, and 
modifying it with "plain" seems to reinforce that sense. (Although "plain text" 
in my understanding actually refers to lack of formatting.)

Reason for asking is encountering a reference to "plain ASCII" describing text 
that clearly (by presence of accented characters) would be 8-bit.

The context is one of many situations where in attaching a document to an 
email, it is advisable to include an unformatted text version of the document 
in the body of the email. Never mind that the latter is probably in UTF-8 
anyway(?) - the issue here is the terminology.

TIA for any feedback.

Don Osborn

Sent via BlackBerry by AT&T





Re: Terminology question re ASCII

2013-10-29 Thread Philippe Verdy
2013/10/29 Shawn Steele shawn.ste...@microsoft.com

 I would concur.  When I hear “8 bit ASCII” the context is usually
confusing the term with any of what we call “ANSI Code Pages” in Windows.
 (or similar ideas on other systems).


Of course not just Windows (or MS-DOS). This was seen as well in various
early OSes for personal computers from various brands and various countries
(not just the US, like Atari, but as well from Japan, France, Germany, the
UK, Sweden and certainly others, where neither the US-only ASCII nor ANSI
were standard). We've also seen these documents speaking about US-ASCII
when they actually meant an 8-bit encoding whose lower 7-bit part matched
ISO 646 for the US (i.e. the real ASCII standard from ANSI).

Due to Windows however (also in IBM OS/2, IBM DOS, and other derived OSes
by Digital Research for example, and also in some brands of Unix, CP/M,
VMS... as well as in early development/porting for Linux), the ambiguity
arose when people started to speak about ANSI as an encoding, when it was
actually a standards body developing various standards (including for other
encodings). Later this was corrected (not in Windows, which uses the
incorrect term ANSI codepage when none of them were actually coming from
ANSI but from Microsoft, IBM, or some other national bodies, and were later
modified by Microsoft!) by simply using ASCII instead of ANSI, when
they should have just spoken of **some** range of 8-bit encodings supported
by the underlying OS whose lower 7-bit part was more or less based on some
national version of ISO 646 (or sometimes only on its invariant part,
excluding significant parts reserved for C0 controls but tweaked to encode
printable characters, e.g. in VISCII or in IBM PC codepages for DOS).

7-bit and 8-bit encodings have always been a mess to reference, with
frequently ambiguous or wrong names, and many aliases being developed when
trying to disambiguate them (e.g. the IBM and Microsoft numeric codepages,
later aliased again on other systems!). This led to the creation of an
international registry of encoding identifiers to fix the recommended
identifiers for interchange and deprecate the other aliases (but Microsoft
never used it directly; it continued using its own numeric codepages, and
just accepted a few named aliases, sometimes incorrectly, for example when
Microsoft FrontPage confused and aliased ISO-8859-1 and windows-1252,
changing them in incompatible ways, forcing HTML5 now to declare that
ISO-8859-1 is no longer this standard but windows-1252).
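A minimal Python sketch of the practical consequence (an illustration added to
this archive): ISO-8859-1 and windows-1252 differ exactly in the 0x80..0x9F
range, so the aliasing is visible in any decoder:

    # "Smart quotes" are printable characters in windows-1252 but
    # C1 control codes in real ISO-8859-1.
    smart = b"\x93quoted\x94"
    print(smart.decode("cp1252"))        # '“quoted”'
    decoded = smart.decode("latin-1")
    print(f"U+{ord(decoded[0]):04X}")    # U+0093, a C1 control character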


Re: Terminology question re ASCII

2013-10-28 Thread Mark Davis ☕
Normally the term ASCII just refers to the 7-bit form. What is sometimes
called 8-bit ASCII is the same as ISO Latin 1. If you want to be
completely clear, you can say 7-bit ASCII.


Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —


On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote:

 Quick question on terminology use concerning a legacy encoding:

 If one refers to "plain ASCII", or "plain ASCII text" or "... characters",
 should this be taken strictly as referring to the 7-bit basic characters,
 or might it encompass characters that might appear in an 8-bit character
 set (per the so-called "extended ASCII")?

 I've always used the term ASCII in the 7-bit, 128-character sense, and
 modifying it with "plain" seems to reinforce that sense. (Although "plain
 text" in my understanding actually refers to lack of formatting.)

 Reason for asking is encountering a reference to "plain ASCII" describing
 text that clearly (by presence of accented characters) would be 8-bit.

 The context is one of many situations where in attaching a document to an
 email, it is advisable to include an unformatted text version of the
 document in the body of the email. Never mind that the latter is probably
 in UTF-8 anyway(?) - the issue here is the terminology.

 TIA for any feedback.

 Don Osborn

 Sent via BlackBerry by AT&T





Re: UTF-8 ill-formed question

2012-12-16 Thread Otto Stolz

Hello,

On 2012-12-15, Philippe Verdy wrote:

But there's still a bug (or request for enhancement) for your Pocket
converters :

- For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates)
from the sets of convertible codepoints.

- But you don't exclude this range in the case of your UTF-8 and UTF-32
magic encoders which could forget this case. Of course your encoder would
create distinct sequences for these code points, but they are not valid
UTF-8 or valid UTF-32 encodings.


Only the UTF-16 variant is really *my* “magic pocket encoder” (MPE);
the author is named on each of the three.

I would not demand more from those MPEs than converting
a valid UCS character to a valid, and equivalent, UTF
sequence – and illustrating the underlying algorithm.
I guess, originally, they were meant as jokes – partially,
at least; I have used them as a didactic device in my
beginner's lecture on Unicode.

Clearly, Mike Ayers made the point that the UTF-32 encoding
is nothing but a simple shortcut (in terms of its two
predecessors). His one-row-only MPE expresses this quite
aptly, and any additional branch would spoil the impression.

The reason I excluded the surrogates from my UTF-8 MPE
was really that I needed additional space for the user’s
guide on the reverse side.

Cheers,
  Otto Stolz








Re: UTF-8 ill-formed question

2012-12-16 Thread Philippe Verdy
2012/12/16 Otto Stolz otto.st...@uni-konstanz.de


 The reason I excluded the surrogates from my UTF-8 MPE
 was really that I needed additional space for the user’s
 guide on the reverse side.


Why would adding a row on the front side not have preserved the space on
the reverse side?
If this is regarded as a didactic tool, adding this row would have focused
more on the validity constraint of UTF-8, enforced in TUS and now as well
in the IETF RFC, which was revised to be fully compatible with TUS.

I think that the row was missing only because your MPE was initially
designed for the old UTF-8 definition in the now obsolete ISO definition,
where the validity constraint was not clear (it was not clear as well in
past variations of UTF-8 that still exist in Java, not really for
plain-text interchange but for the 8-bit JNI API compatible with 8-bit C
strings, and as part of the serialization format of compiled Java classes).

Add this missing row, and everything on the reverse side can remain the
same (or can use a less cryptic, more compact description of how it works).
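The validity constraint being discussed is enforced by modern Unicode
libraries; a minimal Python sketch (an illustration added to this archive):

    # Well-formed UTF-8 may not encode surrogate code points
    # (U+D800..U+DFFF), so a conforming encoder must reject them.
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError as err:
        print(err)    # '... surrogates not allowed'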


Re: UTF-8 ill-formed question

2012-12-16 Thread Otto Stolz

Hello,

2012/12/16 Otto Stolz otto.st...@uni-konstanz.de

The reason I excluded the surrogates from my UTF-8 MPE
was really that I needed additional space for the user’s
guide on the reverse side.


Sorry, typo; I meant: “my UTF-16 MPE”. I added that
extra row (with the branch excluding the surrogates)
to gain extra space on the reverse side.

On 2012-12-16, Philippe Verdy wrote:

Add this missing row, Everything in the reverse side can remain the same
(or can be using a less cryptic compact description of how it works).


I will certainly not change Marco Cimarosti’s original design
of his UTF-8 MPE.

Best wishes,
  Otto Stolz





Re: UTF-8 ill-formed question

2012-12-16 Thread Philippe Verdy
But the old Marco design at that time (2002) was still ignoring the Unicode
UTF-8 conformance constraints, as demonstrated by its use of the obsolete
U-00n notation (matching the obsolete ISO/IETF definition). If the
purpose of this pocket conversion card is to be used for tutorial purposes,
omitting the validity constraint is not very didactic and could continue to
cause compatibility troubles if these rules are not exposed and learnt,
and consequently are ignored in applications.

Note that in my previous post, I dropped the extra leading zeroes in
Marco's use of the obsolete U-00n notation of supplementary
code points, but I forgot to change the U- prefix into U+ for these
supplementary code points. Sorry about that.

Of course there are better ways to present this card as something that will
be printed (then placed under a reusable plastic cover, like an identity
card or driver's licence, at the size of a credit card for your
jacket), using HTML or PDF instead of just this basic plain-text format.
The usage instructions on the back side would also be clearer, and there
would be additional visual hints to make it more obvious. And you would be
less restricted in drawing the diagram, without using the ugly
box-framing characters (only usable with monospaced fonts, which are ugly
for presenting the instructions). The pocket card would also use background
colors to better exhibit an all-white frame where you need to write
something (better than using a dot), and what is fixed in the layout.

There are also other possible presentations, if printing a similar tool on
cardboard: just use rotating wheels (one for VW, one for X, one for Y; you
may ignore the Z wheel, which will display the same value in the input and
in the output window) and a front masking card with windows showing the
input and the result of the conversion! You don't need any pen, it's
reusable, simpler and faster to use.

2012/12/16 Doug Ewell d...@ewellic.org

 I remember Marco's original post in 2002. His intent was to give people
 with an actual U+ code point that needed converting—like James Lin ten
 years later—a quick way to do so without getting immersed in all the
 bit-shifting math.

 If this were a routine being run by a computer, or a tutorial on UTF-8, I
 would agree that it should have taken loose surrogates into account. But
 it's not. It's just a quick manual reference guide, and loose surrogates
 are 0.0001% of the real-world problem for users like James.

 While I note that Philippe's amended version seems straightforward and in
 keeping with Marco's original intent (short and simple), I'd like to
 suggest that neither Marco for creating the original guide, nor anyone else
 for doing up UTF-16 and UTF-32 versions, nor Otto for reposting them on the
 list this week, need to be beaten up any further over this edge case.


 --
 Doug Ewell | Thornton, Colorado, USA
  http://www.ewellic.org | @DougEwell



Re: UTF-8 ill-formed question

2012-12-16 Thread Doug Ewell

Philippe Verdy wrote:

If the purpose of this pocket conversion card is to be used for 
tutorial purposes,


It never was. It was a quick reference guide for experienced users who 
already understood the caveats.


Not worth arguing further.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: UTF-8 ill-formed question

2012-12-12 Thread Otto Stolz

Hello,

On 2012-12-11 20:16, James Lin wrote:

If I have a code point: U+4E8C or 二
In UTF-8, it's E4 BA 8C while in UTF-16, it's 4E8C.
Where does this BA come from?


Cf. http://skew.org/cumped/.

Enclosed are the (almost original) version of “Cima's Magic
UTF-8 Pocket Encoder” (2004), and its two followers for
more UTFs. Display or print with a fixed-pitch font,
such as Lucida Console or Courier New. Enjoy!

Cheers,
   Otto Stolz


Side 1 (print and cut out):

+------------+-------+-----------------------+------+
| U+0000     | yy zz |Cima's UTF-8 Magic     | Hex= |
| U+007F     | !  !  |        Pocket Encoder | B-4  |
|         YZ | .  .  |                       |      |
+------------+-------+-------+ Vers. 1.1     | 0=00 |
| U+0080     | 3x xy | 2y zz |  30 June 2004 | 1=01 |
| U+07FF     | 3. .. | 2. !  |               | 2=02 |
|        XYZ | .  .  | .  .  |          M.C. | 3=03 |
+------------+-------+-------+-------+       | 4=10 |
| U+0800     | 32 ww | 2x xy | 2y zz |       | 5=11 |
| U+FFFF     | !  !  | 2. .. | 2. !  |       | 6=12 |
|       WXYZ | E  .  | .  .  | .  .  |       | 7=13 |
+------------+-------+-------+-------+-------+ 8=20 |
| U-00010000 | 33 0v | 2v ww | 2x xy | 2y zz | 9=21 |
| U-000FFFFF | !  0. | 2. !  | 2. .. | 2. !  | A=22 |
|      VWXYZ | F  .  | .  .  | .  .  | .  .  | B=23 |
+------------+-------+-------+-------+-------+ C=30 |
| U-00100000 | 33 10 | 20 ww | 2x xy | 2y zz | D=31 |
| U-0010FFFF | !  1. | 2. !  | 2. .. | 2. !  | E=32 |
|       WXYZ | F  4  | 8  .  | .  .  | .  .  | F=33 |
+------------+-------+-------+-------+-------+------+

Side 2 (print, cut out, and glue on back of side 1):

+---------------------------------------------------+
| Cima's UTF-8 Magic Pocket Encoder - User's Manual |
| (vers. 1.1, 30 June 2004, by Marco Cimarosti)     |
|                                                   |
| - Left column: min and max Unicode scalar values: |
|   pick the row that applies to the code point you |
|   want to convert to UTF-8. Letters V..Z mark the |
|   hexadecimal digits that have to be processed.   |
| - Right column: hexadecimal to base-4 table.      |
| - Central columns: work area to compute each octet|
|   (1 to 4) that constitute UTF-8 octet sequences. |
| Convert each digit marked by V..Z from hex. to    |
| b.-4. Write b.-4 digits on the dots placed under  |
| letters v..z (two b.-4 digits per hex. digit).    |
| Convert 2-digit base-4 number to hex. digits and  |
| write them on the dots on the line. That is your  |
| UTF-8 sequence in hex.  ! Exclamation marks show  |
| passages that may be skipped, either because the  |
| digit is hard-coded, or because it may be copied  |
| directly from the scalar value.                   |
+---------------------------------------------------+

Enjoy!

Marco
Obverse: Print with a fixed-width font, such as Lucida Console,
and cut out.

╔════════════╦═════════════╦═══════════════════════════════╗
║ U+0000     ║ W  X  Y  Z  ║ Otto’s Magic Pocket Encoder   ║
║ U+D7FF     ║ !  !  !  !  ║ for UTF-16  ╔═════════════════╣
║       WXYZ ║ _  _  _  _  ║             ║ Vvv    │  Vvv   ║
╟────────────╫─────────────╢ Version 1.1 ║ Uuu    │  Uuu   ║
║ U+E000     ║ W  X  Y  Z  ║ ©2004-07-05 ║  ttT   │   ttT  ║
║ U+FFFF     ║ !  !  !  !  ║             ║ ___    │  ___   ║
║       WXYZ ║ _  _  _  _  ║             ║        ┼        ║
╟────────────╫─────────────╚═════════════╣   0=00 │ 138=20 ║
║ U-00010000 ║ 31 2t tu uv │ 31 3v Y  Z  ║ 001=01 │ 209=21 ║
║ U-000FFFFF ║ !  2_ __ __ │ !  3_ !  !  ║ 012=02 │ 21A=22 ║
║      TUVYZ ║ D  _  _  _  │ D  _  _  _  ║ 023=03 │ 22B=23 ║
╟────────────╫─────────────┼─────────────╢ 034=10 │ 23C=30 ║
║ U-00100000 ║ 31 23 3u uv │ 31 3v Y  Z  ║ 105=11 │ 30D=31 ║
║ U-0010FFFF ║ !  !  3_ __ │ !  3_ !  !  ║ 116=12 │ 31E=32 ║
║       UVYZ ║ D  B  _  _  │ D  _  _  _  ║ 127=13 │ 32F=33 ║
╚════════════╩═════════════╧═════════════╩═════════════════╝


....:....1....:....2....:....3....:....4....:....5....:....6..


Reverse: Cut out and paste on back of obverse.

╔════════════════════════════════════════════════════════════╗
║     Otto’s Magic Pocket Encoder for UTF-16 Version 1.1     ║
║       User’s Manual (inspired from Cima’s UTF-8 MPE)       ║
╠════════════════════════════════════════════════════════════╣
║• Left column: min and max Unicode scalar values: pick the  ║
║  row that applies to the code point to be converted.       ║
║  T…Z mark the hexadecadic digits that have to be processed.║
║• Central column: work area to compute UTF-16BE code units. ║
║• Right column: hexadecadic to quaternary conversion tables:║
║   for T to tt; = for U/V to uu/vv (step 1) and for step 2. ║
║1. Convert each digit marked by T…V from hex to quat. Write ║
║   quat digits on the underscores placed under letters t…v. ║
║2. Convert 2-digit quat numbers to hex digits or copy digits║
║   W…Z, as indicated, and write them on the underscores on  ║
║   the next line. That’s your UTF-16BE sequence in hex.     ║
╚════════════════════════════════════════════════════════════╝
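A minimal Python sketch of the surrogate-pair arithmetic that the UTF-16 card
tabulates (an illustration added to this archive; U+10400 is an arbitrary
supplementary code point):

    cp = 0x10400
    u = cp - 0x10000                    # reduce to 20 bits
    high = 0xD800 + (u >> 10)           # leading (high) surrogate
    low = 0xDC00 + (u & 0x3FF)          # trailing (low) surrogate
    print(f"{high:04X} {low:04X}")      # D801 DC00

    # Cross-check against the built-in encoder:
    print("\U00010400".encode("utf-16-be").hex().upper())   # D801DC00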

Re: UTF-8 ill-formed question

2012-12-11 Thread Asmus Freytag

On 12/11/2012 11:50 AM, vanis...@boil.afraid.org wrote:

From: James Lin James_Lin_at_symantec.com

Hi
Does anyone know why ill-formed sequences occur in UTF-8? Besides that they
 don't follow the pattern of UTF-8 byte sequences, I'm just wondering how or why.
If I have a code point: U+4E8C or 二
In UTF-8, it's E4 BA 8C while in UTF-16, it's 4E8C. Where does this BA
come from?

thanks
-James

Each of the UTF encodings represents the binary data in different ways. So we
need to break the scalar value, U+4E8C, into its binary representation before
we proceed.

4E8C -> 0100 1110 1000 1100

Then, we need to look up the rules for UTF-8. It states that code points
between U+0800 and U+FFFF are encoded with three bytes, in the form
1110xxxx 10xxxxxx 10xxxxxx. So plugging in our data, we get

      4    E    8    C
   0100 1110 1000 1100
   -> 0100 / 111010 / 001100     (regrouped as 4 + 6 + 6 bits)

 + 1110xxxx 10xxxxxx 10xxxxxx
 = 11100100 10111010 10001100
 or    E  4     B  A     8  C

-Van Anderson


Nice!

A./

PS: I fixed a missing \
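A minimal Python sketch of the same three-byte computation (an illustration
added to this archive), for readers who prefer code to diagrams:

    # Three-byte UTF-8 encoding of U+4E8C, exactly as in the
    # walkthrough above.
    cp = 0x4E8C
    encoded = bytes([
        0xE0 | (cp >> 12),           # 1110xxxx : top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx : middle 6 bits
        0x80 | (cp & 0x3F),          # 10xxxxxx : low 6 bits
    ])
    print(encoded.hex().upper())                # 'E4BA8C'
    print(encoded == "\u4e8c".encode("utf-8"))  # True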



Re: UTF-8 ill-formed question

2012-12-11 Thread James Lin
thank you so much everyone for explaining it. I got it now!

-James

On 12/11/12 11:50 AM, vanis...@boil.afraid.org
vanis...@boil.afraid.org wrote:

From: James Lin James_Lin_at_symantec.com
 Hi
 Does anyone know why ill-formed sequences occur in UTF-8? Besides that
 they don't follow the pattern of UTF-8 byte sequences, I'm just wondering
 how or why?
 If I have a code point: U+4E8C or 二
 In UTF-8, it's E4 BA 8C while in UTF-16, it's 4E8C. Where does this
 BA come from?

 thanks
 -James

Each of the UTF encodings represents the binary data in different ways.
So we 
need to break the scalar value, U+4E8C, into its binary representation
before 
we proceed.

4E8C -> 0100 1110 1000 1100

Then, we need to look up the rules for UTF-8. It states that code points
between U+0800 and U+FFFF are encoded with three bytes, in the form
1110xxxx 10xxxxxx 10xxxxxx. So plugging in our data, we get

     4    E    8    C
  0100 1110 1000 1100
  -> 0100 / 111010 / 001100     (regrouped as 4 + 6 + 6 bits)

+ 1110xxxx 10xxxxxx 10xxxxxx
= 11100100 10111010 10001100
or    E  4     B  A     8  C

-Van Anderson





Question about normalization tests

2012-12-10 Thread Edwin Hoogerbeets
Hi there,

I'm going through NormalizationTest.txt in the 6.3.0d1 database,
and I ran across this line:

0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE
0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300
0315 0062; # (a◌̅◌̕◌̀◌֮b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; ) 
LATIN SMALL
LETTER A, COMBINING OVERLINE, COMBINING COMMA ABOVE RIGHT, COMBINING
GRAVE ACCENT, HEBREW ACCENT ZINOR, LATIN SMALL LETTER B

The relevant parts for my question are:

Source: 0061 0305 0315 0300 05AE 0062
NFD: 0061 05AE 0305 0300 0315 0062
NFC: 0061 05AE 0305 0300 0315 0062

I agree with the NFD decomposition result, but the NFC one seems wrong
to me. If you look at rule D117 in the Unicode Spec
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf (I couldn't find
the spec for 6.3 -- hopefully 6.2 is close enough), it gives the
algorithm for NFC composition. The way I interpret it, this is how the
composition proceeds:

Starting with the NFD decomposition string, we retrieve the combining
classes for each character from the UnicodeData.txt file:

0061 - 0
05AE - 228
0305 - 230
0300 - 230
0315 - 232
0062 - 0

You start at the first character after the starter (0061, with ccc=0),
which is 05AE. There is no primary composition for the sequence 0061
05AE, so you move on.

Looking at 0305, it is not blocked from 0061, so check the primary
composition for 0061 0305. There is none for that either, so move on.

Looking at 0300, it is also not blocked from 0061, so check the primary
composition for 0061 0300. There is a primary composition for that
sequence, 00E0, so replace the starter with that, delete the 0300, and
continue. The string looks like this now:

00E0 - 0
05AE - 228
0305 - 230
0315 - 232
0062 - 0

Checking 0315 and 0062, they are not blocked, but there is no
composition with 00E0, so the algorithm ends with the result:
00E0 05AE 0305 0315 0062

This disagrees with what it says in the normalization tests file as
listed above. The question is, did I misunderstand the algorithm, or is
this perhaps a bug in the data file?

Thanks,

Edwin




Re: Question about normalization tests

2012-12-10 Thread Mark Davis ☕
0300 *is* blocked, because there is a preceding character (0305) that has
the same combining class (230).

Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —



On Mon, Dec 10, 2012 at 11:55 AM, Edwin Hoogerbeets
ehoogerbe...@gmail.comwrote:

 Looking at 0300, it is also not blocked from 0061, so check the primary
 composition for 0061 0300. There is a primary composition for that
 sequence, 00E0, so replace the starter with that, delete the 0300, and
 continue. The string looks like this now:



RE: Question about normalization tests

2012-12-10 Thread Whistler, Ken
Your misunderstanding is at the highlighted statement below. Actually 0300 *is* 
blocked from 0061 in this sequence, because it is preceded by a character with 
the same canonical combining class (i.e. U+0305, ccc=230). A blocking context 
is the preceding combining character either having ccc=0 or having ccc greater 
than or equal to the character being checked.

--Ken


Starting with the NFD decomposition string, we retrieve the combining classes 
for each character from the UnicodeData.txt file:

0061 - 0
05AE - 228
0305 - 230
0300 - 230
0315 - 232
0062 - 0

You start at the first character after the starter (0061, with ccc=0), which is 
05AE. There is no primary composition for the sequence 0061 05AE, so you move 
on.

Looking at 0305, it is not blocked from 0061, so check the primary composition 
for 0061 0305. There is none for that either, so move on.

Looking at 0300, it is also not blocked from 0061, so check the primary 
composition for 0061 0300. There is a primary composition for that sequence, 
00E0, so replace the starter with that, delete the 0300, and continue. The 
string looks like this now:

00E0 - 0
05AE - 228
0305 - 230
0315 - 232
0062 - 0

Checking 0315 and 0062, they are not blocked, but there is no composition with 
00E0, so the algorithm ends with the result:
00E0 05AE 0305 0315 0062

This disagrees with what it says in the normalization tests file as listed 
above. The question is, did I misunderstand the algorithm, or is this perhaps a 
bug in the data file?

Thanks,

Edwin
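A minimal Python sketch confirming the blocked composition with the standard
library's normalizer (an illustration added to this archive):

    import unicodedata

    # The test line discussed above: NFC leaves the string fully
    # decomposed, because U+0300 is blocked from the starter by
    # U+0305 (both have ccc=230).
    src = "\u0061\u0305\u0315\u0300\u05AE\u0062"
    nfc = unicodedata.normalize("NFC", src)
    print(" ".join(f"{ord(c):04X}" for c in nfc))
    # 0061 05AE 0305 0300 0315 0062  (matches the test file)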



Fwd: Re: Question about normalization tests

2012-12-10 Thread Edwin Hoogerbeets
Ah yes, I did indeed miss the "equal to" part. I fixed up my code and
now it works as expected.

Thanks to Mark and Ken for your help and speedy response!

Edwin

On 12/10/2012 12:57 PM, Whistler, Ken wrote:

 Your misunderstanding is at the highlighted statement below. Actually
 0300 **is** blocked from 0061 in this sequence, because it is preceded
 by a character with the same canonical combining class (i.e. U+0305,
 ccc=230). A blocking context is the preceding combining character
 either having ccc=0 or having ccc greater than *or equal to* the
 character being checked.

  

 --Ken

  






A question about the default grapheme cluster boundaries with U+0020 as the grapheme base

2012-06-01 Thread Konstantin Ritt
It seems like there is an inconsistency between what the default
grapheme clusters specification says and what the test results are
expected to be:

The UAX#29 says:
 Another key feature (of default Unicode grapheme clusters) is that *default 
 Unicode grapheme clusters are atomic units with respect to the process of 
 determining the Unicode default line, word, and sentence boundaries*.
Also this mentioned in UAX#14:
 Example 6. Some implementations may wish to tailor the line breaking 
 algorithm to resolve grapheme clusters according to Unicode Standard Annex 
 #29, “Unicode Text Segmentation” [UAX29], as a first stage. *Generally, the 
 line breaking algorithm does not create line break opportunities within 
 default grapheme clusters*; therefore such a tailoring would be expected 
 to produce results that are close to those defined by the default algorithm. 
 However, if such a tailoring is chosen, characters that are members of line 
 break class CM but not part of the definition of default grapheme clusters 
 must still be handled by rules LB9 and LB10, or by some additional tailoring.

However, U+0020 (SP), U+0308 (CM) in the line breaking algorithm is
handled by rules LB10+LB18 and produces a break opportunity, while
GB9 prohibits a break between U+0020 (Other) and U+0308 (Extend).
Section 9.2, Legacy Support for Space Character as Base for Combining
Marks, in UAX#29 clarifies why a line break occurs there, but the fact
remains that the statements above are false and introduce some
ambiguity.
If the space character is no longer a grapheme base, the grapheme
cluster breaking rules need to be updated.

Kind regards,
Konstantin
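A minimal sketch of the behavior in question, assuming the third-party regex
module (which implements the UAX #29 default grapheme clusters); an
illustration added to this archive:

    import regex   # pip install regex

    # SP + combining mark: one default grapheme cluster per GB9,
    # even though UAX #14 allows a line break inside it.
    clusters = regex.findall(r"\X", "\u0020\u0308b")
    print(len(clusters), clusters)   # 2 clusters: [' ̈', 'b']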




Re: Question on U+33D7

2012-02-24 Thread Shriramana Sharma
Grandpa grandpa I wanna hear the story about the turtles *now*! :-)

Sent from my Android phone


Re: Question on U+33D7

2012-02-24 Thread Matt Ma
On Fri, Feb 24, 2012 at 5:18 AM, Shriramana Sharma samj...@gmail.com wrote:
 Grandpa grandpa I wanna hear the story about the turtles *now*! :-)

 Sent from my Android phone

Thanks all for the enlightening reply.

My intent was sorting using the UCA, but it really did not matter much
because U+33D7 was sorted after PH in either case (0050 0048 or
0070 0048). I was curious why U+33D7 was defined and stayed that way
in Unicode, and it was answered more than comprehensively.

Regards,
Matt




Question on U+33D7

2012-02-23 Thread Matt Ma
It is defined as 33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH;;;;
in UnicodeData.txt, but it is shown as pH
in the code chart. Should it be 0070 0048 or PH?

Thanks,
Matt
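A one-line Python check (an illustration added to this archive) shows the
mapping as it stands -- frozen, as the replies below explain:

    import unicodedata

    # The compatibility decomposition is frozen as uppercase P + H:
    print(unicodedata.normalize("NFKC", "\u33d7"))   # 'PH'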



Re: Question on U+33D7

2012-02-23 Thread António Martins-Tuválkin
On 2012/2/23 Matt Ma matt.ma.um...@gmail.com wrote:

 It is defined as
 33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH;;;;
 in UnicodeData.txt, but it is shown as pH in the code chart. Should it be
 0070 0048 or PH?

It should certainly be pH, i.e., <square> 0070 0048,
because that's the peculiar casing in widespread (universal, really)
use for this basic Chemistry concept (AFAIK it means power of
Hydrogen). See  http://en.wikipedia.org/wiki/pH#History .

While there's no surprise at PH Unicode names being all caps, I’m
surprised that the decomposition mapping is wrongly set to 0050 0048
instead of to 0070 0048.

--                                                                  .
António MARTINS-Tuválkin                                           |  ()|
tuval...@gmail.com                     Não me invejo de quem tem ||
PT-1500-111 LISBOA                       carros, parelhas e montes      |
+351 934 821 700, +351 212 463 477       só me invejo de quem bebe      |
facebook.com/profile.php?id=744658416    a água em todas as fontes      |
-
De sable uma fonte e bordadura escaqueada de jalde e goles, por timbre a
bandeira, por mote o 1º verso acima, e por grito de guerra Mi rajtas!.
-




Re: Question on U+33D7

2012-02-23 Thread Asmus Freytag

On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote:

On 2012/2/23 Matt Mamatt.ma.um...@gmail.com  wrote:


It is defined as
33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH;;;;
in UnicodeData.txt, but it is shown as pH in code chart. Should it be
0070 0048 or PH?

It should certainly be pH, i.e., <square> 0070 0048,
because that's the peculiar casing in widespread (universal, really)
use for this basic Chemistry concept (AFAIK it means power of
Hydrogen). See  http://en.wikipedia.org/wiki/pH#History.

While there's no surprise at PH Unicode names being all caps, I’m
surprised that the decomposition mapping is wrongly set to 0050 0048
instead of to 0070 0048.


The early fonts and code tables showed this in all caps.

Unfortunately, mappings are frozen - including mistakes.

One of the many reasons not to use NFKD or NFKC for transforming 
data - these transformations should be limited to dealing with 
identifiers, where practically all of the problematic characters are 
already disallowed.


If your intent is to sort or search a document using fuzzy 
equivalences, then you are not required to limit yourself to the 
NFKC/D transformations in any way, because you would not be claiming to be 
normalizing the text in the sense of a Unicode Normalization Form.


A./




Re: Question on U+33D7

2012-02-23 Thread Ken Whistler

On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote:

It is defined as
  33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH;;;;
  in UnicodeData.txt, but it is shown as pH in code chart. Should it be
  0070 0048 or PH?

It should certainly be pH, i.e., <square> 0070 0048,
because that's the peculiar casing in widespread (universal, really)
use for this basic Chemistry concept (AFAIK it means power of
Hydrogen). See  http://en.wikipedia.org/wiki/pH#History.

While there's no surprise at PH Unicode names being all caps, I’m
surprised that the decomposition mapping is wrongly set to 0050 0048
instead of to 0070 0048.


O.k., folks, I guess it's time for everybody to gather around the fire
for another episode of Every Character Has a Story.

First, to answer Matt Ma's original question, no, the decomposition
should *not* be <square> 0070 0048. The reason for that is simple: no
matter what the glyph looks like, or what people think the character
might mean, the decomposition mapping is immutable -- constrained by the
stability guarantees for Unicode normalization.

U+33D7 had that decomposition mapping as of Unicode 3.1, which defines
the base for normalization stability, so right or wrong, come hell or
high water, it stays that way forever.

But that begs the question of how it got to be that way in the first
place. To answer that, we have to dig deeper into the history of the
encoding.

If you will now pull down your copies of Unicode 1.0 off the shelf and
turn to p. 362, you will see that U+33D7 was included in Unicode 1.0. Lo
and behold, the glyph shown in the charts for U+33D7 is PH, with a
capital P, rather than a lowercase p. (The character was also named
SQUARED PH, rather than the current SQUARE PH, but the explanation for
that will have to wait for another evening.)

Unicode 1.0 didn't have any formal decompositions, but Unicode 1.*1*
did. In Unicode 1.1, on p. 75, the decomposition for U+33D7 is given as
[0050] [0048], reflecting the glyph shown for the character in
Unicode 1.0.

It was Unicode 2.0 which changed the glyph for U+33D7 to pH, on the
assumption that the character must have been intended as an East Asian
square symbol representation of the chemical symbol pH. The
decomposition for U+33D7 was not adjusted, however, although its format
was shifted to <square> + 0050 P + 0048 H in the charts. Now tracking
down the details of the decision process that was involved in changing
the glyph for U+33D7 for Unicode 2.0 is pretty difficult. The
development of the suite of fonts for printing Unicode 2.0 was a pretty
wild and wooly process, as that was the first attempt to print the
entire set of charts with outline fonts. Unicode 1.0 had been printed
with a bitmap font developed at Xerox in the early early days. Some of
the glyph changes between Unicode 1.0 and 2.0 just happened, despite the
care which was taken to try to check everything.

I'm pretty sure that the glyph change for U+33D7 was discussed by the
editors at some point (in either late 1995 or very early 1996), but at
that stage in the development of the standard that kind of thing was
usually not recorded on an item-by-item basis. Remember, there was a
*lot* going on then which was much more important to the UTC than the
glyph for some East Asian compatibility character that nobody used: the
design of UTF-8, for example!

Speaking of use of the character, where *did* it come from exactly, and
what was it intended for? Well, that is also problematical. *Most* of
the characters in the CJK Compatibility block in the range
U+3380..U+33DD can easily be traced to KS X 1001:1992 (then known as
KS C 5601) or CNS 11643. But U+33D7, U+33DA, and U+33DB are anomalous.
They didn't have any mappings (that I knew about) as of Unicode 1.0.
They may have come from some early draft of a Korean standard, or from
some Asian company private registry of character extensions, or maybe
just from a paper copy of character stuff sitting around at Xerox circa
1989. Nobody really seemed to be sure what they were -- they were just
more ill-advised East Asian squared abbreviation dreck that was added to
the pile and not examined very carefully, because everybody knew that
such symbols for SI units (and other scientific and math symbols of
their ilk, such as "ln" for natural logarithm) should just be spelled
out with regular characters.

We can presume, in hindsight, that U+33D7 *may* have been originally
intended as an East Asian character set abbreviation symbol for the
chemical concept of pH. U+33D9 was presumably intended for parts per
million, although I don't recall that anybody has actually bothered to
think about that, and if they had, they might have suggested that the
glyph for *that* symbol also be changed, to the more usual lowercase
"ppm". And U+33DA "PR"? Who knows? My guess would be an abbreviation
for "per radian", as in 57.2957 degrees per radian, but your guess is
as good as mine. I suppose
it could have

Re: Question on UCA collation parameters (strength = tertiary, alternate = shifted)

2011-12-01 Thread Matt Ma
In addition, the default settings in Table 14 of UTS #10 (version 6.0.0)
are:

  strength: tertiary
  alternate: shifted

But those settings won't generate the conformant behavior specified by
CollationTest_SHIFTED.txt.

I think when alternate is set to shifted, strength should be set to
quaternary (as a default) unless it is explicitly set.

Thanks,
Matt

On Tue, Nov 29, 2011 at 12:55 PM, Matt Ma matt.ma.um...@gmail.com wrote:
 Thanks for the clarification. But to pass the UCA conformance test on Shifted,
 does the strength have to be set to quaternary? However, it is stated
 in UCA, C2, A conformant implementation shall support at least three
 levels of collation.

 Does this mean a UCA conformant implementation only need pass UCA
 conformance test on Non-Ignorable?

 Regards,
 Matt

 On Tue, Nov 29, 2011 at 12:49 PM, Mark Davis ☕ m...@macchiato.com wrote:
 Yes, if the strength is tertiary, then Blanked and Shifted give the same
 results.
 http://www.unicode.org/reports/tr10/proposed.html#Variable_Weighting

 Mark
 — Il meglio è l’inimico del bene —
 [https://plus.google.com/114199149796022210033]


 On Tue, Nov 29, 2011 at 19:11, Matt Ma matt.ma.um...@gmail.com wrote:

 Hi,

 Does Shifted imply strength being quaternary? If strength stays at
 tertiary (default or explicitly set), it seems the collation behavior
 is Blanked. Please clarify.

 Thanks,
 Matt








Question on UCA collation parameters (strength = tertiary, alternate = shifted)

2011-11-29 Thread Matt Ma
Hi,

Does Shifted imply strength being quaternary? If strength stays at
tertiary (default or explicitly set), it seems the collation behavior
is Blanked. Please clarify.

Thanks,
Matt
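A minimal sketch of how the two parameters interact, assuming PyICU (ICU's UCA
implementation) is installed; the attribute and constant names below are
PyICU's, and this is an illustration added to this archive:

    from icu import Collator, Locale, UCollAttribute, UCollAttributeValue

    coll = Collator.createInstance(Locale("root"))
    coll.setAttribute(UCollAttribute.ALTERNATE_HANDLING,
                      UCollAttributeValue.SHIFTED)

    # At tertiary strength, a variable character (the hyphen) is
    # effectively ignored, as with Blanked:
    coll.setStrength(Collator.TERTIARY)
    print(coll.compare("deluge", "de-luge"))   # 0

    # Raising the strength to quaternary makes the shifted weights
    # significant again:
    coll.setStrength(Collator.QUATERNARY)
    print(coll.compare("deluge", "de-luge"))   # non-zero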



Re: Question on UCA collation parameters (strength = tertiary, alternate = shifted)

2011-11-29 Thread Matt Ma
Thanks for the clarification. But to pass the UCA conformance test on Shifted,
does the strength have to be set to quaternary? However, it is stated
in UCA, C2, A conformant implementation shall support at least three
levels of collation.

Does this mean a UCA conformant implementation only need pass UCA
conformance test on Non-Ignorable?

Regards,
Matt

On Tue, Nov 29, 2011 at 12:49 PM, Mark Davis ☕ m...@macchiato.com wrote:
 Yes, if the strength is tertiary, then Blanked and Shifted give the same
 results.
 http://www.unicode.org/reports/tr10/proposed.html#Variable_Weighting

 Mark
 — Il meglio è l’inimico del bene —
 [https://plus.google.com/114199149796022210033]


 On Tue, Nov 29, 2011 at 19:11, Matt Ma matt.ma.um...@gmail.com wrote:

 Hi,

 Does Shifted imply strength being quaternary? If strength stays at
 tertiary (default or explicitly set), it seems the collation behavior
 is Blanked. Please clarify.

 Thanks,
 Matt







RE: Pupil's question about Burmese

2010-11-10 Thread Shawn Steele
FWIW: The OS really likes Unicode, so lots of the text input, etc., is really 
Unicode.  ANSI apps (including non-Unicode web pages) get the data back from 
those controls in ANSI, so you can lose data that it looked like you entered.
As mentioned, the solution is to fix the app to use Unicode.  Especially for 
a language like this.  In these cases, machines will be fairly inconsistent 
even if they did support some code page, but Unicode works most everywhere.

Usually it's not difficult for a web page to switch to UTF-8.  If it's a form, 
it's even possible that overriding it on your end might get the data posted 
back in UTF-8 and succeed (if you're really lucky), but the real fix is to have 
the web server serve Unicode.

-Shawn

 
http://blogs.msdn.com/shawnste
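A minimal Python sketch of the data loss described above (an illustration
added to this archive) -- round-tripping Burmese text through a legacy
single-byte code page destroys it:

    # Burmese text forced through a legacy single-byte code page:
    text = "\u1019\u103C\u1014\u103A\u1019\u102C"    # "Myanmar" in Burmese
    legacy = text.encode("cp1252", errors="replace")
    print(legacy)                   # b'??????' -- every character replaced
    print(legacy.decode("cp1252"))  # '??????' -- the original text is gone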



From: unicode-bou...@unicode.org [unicode-bou...@unicode.org] on behalf of 
Peter Constable [peter...@microsoft.com]
Sent: Tuesday, November 09, 2010 10:42 PM
To: James Lin; Ed
Cc: Unicode Mailing List
Subject: RE: Pupil's question about Burmese

A non-Unicode web page is like a non-Unicode app. Web pages, and apps, should 
use Unicode.


Peter

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of James Lin
Sent: Tuesday, November 09, 2010 11:24 AM
To: Ed
Cc: Unicode Mailing List
Subject: RE: Pupil's question about Burmese

Oh, don't get me wrong. Having Unicode is like wearing a crown and being a 
king. It's the best thing out there.

What I am referring to is: if a web page does not support Unicode, or for any 
application that does not support Unicode, even running Windows 7 with an 
English locale (even though natively it supports UTF-16), it is not possible 
to directly copy/paste without having the correct supported locale; if not, 
you may damage the bytes of the characters, which shows up as corruption.

Even though most modern APIs are (hopefully) written with Unicode calls, not 
all (legacy) applications are written in Unicode, so conversion is still 
necessary even to handle the non-ASCII data.

Let me know if I am still missing something here.

-Original Message-
From: Ed [mailto:ed.tra...@gmail.com]
Sent: Tuesday, November 09, 2010 11:02 AM
To: James Lin
Cc: Unicode Mailing List
Subject: Re: Pupil's question about Burmese


 Yes, displaying is fine, but the original question is copying and
 pasting; without the correct locale settings, you can’t copy/paste
 without corrupting the byte sizes.  Copy/paste is generally handled by
 the OS itself, not the application.  Even if you have a Unicode-supporting
 application, you can display, but you can’t handle non-ASCII characters.

Why not?  Modern Win32 OSes use UTF-16.  Presumably most modern applications 
are written using calls to the modern API, which should seamlessly support 
copy-and-paste of Unicode text, regardless of script or language -- so long as 
the script or language is supported at the level of displaying the text 
correctly and you have a font that works for that script.  Actually, even if 
the text displays imperfectly (i.e., one sees square boxes when lacking a 
proper font, or even if OpenType GPOSs and GSUBs are not correct for a Complex 
Text Layout script like Burmese), copy-and-paste of the raw Unicode text 
should still work correctly.

Is this not the case?




Re: Pupil's question about Burmese

2010-11-10 Thread Keith Stribley

On 11/10/2010 02:17 PM, Shawn Steele wrote:

As mentioned, the solution is to fix the app to use Unicode.  Especially for 
a language like this.  In these cases, machines will be fairly inconsistent even if they 
did support some code page, but Unicode works most everywhere.



Afaik there has never been a standard code page for Myanmar text; 
Unicode was the first time storage of Burmese text was standardised for 
computers. There are several different legacy font families in use for 
Myanmar, each with its own slightly different mapping to Latin code 
points. The font in question has a Unicode cmap table, but the map is 
from Latin code points to glyphs, not from Myanmar code points to 
glyphs. There are also several fonts which map incorrectly from the 
Myanmar Unicode block, using the Mon, Shan and Karen code points for 
glyph variants so the font can avoid having OpenType/Graphite/AAT rules.


If anyone is having trouble installing genuine Myanmar Unicode fonts, 
then I have some instructions at


http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/gettingStarted.php

Keith





Re: Pupil's question about Burmese

2010-11-09 Thread Ngwe Tun
Dear Peter Constable,
*
Burmese_is_supported in windows.*

It makes worse than ever to create another story like pseudo-unicode like
Zawgyi in Windows. too.

We are in dead-lock because without releasing Myanmar Opentype specifiction
for burmese by Microsoft. We can't implement burmese in opentype adopted
rendering engine like pango and harfbuzz.

We are not satisify just typing burmese text and printing burmese text. We
want to have effective use of unicode data in burmese language processing
like spelling check, machine translation and OCR.

So, Do we need system locale for Burmese? How about CultureInfo for
Microsoft .Net Framework.

I've encouraged to use Unicode standards among Myanmar Users. Myanmar Users
willing to use unicode standards in their works, personal and every
application. But there are no advantages in using Unicode Standards and CLDR
too. If Unicode.org make standards and do not apply those standards in
software and systems, how can we trust those standards. Myanmar Users do not
wait on Microsoft, Apple, Oracle implementation. They are going wrong or
breakthrough solution.

Again. I have to say caution about ethnics language. We should take care
about Mon, Shan and Karen Language which is encoded in Unicode 5.1 But
Microsoft didn't assign yet for those language in Windows 7

I'm trying to get Burmese Language Pack in Microsoft Windows .since 2002. I
gave up and no more try to get it. Microsoft not waiting stable Standards,
Politics and/or Technical. I don't not any of reason for delaying our
beloved language.

Thanks for reading it and support for 40 million speaking language. We did
petition to Microsoft at http://petition.myanmarlanguage.org/

http://my.wiktionary.org is the good dictionary site. It is started but not
yet finisned.

Best

Ngwe Tun


On Tue, Nov 9, 2010 at 8:52 AM, Peter Constable peter...@microsoft.comwrote:

 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
 Behalf Of Andrew Cunningham

  Your system locale has to handle the Burmese language.  So you need to
  either install Windows 7 in Burmese or change under Regional /
  Language options in Control panel, under Adv tab.

  well considering Burmese is a language that is not supported by Microsoft
 ... the above is relatively irrelevant.

 At whatever point Burmese _is_ supported in Windows, system locale will not
 be relevant. To be clear, the legacy Windows notion of system locale is
 relevant only in relation to apps that support only legacy Windows
 encodings, not Unicode. There is no system locale support for languages such
 as Hindi or Armenian or Khmer, but that does not prevent display of text in
 those scripts in Unicode-capable applications. So, for instance, every copy
 of Windows 2000 or later versions is capable of displaying Hindi or Armenian
 text, regardless of the system locale setting; every copy of Windows Vista
 or later is capable of displaying, in addition, text in scripts such as
 Khmer and Ethiopic; and every copy of Windows 7 is, additionally, able to
 display text in scripts Tifinagh and Tai Le. In all these cases, the system
 locale setting has no bearing.



 Peter







Re: Pupil's question about Burmese

2010-11-09 Thread Peter Edberg
Dear Ngwe Tun,
The forthcoming ICU 4.6 will include a Burmese locale (using CLDR data), with 
support for Burmese collation.
http://site.icu-project.org/

Best regards,
Peter Edberg
 
On Nov 9, 2010, at 2:05 AM, Ngwe Tun wrote:

 ...
 
 We are in deadlock because Microsoft has not released the Myanmar OpenType 
 specification for Burmese. We can't implement Burmese in OpenType-adopting  
 rendering engines like Pango and HarfBuzz.
 
 We are not satisfied with just typing and printing Burmese text. We want 
 to make effective use of Unicode data in Burmese language processing, 
 like spell checking, machine translation and OCR.
 
 ...
 
 I've encouraged the use of Unicode standards among Myanmar users. Myanmar 
 users are willing to use Unicode standards in their work, personally and in 
 every application. But there are no advantages yet in using the Unicode 
 Standard and CLDR. If Unicode.org makes standards and those standards are not 
 applied in software and systems, how can we trust those standards? Myanmar 
 users will not wait on Microsoft, Apple or Oracle implementations. They will 
 go with a wrong, or breakthrough, solution.
 
 





Re: Pupil's question about Burmese

2010-11-09 Thread James Lin
 So, for instance, every copy of Windows 2000 or later versions is capable of
displaying Hindi or Armenian text, regardless of the system locale setting;
every copy of Windows Vista or later is capable of displaying, in addition,
text in scripts such as Khmer and Ethiopic; and every copy of Windows 7 is,
additionally, able to display text in scripts Tifinagh and Tai Le. In all
these cases, the system locale setting has no bearing.

Yes, displaying is fine, but the original question is copying and pasting;
without the correct locale settings, you can't copy/paste without corrupting
the byte sizes.  Copy/paste is generally handled by the OS itself, not the
application.  Even if you have a Unicode-supporting application, you can
display, but you can't handle non-ASCII characters.
 



On 11/8/10 6:22 PM, Peter Constable peter...@microsoft.com wrote:

 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf
 Of Andrew Cunningham
 
  Your system locale has to handle the Burmese language.  So you need to
  either install Windows 7 in Burmese or change under Regional /
  Language options in Control panel, under Adv tab.
 
  well considering Burmese is a language that is not supported by Microsoft
 ... the above is relatively irrelevant.
 
 At whatever point Burmese _is_ supported in Windows, system locale will not be
 relevant. To be clear, the legacy Windows notion of system locale is relevant
 only in relation to apps that support only legacy Windows encodings, not
 Unicode. There is no system locale support for languages such as Hindi or
 Armenian or Khmer, but that does not prevent display of text in those scripts
 in Unicode-capable applications. So, for instance, every copy of Windows 2000
 or later versions is capable of displaying Hindi or Armenian text, regardless
 of the system locale setting; every copy of Windows Vista or later is capable
 of displaying, in addition, text in scripts such as Khmer and Ethiopic; and
 every copy of Windows 7 is, additionally, able to display text in scripts
 Tifinagh and Tai Le. In all these cases, the system locale setting has no
 bearing.
 
 
 
 Peter
 
 
 



Re: Pupil's question about Burmese

2010-11-09 Thread Ed

 Yes, displaying is fine, but the original question is copying and pasting;
 without the correct locale settings, you can’t copy/paste without corrupting
 the byte sizes.  Copy/paste is generally handled by the OS itself, not the
 application.  Even if you have a Unicode-supporting application, you can
 display, but you can’t handle non-ASCII characters.

Why not?  Modern Win32 OSes use UTF-16.  Presumably most modern
applications are written using calls to the modern API, which should
seamlessly support copy-and-paste of Unicode text, regardless of
script or language -- so long as the script or language is supported
at the level of displaying the text correctly and you have a font that
works for that script.  Actually, even if the text displays
imperfectly (i.e., one sees square boxes when lacking a proper font,
or even if OpenType GPOSs and GSUBs are not correct for a Complex Text
Layout script like Burmese), copy-and-paste of the raw Unicode text
should still work correctly.

Is this not the case?




RE: Pupil's question about Burmese

2010-11-09 Thread James Lin
Oh, don't get me wrong. Having Unicode is like wearing a crown and being a 
king. It's the best thing out there.

What I am referring to is: if a web page does not support Unicode, or for any 
application that does not support Unicode, even running Windows 7 with an 
English locale (even though natively it supports UTF-16), it is not possible 
to directly copy/paste without having the correct supported locale; if not, 
you may damage the bytes of the characters, which shows up as corruption.

Even though most modern APIs are (hopefully) written with Unicode calls, not 
all (legacy) applications are written in Unicode, so conversion is still 
necessary even to handle the non-ASCII data.

Let me know if I am still missing something here.

-Original Message-
From: Ed [mailto:ed.tra...@gmail.com] 
Sent: Tuesday, November 09, 2010 11:02 AM
To: James Lin
Cc: Unicode Mailing List
Subject: Re: Pupil's question about Burmese


 Yes, displaying is fine, but the original question is copying and 
 pasting; without the correct locale settings, you can’t copy/paste 
 without corrupting the byte sizes.  Copy/paste is generally handled by 
 the OS itself, not the application.  Even if you have a Unicode-supporting 
 application, you can display, but you can’t handle non-ASCII characters.

Why not?  Modern Win32 OSes use UTF-16.  Presumably most modern applications 
are written using calls to the modern API, which should seamlessly support 
copy-and-paste of Unicode text, regardless of script or language -- so long as 
the script or language is supported at the level of displaying the text 
correctly and you have a font that works for that script.  Actually, even if 
the text displays imperfectly (i.e., one sees square boxes when lacking a 
proper font, or even if OpenType GPOSs and GSUBs are not correct for a Complex 
Text Layout script like Burmese), copy-and-paste of the raw Unicode text 
should still work correctly.



