Re: Private Use areas

2018-08-31 Thread William_J_G Overington via Unicode

Hi

I have now found the following document.

http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

William Overington

Friday 31 August 2018



Original message
From : wjgo_10...@btinternet.com
Date : 2018/08/31 - 21:43 (GMTDT)
To : m...@kli.org, unicode@unicode.org
Subject : Re: Private Use areas

Hi

Thank you for your posts from earlier today.

Actually I learned about JSON yesterday and I am thinking that using JSON could 
well be a good idea.

I found a helpful page with diagrams.

http://www.json.org/

Although I hope that a format of recording information about the properties of 
particular uses of Private Use Area characters will become implemented as a 
practicality, and that that format can be applied in practice where desired, 
and indeed I would be happy to participate in a group project, I do not know 
enough about Unicode properties to play a major role or to lead such a project.

William Overington

Friday 31 August 2018

Re: Private Use areas

2018-08-31 Thread Mark E. Shoulson via Unicode


On 08/28/2018 04:26 AM, William_J_G Overington via Unicode wrote:

> Hi
>
> Mark E. Shoulson wrote:
>
> > I'm not sure what the advantage is of using circled characters instead
> > of plain old ascii.
>
> My thinking is that "plain old ascii" might be used in the text encoded
> in the file. Sometimes a file containing Private Use Area characters is
> a mix of regular Unicode Latin characters with just a few Private Use
> Area characters mixed in with them. So my suggestion of using circled
> characters is for disambiguation purposes. The circled characters in the
> PUAINFO sequence would not be displayed if a special software program
> were being used to read in the text file, then act upon the information
> that is encoded using the circled characters.


What if circled characters are used in the text encoded in the file?  
They're characters too, people use them and all.  Whenever you designate 
some characters to be used in a way outside their normal meaning, you 
have the problem of how to use them *with* their normal meaning.  So 
there are various escaping schemes and all.  So in XML, all characters 
have their normal meanings, except <, >, and &, which mean something 
special and change the interpretations of other nearby characters (so 
"bold" is a word in English that appears in the text, but "<bold>" is 
part of an instruction to the renderer that doesn't appear in the 
text.)  And the price is that those three characters have to be 
expressed differently (&lt; &gt; &amp;).  I don't really see what you 
gain by branding some large swath of Unicode ("circled characters") as 
"special" and not meaning their usual selves, and for that matter making 
these hard-to-type characters *necessary* for using your scheme, when 
you could do something like what XML does, and say "everything between < 
and > is to be interpreted specially, and there, these characters have 
the following meanings" and then have some other way of expressing those 
two reserved characters.  (Not saying you need to do it XML's way, but 
something like that: reserve a small number of characters that have to 
be escaped, not some huge chunk.)
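
[Editorial illustration, not part of the original message.] The escaping idea above can be tried out with Python's standard library, which escapes and restores exactly the three reserved XML characters. This is only a sketch; the sample string is made up.

```python
from xml.sax.saxutils import escape, unescape

# Made-up sample text that happens to contain all three reserved characters.
plain = 'if a < b & b > c then "bold"'

# escape() rewrites & as &amp;, < as &lt; and > as &gt;.
# Quotation marks are left alone unless extra entities are requested.
escaped = escape(plain)

# unescape() reverses the mapping, so the original text is recoverable.
restored = unescape(escaped)
```

Only three characters pay the price of escaping; every other character, circled or not, keeps its ordinary meaning.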
  
> My thinking is that using this method just adds some encoded information
> at the start of the text file and does not require the whole document to
> become designated as a file conformant to a particular markup format.


That's another way of saying that this is a markup format which accepts 
a large variety of plain texts.  Because you ARE talking about making a 
"particular markup format," just a different and new one.


I guess there's not even any reason for me to argue the point, though, 
since it is up to you how to design your markup language, and you can 
take advice (or not) from anyone you like.  Draw up some design, find 
some interested people, start a discussion, and work it out.  (but not 
here; this list is for discussing Unicode.)


~mark

Re: Private Use areas

2018-08-31 Thread Mark E. Shoulson via Unicode


On 08/28/2018 11:58 AM, William_J_G Overington via Unicode wrote:

> Asmus Freytag wrote:
>
> > There are situations where an ad-hoc markup language seems to fulfill a
> > need that is not well served by the existing full-fledged markup
> > languages. You find them in internet "bulletin boards" or services like
> > GitHub, where pure plain text is too restrictive but the required text
> > styles purposefully limited - which makes the syntactic overhead of a
> > full-featured mark-up language burdensome.
>
> I am thinking of such an ad-hoc special purpose markup language.
>
> I am thinking of something like a special purpose version of the FORTH
> computer language being used but with no user definitions, no comparison
> operations and no loops and no compiler. Just a straight run through as
> if someone were typing commands into FORTH in interactive mode at a
> keyboard. Maybe no need for spaces between commands. For example,
> circled R might mean use Right-to-left text display.


That starts to sound no longer "ad-hoc", but that is not a well-defined 
term anyway.  You're essentially describing a special-purpose markup 
language or protocol, or perhaps even programming language.  Which is 
quite reasonable; you should (find some other interested people and) 
work out some of the details and start writing up parsers and such.

> I am thinking that there could be three stacks, one for code points and
> one for numbers and one for external reference strings such as for
> accessing a web page or a PDF (Portable Document Format) document or
> listing an International Standard Book Number and so on. Code points
> could be entered by circled H followed by circled hexadecimal characters
> followed by a circled character to indicate Push onto the code point
> stack. Numbers could be entered in base 10, followed by a circled
> character to mean Push onto the number stack. A later circled character
> could mean to take a certain number of code points (maybe just 1, or
> maybe 0) from the character stack and a certain number of numbers (maybe
> just 1, or maybe just 0) from the number stack and use them to set some
> property.
>
> It could all be very lightweight software-wise, just reading the
> characters of the sequence of circled characters and obeying them one by
> one just one time only on a single run through, with just a few, such as
> the circled digits, each having its meaning dependent upon a state
> variable such as, for a circled digit, whether data entry is currently
> hexadecimal or base 10.
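
[Editorial illustration, not part of the original message.] The single-pass stack scheme described above could be sketched roughly as follows. Only "circled H starts hexadecimal entry" and "circled R means right-to-left" come from the thread; the use of circled P and circled N as push commands, the property name "direction", and the omission of the third (reference string) stack are assumptions made purely for this sketch.

```python
def circled(letter):
    """Return the circled capital for an ASCII capital letter (U+24B6..U+24CF)."""
    return chr(0x24B6 + ord(letter) - ord("A"))

# Circled digit zero is U+24EA; circled one..nine are U+2460..U+2468.
DIGITS = {chr(0x24EA): 0}
DIGITS.update({chr(0x2460 + i): i + 1 for i in range(9)})
HEX_LETTERS = {circled(c): 10 + i for i, c in enumerate("ABCDEF")}

def encode(ascii_text):
    """Demo helper: map ASCII capitals and digits to their circled forms."""
    out = []
    for ch in ascii_text:
        if ch == "0":
            out.append(chr(0x24EA))
        elif ch.isdigit():
            out.append(chr(0x2460 + int(ch) - 1))
        else:
            out.append(circled(ch))
    return "".join(out)

def run(sequence):
    """Obey each circled character once, in order, with no loops or branches."""
    code_points, numbers, properties = [], [], {}
    hex_mode, value = False, 0          # state variable: hex or base 10 entry
    for ch in sequence:
        if ch == circled("H"):          # start hexadecimal entry
            hex_mode, value = True, 0
        elif ch in DIGITS:
            value = value * (16 if hex_mode else 10) + DIGITS[ch]
        elif hex_mode and ch in HEX_LETTERS:
            value = value * 16 + HEX_LETTERS[ch]
        elif ch == circled("P"):        # hypothetical: push onto code point stack
            code_points.append(value)
            hex_mode, value = False, 0
        elif ch == circled("N"):        # hypothetical: push onto number stack
            numbers.append(value)
            value = 0
        elif ch == circled("R"):        # take one code point, mark it right-to-left
            properties.setdefault(code_points.pop(), {})["direction"] = "rtl"
        else:
            raise ValueError(f"unknown command U+{ord(ch):04X}")
    return properties, numbers
```

For instance, `run(encode("HE000PR"))` would record U+E000 as right-to-left. Note that the sketch also shows the cost raised in the reply below: any circled character that is not a command is rejected, so such characters cannot appear as ordinary text inside the sequence.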


I still don't see why you're fixated on using circled characters. You're 
already dealing with a markup-language type setup, why not do what other 
markup schemes do?  You reserve three or four characters and use them to 
designate when other characters are not being used in their normal sense 
but are being used as markup.  In XML, when characters are inside '<>' 
tags, they are not "plain text" of the document, but they mean other 
things—perhaps things like "right-to-left" or "reference this web page" 
and so forth, which are exactly the kinds of things you're talking about 
here.  If you don't want to use plain ascii characters because then you 
couldn't express plain ascii in your text, you're left with exactly the 
same problem with circled characters: you can't express circled 
characters in your text.  While that is a smaller problem, it can be 
eliminated altogether by various schemes used by XML or RTF or 
lightweight markup languages.  Reserve a few special characters to give 
meanings to the others, and arrange for ways to escape your handful of 
reserved characters so you can express them.  More straightforward to 
say "you have to escape <, >, and & characters" than to say "you have to 
escape all circled characters."


Anyway, this is clearly a whole new high-level protocol you need (or 
want) to work out, which would *use* Unicode (just like XML and JSON 
do), but doesn't really affect or involve it (Unicode is all about the 
"plain text").  Kind of getting off-topic, but get some people interested 
and start a mailing list to discuss it.  Good luck!


~mark

Re: CLDR (was: Private Use areas)

2018-08-31 Thread Marcel Schneider via Unicode

On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
[…]
> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
> > one couldn’t simply pop them into XML or whatever, as the result would be 
> > disappointing and call for completion in the aftermath. Yet another task 
> > competing with CLDR survey.
> 
> Please elaborate. It's not clear for me what do you mean.

These comments are designed for the Code Charts and as such must not be
disproportionately exhaustive. E.g. we have lists of related languages ending 
in an ellipsis. Once this is popped into XML, i.e. extracted from NamesList.txt
and fed into an extensible and unconstrained format (without any constraint 
on available space, number and length of comments, and so on), any gap 
is felt as discriminatory neglect, and there will be a huge rush to add data.
Yet Unicode hasn’t set up products where that data could be published: not 
in the Code Charts (for the above-mentioned reason), and not in ICU, insofar as 
the additional information involved does not match a known demand on the user 
side (localizing software does not mean providing scholarly, exhaustive 
information about supported characters). The use would be in character pickers 
providing all available information about a given character. That is why 
Unicode is to prioritize CLDR for CLDR users, rather than extra information 
for the web.

> 
> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
> 
> Which XML? where?

More precisely it is LDML, the CLDR-specific XML.
What I called “digest charts” are the charts found here:

http://www.unicode.org/cldr/charts/34/

The access is via this page:

http://cldr.unicode.org/index/downloads

where the charts are in the Charts column, while the raw data is under SVN Tag.

> 
> > and we really 
> > need to go through the data and correct the many many errors, please.
> 
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.

I experienced some trouble too, mainly because "SVN Tag" is a counter-intuitive 
label for access to the XML data (unless one knows about Subversion).
Polish data is found here:

https://www.unicode.org/cldr/charts/34/summary/pl.html

The access is via the top of the "Summary" index page (showing root data):

https://www.unicode.org/cldr/charts/34/summary/root.html

You may wish to particularly check the By-Type charts:

https://www.unicode.org/cldr/charts/34/by_type/index.html

Here I’d suggest to first focus on alphabetic information and on punctuation.

https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html

Under Latin (table caption, without anchor) we find out what punctuation 
Polish has compared to other locales using the same script.
The exact character appears when hovering over the header row.
E.g. U+2011 NON-BREAKING HYPHEN is systematically missing, which is 
an error in almost every locale using the hyphen. The TC is about to correct that.

Further you will see that while Polish is using apostrophe
https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
CLDR does not have the correct apostrophe for Polish, as opposed eg to French.
You may wish to note that from now on, both U+0027 APOSTROPHE and 
U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
preferred characters in publishing are U+2019 and, for Polish, the U+201E and 
U+201D that are already found in CLDR pl.

Note however that according to the information provided by English Wikipedia:
https://en.wikipedia.org/wiki/Quotation_mark#Polish
Polish also uses single quotes, that by contrast are still missing in CLDR.
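
[Editorial illustration, not part of the original message.] A check of this kind can also be scripted against the LDML data rather than read off the charts. The sketch below assumes the punctuation list lives in an `exemplarCharacters` element with `type="punctuation"`; the sample document is made up and is not the actual CLDR Polish data, and the test is a naive substring check rather than a real UnicodeSet parser.

```python
import xml.etree.ElementTree as ET

# Made-up LDML fragment for illustration only (NOT real CLDR data).
SAMPLE_LDML = (
    "<ldml><characters>"
    '<exemplarCharacters type="punctuation">'
    "[\u2010 \u2013 \u2014 , ; ! ? . \u2026 \u201D \u201E \u00AB \u00BB]"
    "</exemplarCharacters>"
    "</characters></ldml>"
)

def punctuation_exemplars(ldml_text):
    """Return the raw punctuation exemplar string, or '' if absent."""
    root = ET.fromstring(ldml_text)
    node = root.find("./characters/exemplarCharacters[@type='punctuation']")
    return node.text if node is not None and node.text else ""

def missing_characters(ldml_text, wanted):
    """List the wanted characters that do not literally occur in the set."""
    present = punctuation_exemplars(ldml_text)
    return [f"U+{ord(c):04X}" for c in wanted if c not in present]

# Characters named in the message: U+2011, U+2019, U+201E, U+201D.
report = missing_characters(SAMPLE_LDML, "\u2011\u2019\u201E\u201D")
```

A substring test is enough for a sketch like this, but real exemplar sets can contain ranges and escapes, so a serious check would need a proper UnicodeSet parser.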

Now you might understand what I meant when pointing out that there are still 
many errors in many languages in CLDR, including in English.

Best regards,

Marcel

> 
> Best regards
> 
> Janusz
> 
> -- 
> , 
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
> 
>

Re: CLDR (was: Private Use areas)

2018-08-31 Thread Manuel Strehl via Unicode

The XML files in these folders:

https://unicode.org/repos/cldr/tags/latest/common/

But I agree. I spent an extreme amount of time getting somewhat used to
cldr.unicode.org and the data repo, and still I have no clue
where to find a concrete piece of information without digging into the
site.

On Fri, 31 Aug 2018 at 07:22, Janusz S. Bień via Unicode wrote:
>
> On Thu, Aug 30 2018 at  2:27 +0200, unicode@unicode.org writes:
>
> [...]
>
> > Given NamesList.txt / Code Charts comments are kept minimal by design,
> > one couldn’t simply pop them into XML or whatever, as the result would be
> > disappointing and call for completion in the aftermath. Yet another task
> > competing with CLDR survey.
>
> Please elaborate. It's not clear for me what do you mean.
>
> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
>
> Which XML? where?
>
> > and we really
> > need to go through the data and correct the many many errors, please.
>
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.
>
> Best regards
>
> Janusz
>
> --
>  ,
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
>

CLDR (was: Private Use areas)

2018-08-30 Thread Janusz S. Bień via Unicode

On Thu, Aug 30 2018 at  2:27 +0200, unicode@unicode.org writes:

[...]

> Given NamesList.txt / Code Charts comments are kept minimal by design, 
> one couldn’t simply pop them into XML or whatever, as the result would be 
> disappointing and call for completion in the aftermath. Yet another task 
> competing with CLDR survey.

Please elaborate. It's not clear for me what do you mean.

> Reviewing CLDR data is IMO top priority.
> There are many flaws to be fixed in many languages including in English.
> A lot of useful digest charts are extracted from XML there,

Which XML? where?

> and we really 
> need to go through the data and correct the many many errors, please.

Some time ago I tried to have a close look at the Polish locale and
found the CLDR site prohibitively confusing.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-30 Thread Wordingham Richard via Unicode

> On 29 August 2018 at 06:47 "Janusz S. Bień via Unicode" wrote:
>
> > Storing this information in a font, by hook or crook, would lock users
> > of those PUA characters into that font. At that rate, you might as well
> > use ASCII-hacked fonts, as we did 25 years ago.

I don't see that at all.  The obvious way in the sfnt format, used by OpenType, 
is as a table consisting entirely of the XML file.  It is quite easy to add a 
table to an unsigned sfnt font, and even easier to extract a table consisting 
entirely of UTF-8 text, though ASCII would be even easier, from a font file.
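
[Editorial illustration, not part of the original message.] Extracting a table from an sfnt file needs little more than reading the table directory: a 12-byte header followed by 16-byte records (tag, checksum, offset, length). The sketch below assumes a hypothetical table tag `PUAX` holding XML text; no such tag is defined anywhere, and the demo "font" it builds contains only that one table with a dummy checksum, so it is not a usable font.

```python
import struct

def extract_table(font_bytes, tag):
    """Return the raw bytes of the table with the given 4-byte tag, or None."""
    # Header: sfntVersion (uint32), numTables (uint16), then three uint16 fields.
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    for i in range(num_tables):
        # Each directory record: tag, checksum, offset, length.
        rec_tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        if rec_tag == tag:
            return font_bytes[offset:offset + length]
    return None

def build_demo_container(tag, payload):
    """Build a one-table sfnt-shaped byte string for the demo (dummy checksum)."""
    header = struct.pack(">IHHHH", 0x00010000, 1, 16, 0, 0)
    record = struct.pack(">4sIII", tag, 0, 12 + 16, len(payload))
    return header + record + payload

# Hypothetical tag and made-up XML payload.
xml_payload = '<pua><char cp="E000" bc="R"/></pua>'.encode("utf-8")
demo = build_demo_container(b"PUAX", xml_payload)
recovered = extract_table(demo, b"PUAX")
```

Adding a table to a real font additionally means recomputing checksums and offsets, which is better left to an existing font library.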

> Storing the information in a font is inappropriate not only for
> the technical reasons, as I wrote recently (on Thu, Aug 23 2018)
>
> > Fonts are for *rendering*, new characters and variants are more and
> > more often needed for *input* of real life old texts with sufficient
> > precision.

1. There are existing methods of associating a font with a text.  Not using a 
font needs a new scheme for associating a set of PUA properties with a portion 
of a file.  The font also serves as a code chart.  It can also hold information 
on how characters combine, which is notoriously beyond the capability of code 
charts.

2. Registries can vanish.

3. In practice, a file needs to retain an association with a specialist font.  
Preserving the font should preserve its content, but there are pruning 
techniques (e.g. WOFF2) that may remove this content.

Richard.

Re: Private Use areas

2018-08-29 Thread Marcel Schneider via Unicode

On 29/08/18 07:55, Janusz S. Bień via Unicode wrote:
> 
> On Tue, Aug 28 2018 at 9:43 -0700, unicode@unicode.org writes:
> > On August 23, 2011, Asmus Freytag wrote:
> >
> >> On 8/23/2011 7:22 AM, Doug Ewell wrote:
> >>> Of all applications, a word processor or DTP application would want
> >>> to know more about the properties of characters than just whether
> >>> they are RTL. Line breaking, word breaking, and case mapping come to
> >>> mind.
> >>>
> >>> I would think the format used by standard UCD files, or the XML
> >>> equivalent, would be preferable to making one up:
[…]
> >>
> >> The right answer would follow the XML format of the UCD.
> >>
> >> That's the only format that allows all necessary information contained
> >> in one file,
> 
> For me necessary are also comments and crossreferences contained in
> NamesList.txt. Do I understand correctly that only "ISO Comment
> properties" are included in the file?

Even that comment field is obsoleted. But it’s unclear to me what exactly 
it was providing from ISO.

> 
> >> and it would leverage of any effort that users of the
> >> main UCD have made in parsing the XML format.
> >>
> >> An XML format should also be flexible in that you can add/remove not
> >> just characters, but properties as needed.
> >>
> >> The worst thing to do, other than designing something from scratch,
> >> would be to replicate the UnicodeData.txt layout with its random, but
> >> fixed collection of properties and insanely many semi-colons. None of
> >> the existing UCD txt files carries all the needed data in a single
> >> file.

Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible.
I never wondered why the header line is missing, probably because compared
to the other UCD files, the file looks really odd without a file header showing 
at least the version number and datestamp. It’s like the file was made up for 
dumb parsers unable to handle comment delimiters, and never to be upgraded
to do so.

But I like the format, and that’s why at some point I submitted feedback asking 
for an extension. Indeed we could use more information than what is yielded 
by the UCD minus NamesList.txt (which we may not parse, as per its file header).
Given NamesList.txt / Code Charts comments are kept minimal by design, 
one couldn’t simply pop them into XML or whatever, as the result would be 
disappointing and call for completion in the aftermath. Yet another task 
competing with CLDR survey. Reviewing CLDR data is IMO top priority.
There are many flaws to be fixed in many languages including in English.
A lot of useful digest charts are extracted from XML there, and we really 
need to go through the data and correct the many many errors, please.

Unlike XML, human readability of CSV may not be immediate. Yes you simply 
cannot always count the semicolons and remember the property name from 
the value position if it isn’t obvious by itself. But we use spreadsheets. At 
least some people do. That’s where the magic works. 

Looking up things in a spreadsheet is a good way to find out about wrong 
property values. Looks like handling files only programmatically gets
everything screwed up.
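
[Editorial illustration, not part of the original message.] The "count the semicolons" problem goes away once the fifteen positional fields of UnicodeData.txt are given names, which is roughly what a spreadsheet header row does. In this sketch the column labels for the split fields (decomposition and the three numeric columns) are informal names chosen here; the sample line is the entry for U+0041.

```python
import csv
import io

# Informal labels for the 15 positional fields of UnicodeData.txt.
FIELDS = [
    "Code_Point", "Name", "General_Category", "Canonical_Combining_Class",
    "Bidi_Class", "Decomposition", "Decimal_Digit", "Digit", "Numeric",
    "Bidi_Mirrored", "Unicode_1_Name", "ISO_Comment",
    "Simple_Uppercase_Mapping", "Simple_Lowercase_Mapping",
    "Simple_Titlecase_Mapping",
]

SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"

def read_unicode_data(text):
    """Yield one dict per line, keyed by field name instead of position."""
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=FIELDS,
        delimiter=";",
        quoting=csv.QUOTE_NONE,  # the file uses no quoting convention
    )
    return list(reader)

rows = read_unicode_data(SAMPLE)
first = rows[0]
```

With named fields, `first["Simple_Lowercase_Mapping"]` reads as what it is, instead of "the value after the thirteenth semicolon".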

Marcel

Re: Private Use areas - Vertical Text

2018-08-29 Thread WORDINGHAM RICHARD via Unicode

> On 29 August 2018 at 13:05 Andrew West via Unicode wrote:
>
> I tested with Word 2007, and normal PUA characters from my font were
> displayed with vertical orientation in a vertical text box, but Plane
> 15 PUA characters were rotated.

And then the original question is whether a font can suppress this rotation.  
For example, it is entirely possible that the rotation could be eliminated by 
the vrt2 OpenType feature mapping a Zhuang PUA glyph to an identical glyph.

Richard.

Re: Private Use areas - Vertical Text

2018-08-29 Thread Andrew West via Unicode

On Wed, 29 Aug 2018 at 11:18,  wrote:
>
> I was using a change horizontal to vertical text feature in office, the
> PUA characters being from plane 15.

I tested with Word 2007, and normal PUA characters from my font were
displayed with vertical orientation in a vertical text box, but Plane
15 PUA characters were rotated.

I also tested with Word 2016, and both normal PUA characters and Plane
15 PUA characters were displayed with vertical orientation in a
vertical text box, as you want, although there were vertical spacing
issues with the Plane 15 PUA characters which suggest that the
vertical metrics tables (vhea and vmtx) in the font are not being
applied for Plane 15 characters (or it could be a problem with my
font).
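
[Editorial illustration, not part of the original message.] The distinction drawn above between "normal" PUA characters and Plane 15 PUA characters corresponds to the three Private Use ranges. A small sketch for telling them apart when preparing test text; the labels are informal descriptions chosen here.

```python
import unicodedata

# The three Private Use ranges (inclusive), with informal labels.
PUA_RANGES = [
    (0xE000, 0xF8FF, "BMP Private Use Area"),
    (0xF0000, 0xFFFFD, "Plane 15 Supplementary Private Use Area-A"),
    (0x100000, 0x10FFFD, "Plane 16 Supplementary Private Use Area-B"),
]

def pua_area(ch):
    """Return a label for the Private Use range containing ch, or None."""
    cp = ord(ch)
    for first, last, label in PUA_RANGES:
        if first <= cp <= last:
            return label
    return None

def split_by_area(text):
    """Group the PUA characters of a string by the range they belong to."""
    groups = {}
    for ch in text:
        label = pua_area(ch)
        if label is not None:
            groups.setdefault(label, []).append(f"U+{ord(ch):04X}")
    return groups

# All three ranges share General_Category Co, so the category alone
# cannot distinguish BMP from Plane 15 characters.
same_category = (
    unicodedata.category("\uE000") == unicodedata.category("\U000F0000") == "Co"
)
```

This only classifies code points; it says nothing about why a given version of Word treats the two groups differently.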

Andrew

Re: Private Use areas - Vertical Text

2018-08-29 Thread via Unicode


Dear Andrew,

I was using a change horizontal to vertical text feature in office, the 
PUA characters being from plane 15.


Regards
John

On 2018-08-29 16:32, Andrew West via Unicode wrote:

> On Wed, 29 Aug 2018 at 05:07, via Unicode wrote:
>
> > Yes, as Richard says when CJK Zhuang text is displayed vertically whilst
> > the Zhuang characters in Unicode remain upright, but those with PUA
> > codepoints are rotated 90°.
>
> John, you did not explain by what mechanism you were trying to display
> vertical PUA Zhuang text.
>
> I can display vertically-oriented PUA-encoded CJKVZ ideographs in
> vertical layout in web pages using CSS, as demonstrated in this test
> page:
>
> http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html
>
> The PUA characters display with correct orientation under Windows 10
> on the Edge, Chrome and Firefox browsers. The test page only fails
> under IE, but we are not meant to use IE anymore anyway.
>
> Andrew

Re: Private Use areas - Vertical Text

2018-08-29 Thread Andrew West via Unicode

On Tue, 28 Aug 2018 at 18:15, WORDINGHAM RICHARD via Unicode
 wrote:
>
> Unicode is doing what it can in this matter:
>
> (a) Zhuang PUA characters are being made individually obsolete.

Not by a nebulous entity called "Unicode", or even by the Unicode
Consortium per se, but by the hard work over many years by individual
experts such as John Knightley.

Andrew

Re: Private Use areas - Vertical Text

2018-08-29 Thread James Kass via Unicode

John Knightley wrote,

> Yes, as Richard says when CJK Zhuang text is displayed
> vertically whilst the Zhuang characters in Unicode remain
> upright, but those with PUA codepoints are rotated 90°.
> This is because the PUA characters are treated like English
> text, which are correctly rotated 90°. ...
>
> ...
> ... the need for PUA Zhuang characters remains, and will
> so for decades to come.

A possible work-around would be to have two fonts for PUA Zhuang, one
for horizontal text and one for vertical.  The one for the vertical
text would have the glyphs in the font pre-rotated 90° anti-clockwise.
This would require font switching when switching from horizontal to
vertical layout, of course.

Re: Private Use areas

2018-08-28 Thread Janusz S. Bień via Unicode

On Tue, Aug 28 2018 at  9:43 -0700, unicode@unicode.org writes:
> On August 23, 2011, Asmus Freytag wrote:
>
>> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>>> Of all applications, a word processor or DTP application would want
>>> to know more about the properties of characters than just whether
>>> they are RTL. Line breaking, word breaking, and case mapping come to
>>> mind.
>>>
>>> I would think the format used by standard UCD files, or the XML
>>> equivalent, would be preferable to making one up:

Right. I was not so quick to state this so early, but 2 years ago I
wrote to the MUFI list:


--8<---cut here---start->8---
On Sat, Jan 02 2016 at 12:35 CET, odd.hau...@uib.no writes:

[...]

> Note the permanent URI at the University Library in Bergen. This will
> in all likelihood be the last recommendation of its kind (and
> certainly the last edited by the undersigned), so please look out for
> new solutions (databases or the like) on the MUFI web site!

I think that one of the forms, perhaps even the primary one, should
follow the original Unicode Character Database and the
output of Unibook (http://www.unicode.org/unibook/).

The idea can be tested by converting the present recommendation to this
form. Unfortunately I'm unable to contribute myself to this task.

One of the advantages would be that the various character browsers can
be adapted relatively easily to provide info about the MUFI characters.

A simpler variant of this idea is to use a Unibook-like format to
document fonts. Quick-and-dirty tools for this purpose have been
prepared by a student of mine:

https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/
https://bitbucket.org/jsbien/unicode-ucd-parser

A sample output of the tools is available at

https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf

(the font is also quick-and-dirty and unfinished work).

--8<---cut here---end--->8---

Unfortunately there was no reaction.

>>
>> The right answer would follow the XML format of the UCD.
>>
>> That's the only format that allows all necessary information contained
>> in one file,

For me necessary are also comments and crossreferences contained in
NamesList.txt. Do I understand correctly that only "ISO Comment
properties" are included in the file?

>> and it would leverage of any effort that users of the
>> main UCD have made in parsing the XML format.
>>
>> An XML format should also be flexible in that you can add/remove not
>> just characters, but properties as needed.
>>
>> The worst thing to do, other than designing something from scratch,
>> would be to replicate the UnicodeData.txt layout with its random, but
>> fixed collection of properties and insanely many semi-colons. None of
>> the existing UCD txt files carries all the needed data in a single
>> file.
>
> I don't know if or how I responded 7 years ago, but at least today, I
> think this is an excellent suggestion.
>
> If the goal is to encourage vendors to support PUA assignments, using an
> exceedingly well-defined format (UAX #42) sitting atop one of the most
> widely used base formats ever (XML), with all property information in a
> single repository (per PUA scheme), would be great encouragement.

I think we also need the data in a format accepted by UniBook.
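
[Editorial illustration, not part of the original message.] To make the UAX #42 suggestion quoted above concrete, here is a sketch of what a per-scheme property file and a reader for it might look like. The namespace, the `char` element and the short attribute names (`cp`, `na`, `gc`, `bc`, `lb`) follow the UCD XML conventions; the two characters and their property values are invented for illustration.

```python
import xml.etree.ElementTree as ET

UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Invented PUA assignments, for illustration only.
SAMPLE = f"""<ucd xmlns="{UCD_NS}">
  <description>Hypothetical PUA scheme (illustration only)</description>
  <repertoire>
    <char cp="E000" na="EXAMPLE PRIVATE LETTER ONE" gc="Lo" bc="R" lb="AL"/>
    <char cp="E001" na="EXAMPLE PRIVATE MARK" gc="Mn" bc="NSM" lb="CM"/>
  </repertoire>
</ucd>"""

def load_properties(xml_text):
    """Map each code point to a dict of its declared property values."""
    root = ET.fromstring(xml_text)
    table = {}
    for char in root.iter(f"{{{UCD_NS}}}char"):
        attrs = dict(char.attrib)
        cp = int(attrs.pop("cp"), 16)   # code point is given in hexadecimal
        table[cp] = attrs
    return table

table = load_properties(SAMPLE)
```

Because properties are plain attributes, a scheme can add or omit them per character, which is the flexibility argued for in the quoted message.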

> I've devised lots of novel file formats and I think this is one use
> case where that would be a real hindrance.

> Storing this information in a font, by hook or crook, would lock users
> of those PUA characters into that font. At that rate, you might as well
> use ASCII-hacked fonts, as we did 25 years ago.

Storing the information in a font is inappropriate not only for the
technical reasons, as I wrote recently (on Thu, Aug 23 2018)

> Fonts are for *rendering*, new characters and variants are more and
> more often needed for *input* of real life old texts with sufficient
> precision.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

RE: Private Use areas - Vertical Text

2018-08-28 Thread via Unicode


Dear Richard and Peter,

apologies for the lack of clarity. Let me try to explain below.

On 2018-08-29 01:13, WORDINGHAM RICHARD via Unicode wrote:

> On 27 August 2018 at 15:22 Peter Constable via Unicode wrote:
>
> > Layout engines that support CJK vertical layout do not rely on the
> > 'vert' feature to rotate glyphs for CJK ideographs, but rather
> > rotate the glyph 90° and switch to using vertical glyph metrics.
> > The 'vert' feature is used to substitute vertical alternate glyphs
> > as needed, such as for punctuation that isn't automatically rotated
> > (and would probably need a differently-positioned alternate in any
> > case).
> >
> > Cf. UAX 50.
>
> There have been some pretty confused statements. I believe the
> observed problem is that PUA characters for Zhuang CJK ideographs get
> rotated when displayed vertically rather than left-to-right.

Yes, as Richard says, when CJK Zhuang text is displayed vertically, the 
Zhuang characters in Unicode remain upright, but those with PUA 
codepoints are rotated 90°. This is because the PUA characters are 
treated like English text, which is correctly rotated 90°. The 
orientation of the CJK characters in this case appears to depend on 
which block they belong to. As Peter points out, this does not seem to 
match UAX 50.

> Unicode is doing what it can in this matter:
>
> (a) Zhuang PUA characters are being made individually obsolete.

Yes and no. Whilst a thousand Zhuang characters have been encoded and 
two thousand have been submitted via IRG, the number of PUA Zhuang 
characters is about the same or increasing. In 2006, at the outset, 
just under 6k PUA points were used; presently there are over 8k, over 6k 
of which have not been submitted, and the earliest any future 
submissions can be encoded is 2026. That being said, the number of more 
common Zhuang characters needing PUA support is coming down. So whilst 
individual characters are being resolved, the need for PUA Zhuang 
characters remains, and will do so for decades to come.

> (b) By default, PUA characters have the value of
> Vertical_orientation=upright as do CJK ideographs.

Noted above.

Regards
John

> For CJK ideographs, it is not clear to me when the vert feature (if
> present) would be applied.  Is it only for some codepoints (vo=tu), or
> is it for all that the engine expects to be displayed 'upright' in
> vertical text?  The vrtr feature (if present) would be applied when
> glyphs are to be rotated.  Is it for all such glyphs, or only those
> for which rotation is expected to be inadequate (vo=tr)?  It seems
> that feature vrt2 is to be applied to all glyphs; perhaps rotation is
> the default behaviour when there is no look-up value for a glyph that
> the engine expects to be rotated.  The truly difficult case would be
> when there is no attempt to apply a look-up - possibly vrtr would not
> apply to \p{vo=r}.
>
> I would expect that defining the lookup vrt2 or vrtr to map Zhuang
> glyphs to themselves (or something prerotated) would cure the problem.
> This would not work for sequences of Zhuang ideographs treated as RTL
> text - but that is unlikely to happen.
>
> Richard.

RE: Private Use areas - Vertical Text

2018-08-28 Thread WORDINGHAM RICHARD via Unicode

> 
> On 27 August 2018 at 15:22 Peter Constable via Unicode 
>  wrote:
> 
> Layout engines that support CJK vertical layout do not rely on the 'vert' 
> feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° 
> and switch to using vertical glyph metrics. The 'vert' feature is used to 
> substitute vertical alternate glyphs as needed, such as for punctuation that 
> isn't automatically rotated (and would probably need a differently-positioned 
> alternate in any case).
> 
> Cf. UAX 50.
> 

There have been some pretty confused statements. I believe the observed problem 
is that PUA characters for Zhuang CJK ideographs get rotated when displayed 
vertically rather than left-to-right.

Unicode is doing what it can in this matter:

(a) Zhuang PUA characters are being made individually obsolete.

(b) By default, PUA characters have the value of Vertical_orientation=upright 
as do CJK ideographs.

For CJK ideographs, it is not clear to me when the vert feature (if present) 
would be applied.  Is it only for some codepoints (vo=tu), or is it for all 
that the engine expects to be displayed ‘upright’ in vertical text?  The vrtr 
feature (if present) would be applied when glyphs are to be rotated.  Is it for 
all such glyphs, or only those for which rotation is expected to be inadequate 
(vo=tr)?  It seems that feature vrt2 is to be applied to all glyphs; perhaps 
rotation is the default behaviour when there is no look-up value for a glyph 
that the engine expects to be rotated.  The truly difficult case would be when 
there is no attempt to apply a look-up – possibly vrtr would not apply to 
/p{vo=r}.

I would expect that defining the lookup vrt2 or vrtr to map Zhuang glyphs to 
themselves (or something prerotated) would cure the problem.  This would not 
work for sequences of Zhuang ideographs treated as RTL text - but that is 
unlikely to happen.

Richard.

Re: Private Use areas

2018-08-28 Thread Doug Ewell via Unicode

On August 23, 2011, Asmus Freytag wrote:

> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>> Of all applications, a word processor or DTP application would want
>> to know more about the properties of characters than just whether
>> they are RTL. Line breaking, word breaking, and case mapping come to
>> mind.
>>
>> I would think the format used by standard UCD files, or the XML
>> equivalent, would be preferable to making one up:
>
> The right answer would follow the XML format of the UCD.
>
> That's the only format that allows all necessary information contained
> in one file, and it would leverage of any effort that users of the
> main UCD have made in parsing the XML format.
>
> An XML format should also be flexible in that you can add/remove not
> just characters, but properties as needed.
>
> The worst thing to do, other than designing something from scratch,
> would be to replicate the UnicodeData.txt layout with its random, but
> fixed collection of properties and insanely many semi-colons. None of
> the existing UCD txt files carries all the needed data in a single
> file.

I don't know if or how I responded 7 years ago, but at least today, I
think this is an excellent suggestion.

If the goal is to encourage vendors to support PUA assignments, using an
exceedingly well-defined format (UAX #42) sitting atop one of the most
widely used base formats ever (XML), with all property information in a
single repository (per PUA scheme), would be great encouragement. I've
devised lots of novel file formats and I think this is one use case
where that would be a real hindrance.

Storing this information in a font, by hook or crook, would lock users
of those PUA characters into that font. At that rate, you might as well
use ASCII-hacked fonts, as we did 25 years ago.
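
To make the suggestion concrete, here is a minimal Python sketch of reading PUA properties from a file modelled on the UCD XML format of UAX #42. The sample data, the character names and the property values are invented for illustration, and the function name load_pua_properties is hypothetical; only single-code-point `char` elements are handled, not ranges.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UCD in XML (UAX #42).
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Invented sample: two PUA code points with a handful of properties.
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <description>Hypothetical PUA scheme</description>
  <repertoire>
    <char cp="E000" na="EXAMPLE LETTER ONE" gc="Lo" bc="R" lb="AL"/>
    <char cp="E001" na="EXAMPLE COMBINING MARK" gc="Mn" bc="NSM" lb="CM"/>
  </repertoire>
</ucd>"""


def load_pua_properties(xml_text):
    """Return {code point: {property abbreviation: value}}."""
    root = ET.fromstring(xml_text)
    table = {}
    for char in root.iter(UCD_NS + "char"):
        cp = char.get("cp")
        if cp is None:
            continue  # first-cp/last-cp ranges are ignored in this sketch
        table[int(cp, 16)] = {k: v for k, v in char.attrib.items() if k != "cp"}
    return table


props = load_pua_properties(SAMPLE)
print(props[0xE000]["bc"])  # bidi class declared for U+E000 in the sample
```

Because properties are plain attributes, a scheme can add or omit them per character without changing the parser, which is the flexibility argued for above.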

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Private Use areas

2018-08-28 Thread William_J_G Overington via Unicode

Asmus Freytag wrote:

> There are situations where an ad-hoc markup language seems to fulfill a need 
> that is not well served by the existing full-fledged markup languages. You 
> find them in internet "bulletin boards" or services like GitHub, where pure 
> plain text is too restrictive but the required text styles purposefully 
> limited - which makes the syntactic overhead of a full-featured mark-up 
> language burdensome.

I am thinking of such an ad-hoc special purpose markup language.

I am thinking of something like a special purpose version of the FORTH computer 
language being used but with no user definitions, no comparison operations and 
no loops and no compiler. Just a straight run through as if someone were typing 
commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces 
between commands. For example, circled R might mean use Right-to-left text 
display.

I am thinking that there could be three stacks, one for code points and one for 
numbers and one for external reference strings such as for accessing a web page 
or a PDF (Portable Document Format) document or listing an International 
Standard Book Number and so on. Code points could be entered by circled H 
followed by circled hexadecimal characters followed by a circled character to 
indicate Push onto the code point stack. Numbers could be entered in base 10, 
followed by a circled character to mean Push onto the number stack. A later 
circled character could mean to take a certain number of code points (maybe 
just 1, or maybe 0) from the character stack and a certain number of numbers 
(maybe just 1, or maybe just 0) from the number stack and use them to set some 
property.

It could all be very lightweight software-wise, just reading the characters of 
the sequence of circled characters and obeying them one by one just one time 
only on a single run through, with just a few, such as the circled digits, each 
having its meaning dependent upon a state variable such as, for a circled 
digit, whether data entry is currently hexadecimal or base 10.

I am wondering how many PUA property variables there would need to be set for 
the system to be useful.

The sequence could start with all of those PUA property values set at their 
default values so only those that needed changing need be explicitly set, 
though others could be explicitly set to the default values if a record were 
desired. 
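
A single-pass interpreter of this kind could look roughly like the Python sketch below. The command letters are assumptions made up for the example (circled H for hexadecimal entry, circled P for push onto the code point stack, circled N for push onto the number stack, circled R for right-to-left); only the circled R meaning is suggested above, and the third stack for external reference strings is left out.

```python
def circled_digit(ch):
    """Value of a circled digit 0-9, or None."""
    if ch == "\u24EA":                 # CIRCLED DIGIT ZERO
        return 0
    if "\u2460" <= ch <= "\u2468":     # CIRCLED DIGIT ONE .. NINE
        return ord(ch) - 0x2460 + 1
    return None


def circled_capital(ch):
    """ASCII capital for a circled Latin capital letter, or None."""
    if "\u24B6" <= ch <= "\u24CF":
        return chr(ord(ch) - 0x24B6 + ord("A"))
    return None


def run(program):
    """Obey each circled character once, in order; no loops, no definitions."""
    code_points, numbers, properties = [], [], {}
    mode, acc = "dec", ""              # state variable: current data entry base
    for ch in program:
        digit, letter = circled_digit(ch), circled_capital(ch)
        if digit is not None:
            acc += str(digit)
        elif mode == "hex" and letter is not None and letter in "ABCDEF":
            acc += letter
        elif letter == "H":            # hypothetical: begin hexadecimal entry
            mode, acc = "hex", ""
        elif letter == "P":            # hypothetical: push code point
            code_points.append(int(acc, 16))
            mode, acc = "dec", ""
        elif letter == "N":            # hypothetical: push base-10 number
            numbers.append(int(acc, 10))
            acc = ""
        elif letter == "R":            # set right-to-left on the top code point
            properties.setdefault(code_points.pop(), {})["direction"] = "RTL"
        else:
            raise ValueError("unknown command: %r" % ch)
    return properties, code_points, numbers


# Circled H, E, 0, 0, 0, P, R: push U+E000, then mark it right-to-left.
PROGRAM = "\u24BD\u24BA\u24EA\u24EA\u24EA\u24C5\u24C7"
print(run(PROGRAM)[0])
```

Properties that are never mentioned simply stay absent from the result, which corresponds to leaving them at their default values.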

William Overington

Tuesday 28 August 2018

Re: Private Use areas

2018-08-28 Thread William_J_G Overington via Unicode

James Kass wrote:

> Non-conformant?  Well, it's probably overkill anyway.  A simpler method of 
> identifying which PUA convention is being used for a file would be to either 
> have the first line of the file being something like [PUA1] or to have the 
> file name be something like MYFILE.TXTPUA1.  Where "PUA1" equals the CSUR.  
> Other numbers (PUA2, PUA3, etc.) for other PUA conventions.

The problem that then arises is that a registry is needed for what those 
numbers mean, such as PUA01728. So what if someone writes explaining his 
designs for glyphs for the language of the people who live in the northern part 
of the fifth planet from the sun in the science fiction novel he is writing? Is 
registration granted instantly upon request or is there a threshold of some 
sort? What if lots of people do that, including some people wanting a registry 
code number for the various emoji that they want? If there is a threshold of 
proving usage and so on, or of showing that the designs have been produced AT a 
business or AT a college or whatever, then the system will only work for some 
users of the Private Use Areas.

My opinion is that the system needs to be free-standing, with each usage 
possibly self-contained or with an external reference to a document that is 
available. Care would need to be taken to send a copy of any such document to 
deposit libraries such as The British Library so as to ensure long-term 
conservation.

William Overington

Tuesday 28 August 2018

Re: Private Use areas

2018-08-28 Thread William_J_G Overington via Unicode

Hi

Mark E. Shoulson wrote:

> I'm not sure what the advantage is of using circled characters instead of 
> plain old ascii.

My thinking is that "plain old ascii" might be used in the text encoded in the 
file. Sometimes a file containing Private Use Area characters is a mix of 
regular Unicode Latin characters with just a few Private Use Area characters 
mixed in with them. So my suggestion of using circled characters is for 
disambiguation purposes. The circled characters in the PUAINFO sequence would 
not be displayed if a special software program were being used to read in the 
text file, then act upon the information that is encoded using the circled 
characters.

My thinking is that using this method just adds some encoded information at the 
start of the text file and does not require the whole document to become 
designated as a file conformant to a particular markup format.

William Overington

Tuesday 28 August 2018

Re: Private Use areas

2018-08-28 Thread Asmus Freytag via Unicode


  
  
On 8/27/2018 2:20 PM, Rebecca Bettencourt via Unicode wrote:

>>> That sounds like a non-conformant use of characters in the U+24xx block.
>>
>> Well, you are an expert on these things and I do not understand as to
>> with what it would be non-conformant.
>
> A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ
> and not as a signal to process what follows as anything other than
> plain text.

Not correct.

If that was literally true, then all HTML, XML, CSS, C, C#, Java,
Python source code files and their compilers would be non-conformant.

It's more like, "if a process treats a sequence of bytes as Unicode
plain text, then the bytes corresponding to the codes assigned to
ⓅⓊⒶⒹⒶⓉⒶ just stand for ⓅⓊⒶⒹⒶⓉⒶ. Any meaning is imparted by the
(human) reader."

However, if the process treats the file as a source file in a markup
language, there's nothing that prevents it from assigning particular
interpretations to ⓅⓊⒶⒹⒶⓉⒶ, including, but not limited to not
displaying these code points as characters.

The interpretation of the remainder of the file may well be conformant
to the Unicode Standard, just as the display of the contents of many
HTML elements is usually conformant to the Unicode Standard.

> What you are proposing is a higher-level protocol, whether you realize
> it or not.

Correct, the rub here is that all these schemes that treat characters
as both syntax and text depending on context amount to mark-up
languages and are therefore ipso facto no longer plain text (except if
displayed as source code, but already applying syntax coloring would no
longer be purely treating the data as plain text).

In-band markup has thus a dual nature as plain text and rich text,
depending on how it is processed.

> Unfortunately your higher-level protocol has a serious flaw in that it
> cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ".

That could probably be remedied by the usual techniques.

> Also, seeing a bunch of circled alphanumeric characters in a document
> ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ.

:)

> There are plenty of already-existing higher-level protocols (you
> mentioned one: XML) that could be used to provide information about
> PUA characters, and they are all much better suited to that purpose
> than what you are proposing.

There are situations where an ad-hoc markup language seems to fulfill a
need that is not well served by the existing full-fledged markup
languages. You find them in internet "bulletin boards" or services like
GitHub, where pure plain text is too restrictive but the required text
styles purposefully limited - which makes the syntactic overhead of a
full-featured mark-up language burdensome.

Too bad that there's been no "winner" among these, and therefore no
universally accepted one. If so, it might have presented an obvious
target for a PUA extension.

A./

Re: Private Use areas

2018-08-27 Thread Mark E. Shoulson via Unicode

But there's nothing wrong with proposing a higher-level protocol; 
indeed, that's what Ken Whistler was saying: you need a protocol to 
transmit this information.  It's metadata, so it will perforce be a 
higher-level protocol of some kind, whether transmitting actually 
out-of-band or reserving a piece of the file for metadata.  That's 
fine.  I'm not sure what the advantage is of using circled characters 
instead of plain old ascii.  You have to set off your reserved area 
somehow, and I don't think using circled chars is the least obtrusive 
way to do it.  You could use XML; that would be pretty well-suited to 
the task, but maybe it's overkill.  If all you need is to reference some 
"standard" PUA interpretation (per James Kass' take on this, not William 
Overington's), then just a header like "[PUA1]" would work just 
fine.  (Compare emacs with things like "-*- encoding: utf-8 -*-" or 
whatever.)
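
A header of that kind is easy to detect. The following is a minimal Python sketch, assuming the header is the whole first line and has exactly the form "[PUA" + digits + "]"; the function name split_pua_header is hypothetical, and what each number means would still have to be agreed elsewhere.

```python
import re

# First line must be exactly "[PUA<number>]" (assumed form of the header).
HEADER = re.compile(r"^\[PUA(\d+)\]\s*$")


def split_pua_header(text):
    """Return (convention number or None, remaining text)."""
    first, _, rest = text.partition("\n")
    match = HEADER.match(first.rstrip("\r"))
    if match is None:
        return None, text          # no header: the file is ordinary text
    return int(match.group(1)), rest


number, body = split_pua_header("[PUA1]\nSome text with PUA characters\n")
print(number, repr(body))
```

An application that does not know the protocol just shows one short extra line at the top of the file.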


For larger chunks of meta-info, XML might be a good choice, but even 
then, it could be an XML *header* to an otherwise ordinary text file.  
Yes, you'd have to delimit it somehow, and probably have a top header (a 
"magic number") to signal the protocol, but that's doable.  For 
applications not supporting this protocol, such a setup is probably 
easier for the eye to skip past (even if it's long) than a bunch of 
circled letters.


A protocol like that is outside of Unicode's scope (just like XML is), 
but it's certainly something you could write up and try to standardize 
and get used, with or without the support of ISO. People are coming up 
with file formats all the time (and if you really want to use circled 
characters, go ahead.  That's something for you to consider in the 
design phase of the project).


~mark


On 08/27/2018 05:20 PM, Rebecca Bettencourt via Unicode wrote:

>>> That sounds like a non-conformant use of characters in the U+24xx block.
>>
>> Well, you are an expert on these things and I do not understand as to
>> with what it would be non-conformant.
>
> A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ 
> and not as a signal to process what follows as anything other than 
> plain text.
>
> What you are proposing is a higher-level protocol, whether you realize 
> it or not. Unfortunately your higher-level protocol has a serious flaw 
> in that it cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ". Also, seeing a bunch 
> of circled alphanumeric characters in a document ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ.
>
> There are plenty of already-existing higher-level protocols (you 
> mentioned one: XML) that could be used to provide information about 
> PUA characters, and they are all much better suited to that purpose 
> than what you are proposing.

Re: Private Use areas

2018-08-27 Thread Mark E. Shoulson via Unicode


On 08/27/2018 05:18 PM, James Kass via Unicode wrote:

> William Overington wrote,
>
> On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington
>  wrote:
>
>> Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters
>> U+24B6 .. U+24E9.
>>
>> Use U+2473 as if it were a circled space.
>
> ⓌⒽⓎ◯ⓃⓄⓉ◯ⓊⓈⒺ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ◯
> ⒻⓄⓇ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ？

And what's wrong with the ASCII digits?


~mark

Re: Private Use areas

2018-08-27 Thread William_J_G Overington via Unicode

James Kass wrote:

> If a user has thousands of files using PUA characters, and all the files are 
> using the same PUA convention, why would each file need to contain metadata 
> for each PUA character used within?  (Rhetorical)

Because each such file would then be self-contained and free-standing.

Such metadata need not necessarily be a huge quantity of data.

William Overington

Monday 27 August 2018

Re: Private Use areas

2018-08-27 Thread Rebecca Bettencourt via Unicode

>
> > That sounds like a non-conformant use of characters in the U+24xx block.
>
> Well, you are an expert on these things and I do not understand as to with
> what it would be non-conformant.
>
>
A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ and
not as a signal to process what follows as anything other than plain text.

What you are proposing is a higher-level protocol, whether you realize it
or not. Unfortunately your higher-level protocol has a serious flaw in that
it cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ". Also, seeing a bunch of circled
alphanumeric characters in a document ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ.

There are plenty of already-existing higher-level protocols (you mentioned
one: XML) that could be used to provide information about PUA characters,
and they are all much better suited to that purpose than what you are
proposing.

Re: Private Use areas

2018-08-27 Thread James Kass via Unicode

William Overington wrote,



On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington
 wrote:

> Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters
> U+24B6 .. U+24E9.
>
> Use U+2473 as if it were a circled space.

ⓌⒽⓎ◯ⓃⓄⓉ◯ⓊⓈⒺ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ◯
ⒻⓄⓇ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ？

Re: Private Use areas

2018-08-27 Thread William_J_G Overington via Unicode

Here is the reply that I sent to Peter Constable and to the other people to 
whom he wrote.

Unlike for Mr Constable and for many other people, all of my posts have to be 
passed by the moderator, and I know why that is the situation. Though that 
situation was not imposed by a named official of Unicode Inc. acting in a 
stated official capacity.

So my opportunities to defend my ideas are conditional.

William Overington

Monday 27 August 2018

Original message
From : wjgo_10...@btinternet.com
Date : 2018/08/27 - 21:18 (GMTDT)
To : beckie...@gmail.com, verd...@wanadoo.fr, peter...@microsoft.com, 
wjgo_10...@btinternet.com, m...@kli.org, kenwhist...@att.net, 
richard.wording...@ntlworld.com, jameskass...@gmail.com
Subject : Re: Private Use areas

Well, it is a pity that you did not send your reply to the Unicode mailing list.

> That sounds like a non-conformant use of characters in the U+24xx block.

Well, you are an expert on these things and I do not understand as to with what 
it would be non-conformant.

It seems to me that for many years some people have wanted a way to convey 
information about the meaning of Private Use Area characters used in a document 
in an unobtrusive way within the document. The format that I am suggesting 
could be the basis of a way to do that.

I really do not understand the problem.

Ken Whistler wrote:

>>> > 1. Define a *protocol* for reliable interchange of custom character 
>>> > property information about PUA code points.

Some people use XML for things where two characters are used in a different 
manner.

A quick downbeat quip comment about my ideas with no explanation is not helpful 
and might, because of your standing, cause some people not to consider the idea 
even-handedly for concern of offending you.

I am reminded of a British film of 1955 called The Colditz Story. 

It used to be one of the regular films on the television years ago. 

I do not know whether it was ever shown in America, maybe, or maybe it is just 
a British thing. 

https://www.youtube.com/results?search_query=The+Colditz+story 

https://en.wikipedia.org/wiki/The_Colditz_Story 

The reason why I am reminded of that film is that one of the British prisoners 
devises a plan for a group of British prisoners to escape from Colditz 
disguised as German officers and just walk out of the gate. This is ridiculed 
as impossible because it has been tried before at various prisoner of war camps 
and the people have always been detected as British prisoners. The man 
suggesting the scheme then points out that the detection is because there is 
clearly something questionable about the direction from which the disguised 
prisoners arrive, such as from a prisoners' hut, that is the problem, not the 
quality of the disguises or the basic soundness of the idea. The man then 
suggests that they walk out of the German Officers' mess building. Please bear 
in mind that walking out of the door of the mess building does not mean 
actually being in the mess, it is a matter of going down the flight of stairs 
from a storage area, (the stairs having been accessed from under the stage of 
the castle theatre) walking past the entrance to the dining room and then out 
of the door, supposedly on their way back, after dinner, to their billets in 
the village. This done while a concert put on by some others of the prisoners, 
and attended by the senior German officers, is going on in the castle theatre. 

So, it is the bit about an idea coming from the wrong direction that reminds me 
of the film. 

https://www.youtube.com/watch?v=0eeSYvxVFUw 

https://www.youtube.com/watch?v=iY8jMkIbwDM 

https://www.youtube.com/watch?v=QxHsElyFsTI
William Overington

Monday 27 August 2018

Original message
From : peter...@microsoft.com
Date : 2018/08/27 - 20:33 (GMTDT)
To : wjgo_10...@btinternet.com, jameskass...@gmail.com, 
richard.wording...@ntlworld.com, m...@kli.org, beckie...@gmail.com, 
verd...@wanadoo.fr
Subject : RE: Private Use areas

That sounds like a non-conformant use of characters in the U+24xx block.

Peter

From: Unicode  On Behalf Of
William_J_G Overington via Unicode
Sent: Monday, August 27, 2018 2:00 AM
To: jameskass...@gmail.com; richard.wording...@ntlworld.com; m...@kli.org; 
beckie...@gmail.com; verd...@wanadoo.fr
Cc: unicode@unicode.org
Subject: Re: Private Use areas

Hi

How about the following method.

In a text file that contains text that uses Private Use Area characters, start 
the file with a sequence of Enclosed Alphanumeric characters from regular 
Unicode, that sequence containing the metadata relating to those Private Use 
Area characters as used in their present context.

http://www.unicode.org/charts/PDF/U2460.pdf

Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 
.. U+24E9.

Use U+2473 as if it were a circled space. The use of 20 to mean a space often 
occurs in web addresses. I know that there it is hex

Re: Private Use areas

2018-08-27 Thread James Kass via Unicode

Peter Constable wrote,

> That sounds like a non-conformant use of characters in the U+24xx block.

Non-conformant?  Well, it's probably overkill anyway.  A simpler
method of identifying which PUA convention is being used for a file
would be to either have the first line of the file being something
like [PUA1] or to have the file name be something like
MYFILE.TXTPUA1.  Where "PUA1" equals the CSUR.  Other numbers
(PUA2, PUA3, etc.) for other PUA conventions.

If a user has thousands of files using PUA characters, and all the
files are using the same PUA convention, why would each file need to
contain metadata for each PUA character used within?  (Rhetorical)

The "prior agreement" part about PUA usage means the user would know
in advance how to display the text properly.

RE: Private Use areas

2018-08-27 Thread Peter Constable via Unicode

This was meant to go to the list.

From: Peter Constable
Sent: Monday, August 27, 2018 12:33 PM
To: wjgo_10...@btinternet.com; jameskass...@gmail.com; 
richard.wording...@ntlworld.com; m...@kli.org; beckie...@gmail.com; 
verd...@wanadoo.fr
Subject: RE: Private Use areas

That sounds like a non-conformant use of characters in the U+24xx block.

Peter

From: Unicode <unicode-boun...@unicode.org> 
On Behalf Of William_J_G Overington via Unicode
Sent: Monday, August 27, 2018 2:00 AM
To: jameskass...@gmail.com; richard.wording...@ntlworld.com; m...@kli.org; 
beckie...@gmail.com; verd...@wanadoo.fr
Cc: unicode@unicode.org
Subject: Re: Private Use areas

Hi

How about the following method.

In a text file that contains text that uses Private Use Area characters, start 
the file with a sequence of Enclosed Alphanumeric characters from regular 
Unicode, that sequence containing the metadata relating to those Private Use 
Area characters as used in their present context.

http://www.unicode.org/charts/PDF/U2460.pdf

Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 
.. U+24E9.

Use U+2473 as if it were a circled space. The use of 20 to mean a space often 
occurs in web addresses. I know that there it is hexadecimal and here it is 
decimal but it has the same look of being an encoded space and so that is why I 
am suggesting using it.

Start the sequence with PUAINFO encoded using seven circled Latin letters and 
any character other than a carriage return or a line feed shows that the 
sequence has ended. The use of PUAINFO encoded using seven circled Latin 
letters at the start of the sequence is so that text using enclosed 
alphanumeric characters for another purpose would not become disrupted.

Then a suitable software application can read the text file and then, either 
automatically or after the clicking of a button, extract metadata information 
from the sequence of enclosed alphanumeric characters and not display the 
sequence of enclosed alphanumeric characters.

Maybe other circled numbers in the range 10 through to 19 would have special 
meanings.

This method would keep everything within plane zero.
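
As a rough illustration only, a reader for such a leading sequence might be sketched in Python as follows. It assumes the mapping described above (circled letters to letters, circled digits to digits, U+2473 to a space, carriage returns and line feeds ignored inside the sequence); decoding U+2469 as "10", and the sample metadata " RTL", are invented for the example, as is the function name split_puainfo.

```python
# "PUAINFO" spelled with circled Latin capital letters (U+24B6..U+24CF).
MARKER = "".join(chr(0x24B6 + ord(c) - ord("A")) for c in "PUAINFO")


def decode_enclosed(ch):
    """Plain-text value of one enclosed alphanumeric, or None if it is not one."""
    cp = ord(ch)
    if cp == 0x24EA:
        return "0"
    if 0x2460 <= cp <= 0x2469:
        return str(cp - 0x2460 + 1)    # circled 1..10 (assumed decoding)
    if cp == 0x2473:
        return " "                     # circled twenty used as a space
    if 0x24B6 <= cp <= 0x24CF:
        return chr(cp - 0x24B6 + ord("A"))
    if 0x24D0 <= cp <= 0x24E9:
        return chr(cp - 0x24D0 + ord("a"))
    return None


def split_puainfo(text):
    """Return (decoded metadata or None, text to display)."""
    if not text.startswith(MARKER):
        return None, text
    decoded, i = [], len(MARKER)
    while i < len(text):
        ch = text[i]
        if ch in "\r\n":               # line breaks do not end the sequence
            i += 1
            continue
        value = decode_enclosed(ch)
        if value is None:              # any other character ends the sequence
            break
        decoded.append(value)
        i += 1
    return "".join(decoded), text[i:]


sample = MARKER + "\u2473\u24C7\u24C9\u24C1\n" + "Hello \uE000"
print(split_puainfo(sample))
```

Text that does not begin with the marker is returned unchanged, so enclosed alphanumerics used for other purposes are not disturbed.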

William Overington

Monday 27 August 2018

Original message
From : unicode@unicode.org
Date : 2018/08/21 - 23:23 (GMTDT)
To : d...@ewellic.org
Cc : unicode@unicode.org
Subject : Re: Private Use areas

On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode 
<unicode@unicode.org> wrote:
Ken Whistler wrote:

> The way forward for folks who want to do this kind thing is:
>
> 1. Define a *protocol* for reliable interchange of custom character
> property information about PUA code points.

I've often thought that would be a great idea. You can't get to steps 2
and 3 without step 1. I'd gladly participate in such a project.

As would I.

Re: Private Use areas

2018-08-27 Thread William_J_G Overington via Unicode

Hi

How about the following method.

In a text file that contains text that uses Private Use Area characters, start 
the file with a sequence of Enclosed Alphanumeric characters from regular 
Unicode, that sequence containing the metadata relating to those Private Use 
Area characters as used in their present context.

http://www.unicode.org/charts/PDF/U2460.pdf

Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 
.. U+24E9.

Use U+2473 as if it were a circled space. The use of 20 to mean a space often 
occurs in web addresses. I know that there it is hexadecimal and here it is 
decimal but it has the same look of being an encoded space and so that is why I 
am suggesting using it.

Start the sequence with PUAINFO encoded using seven circled Latin letters and 
any character other than a carriage return or a line feed shows that the 
sequence has ended. The use of PUAINFO encoded using seven circled Latin 
letters at the start of the sequence is so that text using enclosed 
alphanumeric characters for another purpose would not become disrupted.

Then a suitable software application can read the text file and then, either 
automatically or after the clicking of a button, extract metadata information 
from the sequence of enclosed alphanumeric characters and not display the 
sequence of enclosed alphanumeric characters.

Maybe other circled numbers in the range 10 through to 19 would have special 
meanings.

This method would keep everything within plane zero.

William Overington

Monday 27 August 2018

Original message
From : unicode@unicode.org
Date : 2018/08/21 - 23:23 (GMTDT)
To : d...@ewellic.org
Cc : unicode@unicode.org
Subject : Re: Private Use areas

On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode  
wrote:

Ken Whistler wrote:

> The way forward for folks who want to do this kind thing is: 
>
> 1. Define a *protocol* for reliable interchange of custom character
> property information about PUA code points. 

I've often thought that would be a great idea. You can't get to steps 2
and 3 without step 1. I'd gladly participate in such a project. 

As would I.

RE: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-27 Thread Peter Constable via Unicode

Layout engines that support CJK vertical layout do not rely on the 'vert' 
feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° 
and switch to using vertical glyph metrics. The 'vert' feature is used to 
substitute vertical alternate glyphs as needed, such as for punctuation that 
isn't automatically rotated (and would probably need a differently-positioned 
alternate in any case).

Cf. UAX 50.

Peter

-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Tuesday, August 21, 2018 3:02 AM
To: unicode@unicode.org
Subject: Re: Private Use areas (was: Re: Thoughts on working with the Emoji 
Subcommittee (was ...))

On Tue, 21 Aug 2018 08:53:18 +0800
via Unicode  wrote:

> On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote:

> > Still, maybe it
> > doesn't really matter much: your special-purpose font can treat any 
> > codepoint any way it likes, right?

> Not all properties come from the font. For example a Zhuang character 
> PUA font, which supplements CJK ideographs, does not rotate characters 
> 90 degrees, when change from RTL to vertical display of text.

Isn't that supposed to be treated by an OpenType feature such as 'vert'?  Or 
does the rendering stack get in the way?

However, one might need reflowing text to be about 40% WJ.

Richard.

Re: Private Use areas

2018-08-26 Thread WORDINGHAM RICHARD via Unicode



> On 21 August 2018 at 01:04 "Mark E. Shoulson via Unicode" 
>  wrote:
> 
> It is kind of a bummer, though, that you can't experiment (easily?  or at 
> all?) in the PUA with scripts that have complex behavior, or even 
> not-so-complex behavior like accents & combining marks, or RTL direction 
> (here, also, am I speaking true?  Is there a block of RTL PUA also?  I guess 
> there's always RLO, but meh.)  Still, maybe it doesn't really matter much: 
> your special-purpose font can treat any codepoint any way it likes, right?
> 
> ~mark
> 
> 
Back in 2006, I was typing the Tai Tham script (then being proposed as the 
Lanna script) using the PUA and exploring the issue of selecting between what 
are now  and  based on the preceding character and 
between what are now  and  based on the preceding base 
character and its subscripts.  I was also looking at using variation selectors 
to override the rules.  I was using SIL Graphite fonts when they were getting 
intermittent support in OpenOffice and Firefox - my main display engine was 
WorldPad.  Nowadays, SIL Graphite seems to be securely supported in LibreOffice 
and Firefox.  Now, back then, Graphite was at least attempting to support RTL; 
I would expect the RTL support to work well by now.

On the other hand, experimenting with OpenType is much harder.  The best I've 
found is transcoding to a Latin range and using an ssxx feature to convert the 
Latin glyphs back to those for the complex script.  I do that to render Tai 
Tham in Internet Explorer 11 on Windows 7; this complex scheme is a fallback 
for when the rendering engine fails.

Richard.

Re: Emacs Verbose Character Entry (was Private Use Areas)

2018-08-24 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 22:15 +0100, unicode@unicode.org writes:
> On Thu, 23 Aug 2018 21:47:03 +0200
> "Janusz S. Bień via Unicode"  wrote:
>
>> My needs are very simple, for example C-x 8 Return LATIN CAPITAL
>> LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with
>> the code E010. I can provide the list of names and codes.
>
> While it should obviously yield, if anything,  or
>  for 'LATIN CAPITAL LETTER A WITH MACRON AND
> BREVE',

In my opinion there is no question what

'LATIN CAPITAL LETTER A WITH MACRON AND BREVE'

should yield, because the name should be absent on the name list.

My example concerns names like

'LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]'
'COMBINING ABBREVIATION MARK SUPERSCRIPT UR ROUND R FORM [MUFI]'

etc.

[...]

> The Emacs command "C-x 8 RET" expects the name of a single codepoint.

It's OK and in my opinion it should stay this way.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-24 Thread Janusz S. Bień via Unicode

On Fri, Aug 24 2018 at 16:12 +0300, e...@gnu.org writes:
>> From: jsb...@mimuw.edu.pl (Janusz S. Bień)
>> Cc: unicode@unicode.org,  richard.wording...@ntlworld.com
>> Date: Thu, 23 Aug 2018 21:47:03 +0200
>> 
>> I'm very glad you joined the discussion.
>
> I'm sorry for not joining sooner.  In my defense, I missed the
> reference to Emacs, and the rest of the discussion is not really
> interesting for me, as using PUA for new characters is not something I
> have interest in or experience with.

I don't think you missed anything important.

>
>> My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER
>> A WITH MACRON AND BREVE [MUFI] should yield the character with the code
>> E010. I can provide the list of names and codes.
>
> So you'd like to extend "C-x 8 RET" to recognize names of additional
> characters and associate them with codepoints in the PUA area?  That
> shouldn't be hard to add.

I would prefer extensibility over efficiency; I don't mind loading PUA
information from a source declared somehow in .emacs.d, so I can
change/expand the list of characters from time to time.
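
The kind of user-maintained name list described above can be sketched outside Emacs. The following is a minimal Python illustration, not Emacs code: the "HEX;NAME" file format, the function names and the SAMPLE list are assumptions made for this sketch, and only the E010 example comes from the message itself. Names missing from the private list fall back to the standard Unicode name table.

```python
import unicodedata


def load_pua_names(lines):
    """Parse 'HEX;NAME' records (an assumed format) into a name -> code point table."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_hex, name = line.split(";", 1)
        table[name.strip().upper()] = int(code_hex.strip(), 16)
    return table


def lookup_character(name, pua_table):
    """Return the character for a name, trying the private list first."""
    key = name.strip().upper()
    if key in pua_table:
        return chr(pua_table[key])
    # Fall back to the standard names; raises KeyError for unknown names.
    return unicodedata.lookup(key)


# Hypothetical user-maintained list; it could be edited and reloaded at any time.
SAMPLE = [
    "# code;name",
    "E010;LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]",
]
PUA_NAMES = load_pua_names(SAMPLE)
```

Because the table is rebuilt from plain text on every load, extending the list needs no recompilation, which is the extensibility asked for here.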

> But is that all? won't you also want to tell Emacs about the
> properties of those characters?

Personally I would like additionally to be able to change the case of a
letter or string, and I am willing to prepare the necessary information
for MUFI characters.

Displaying other properties would be nice, but for me this is not
crucial. Moreover, somebody has to prepare the data...

> or be able to set up fonts for displaying them?

It would be nice. I haven't asked for it because I typeset my texts with
XeTeX or LuaTeX and the input is more important for me than rendering.

> IOW, would it be okay to have these
> characters be "second-class citizens" in Emacs?

For me it would be acceptable.

BTW, I just got perhaps a crazy idea: what about treating a PUA
declaration (as you probably noticed, there may be conflicting ones) as a
separate coding system? Of course some mechanism for escaping the
standard PUA interpretation would be needed.

>
>> > It is true that the Unicode related data is produced at build time,
>> > but only some of that is actually recorded in the Emacs binary, the
>> > rest is loaded upon demand.  But all the data is stored in data
>> > structures that are mutable, given some Lisp programming.
>> 
>> I never was fluent in Lisp programming and by now I forgot almost
>> everything I knew, so it's not a task for me. I was thinking about
>> submitting a feature request, but I forgot also the proper procedures to
>> do it.
>
> The proper procedure is to type "M-x report-emacs-bug RET" and then
> describe the feature(s) you'd like to see added/improved.

I will definitely remember now :-)

>
>> Moreover I had the impression that I'm the only person who needs
>> it...
>
> That shouldn't stop you.  Many a feature in Emacs started as a request
> from a single individual.
>
>> > (It is not clear to me which part of the Unicode data you would like
>> > to change; are you talking about adding characters to the list of
>> > those defined by Unicode?  If you are using the PUA codepoints, it's
>> > possible that you will need to update Emacs's notion of PUA as well.)
>> 
>> Yes, I would like the PUA codepoints to be handled analogically as the
>> proper ones. What do you mean by Emacs's notion of PUA?
>
> Emacs knows about the PUA regions of the Unicode code-space, and
> treats those codepoints specially.  The features you request will
> probably need to affect the PUA region as well, because the codepoints
> you use should no longer be treated as PUA.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-24 Thread William_J_G Overington via Unicode

Hi

An approach that you might like to consider in relation to fonts is that it is 
possible to have in a font a Description field that consists of plain text.

It is stored twice in the font, in two different ways, one of which is just 
plain text, possibly just ASCII.

So if you had text such as

$$$PUAB

and so on in that Description field, then a software application could search 
for all occurrences of $$$ and gather information for each set of data in that 
way, without needing separate OpenType tables.

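
The search for $$$ markers could look roughly like the following Python sketch. The message does not say what follows a marker, so the layout assumed here (a four-character tag directly after $$$, with the data running up to the next marker) is purely hypothetical, as are the function name and the EXAMPLE text.

```python
def split_tagged_sections(description, marker="$$$", tag_length=4):
    """Collect the text following each marker+tag in a Description string.

    Assumed layout: '$$$' + a four-character tag such as PUAB, then the
    data for that tag up to the next marker or the end of the text.
    """
    sections = {}
    chunks = description.split(marker)
    for chunk in chunks[1:]:  # text before the first marker is ordinary description
        tag, body = chunk[:tag_length], chunk[tag_length:]
        if len(tag) == tag_length:
            sections.setdefault(tag, []).append(body.strip())
    return sections


# Hypothetical Description text, invented for this sketch.
EXAMPLE = (
    "A font made for demonstration.\n"
    "$$$PUAB\nE000..E0FF; Hypothetical Block\n"
    "$$$PUAD\nE010;EXAMPLE NAME;Lu;0;L;;;;;N;;;;;\n"
)
```

Reading the Description field out of a real font file is a separate step that is not shown here.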
As an example of how information can be stored in the Description field, here is 
a link to a font that I made years ago.

If you download the font and open it in WordPad, the text can be read.

The direct link is as follows.

www.users.globalnet.co.uk/~ngo/SPANGBLU.TTF

The font is also linked from the following web page, about a quarter of the way 
down the page.

http://www.users.globalnet.co.uk/~ngo/fonts.htm

The web pages encoded in the font are for three of the songs linked from the 
following page.

http://www.users.globalnet.co.uk/~ngo/song0001.htm

Best regards,

William Overington

Friday 24 August 2018

Original message
>From : unicode@unicode.org
Date : 2018/08/21 - 19:23 (GMTDT)
To : unicode@unicode.org
Subject : Re: Private Use areas

On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode wrote:

> I think PUA users should provide the
> properties of the characters used in a form analogical to the Unicode
> itself, and the software should be able to use this additional
> information.

I already provide this myself for my uses of the PUA as well as the CSUR and 
any vendor-specific agreements I can find:

http://www.kreativekorp.com/charset/PUADATA/

Of course there is no way to get software to use this information. I have 
entertained the idea of being able to embed this information into the font 
itself as OpenType tables, e.g.:

PUAB -> Blocks.txt
PUAC -> CaseFolding.txt
PUAW -> EastAsianWidth.txt
PUAL -> LineBreak.txt
PUAD -> UnicodeData.txt

I've actually invented table names for the majority of UCD files, but those are 
probably the most relevant. The table names for the more obscure files get 
rather... creative, e.g.:

PUA[ -> BidiBrackets.txt
PUA] -> BidiMirroring.txt

That alone may get some people to think twice about this idea. :P
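
Whether the data lives in a PUAD table or in a separate file, software that wanted to use it would have to read UnicodeData.txt-style records. Below is a small Python sketch of that step. The 15-field semicolon layout is that of the real UnicodeData.txt; the two sample records, the CharRecord type and the helper names are invented for illustration. It also shows how the simple case-mapping fields could drive a case change for private-use letters.

```python
from typing import NamedTuple, Optional


class CharRecord(NamedTuple):
    code: int
    name: str
    category: str
    bidi_class: str
    upper: Optional[int]
    lower: Optional[int]


def _hex_or_none(field):
    return int(field, 16) if field else None


def parse_unicode_data_line(line):
    """Parse one UnicodeData.txt-style record (15 semicolon-separated fields)."""
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got %d" % len(fields))
    return CharRecord(
        code=int(fields[0], 16),
        name=fields[1],
        category=fields[2],
        bidi_class=fields[4],
        upper=_hex_or_none(fields[12]),  # simple uppercase mapping
        lower=_hex_or_none(fields[13]),  # simple lowercase mapping
    )


def to_lower(text, records):
    """Lowercase text, consulting the private records before the built-in rules."""
    out = []
    for ch in text:
        rec = records.get(ord(ch))
        if rec is not None and rec.lower is not None:
            out.append(chr(rec.lower))
        else:
            out.append(ch.lower())
    return "".join(out)


# Made-up records for two private-use letters; not taken from any real agreement.
SAMPLE = [
    "E000;EXAMPLE CAPITAL LETTER;Lu;0;L;;;;;N;;;;E001;",
    "E001;EXAMPLE SMALL LETTER;Ll;0;L;;;;;N;;;E000;;E000",
]
RECORDS = {rec.code: rec for rec in map(parse_unicode_data_line, SAMPLE)}
```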

Re: Emacs Verbose Character Entry (was Private Use Areas)

2018-08-24 Thread Eli Zaretskii via Unicode

> Date: Thu, 23 Aug 2018 22:15:10 +0100
> From: Richard Wordingham via Unicode 
> 
> On Thu, 23 Aug 2018 21:47:03 +0200
> "Janusz S. Bień via Unicode"  wrote:
> 
> > My needs are very simple, for example C-x 8 Return LATIN CAPITAL
> > LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with
> > the code E010. I can provide the list of names and codes.
> 
> While it should obviously yield, if anything,  or
>  for 'LATIN CAPITAL LETTER A WITH MACRON AND
> BREVE', it would probably be more important to recognise formal
> aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo
> ling (U+0EA5 LAO LETTER LO LOOT), not to be confused with the Lao
> letter lo lot (a.k.a. ro rot), U+0EA3 LAO LETTER LO LING.
> 
> For , I prefer to type "A\_M_X", but then I learnt
> XSAMPA. 

The Emacs command "C-x 8 RET" expects the name of a single codepoint.
It should be possible to extend it (or perhaps provide a separate
command) to produce named sequences of codepoints, such as those in
the above examples, but there's no such feature as of now.  If this
would be a useful addition, please suggest that on the Emacs issue
tracker (using "M-x report-emacs-bug"), and please include with your
request the sources where we could find such named sequences to
support.

Thanks.

Re: Private Use areas

2018-08-24 Thread Eli Zaretskii via Unicode

> From: jsb...@mimuw.edu.pl (Janusz S. Bień)
> Cc: unicode@unicode.org,  richard.wording...@ntlworld.com
> Date: Thu, 23 Aug 2018 21:47:03 +0200
> 
> I'm very glad you joined the discussion.

I'm sorry for not joining sooner.  In my defense, I missed the
reference to Emacs, and the rest of the discussion is not really
interesting for me, as using PUA for new characters is not something I
have interest in or experience with.

> My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER
> A WITH MACRON AND BREVE [MUFI] should yield the character with the code
> E010. I can provide the list of names and codes.

So you'd like to extend "C-x 8 RET" to recognize names of additional
characters and associate them with codepoints in the PUA area?  That
shouldn't be hard to add.  But is that all? won't you also want to
tell Emacs about the properties of those characters? or be able to set
up fonts for displaying them?  IOW, would it be okay to have these
characters be "second-class citizens" in Emacs?

> > It is true that the Unicode related data is produced at build time,
> > but only some of that is actually recorded in the Emacs binary, the
> > rest is loaded upon demand.  But all the data is stored in data
> > structures that are mutable, given some Lisp programming.
> 
> I never was fluent in Lisp programming and by now I forgot almost
> everything I knew, so it's not a task for me. I was thinking about
> submitting a feature request, but I forgot also the proper procedures to
> do it.

The proper procedure is to type "M-x report-emacs-bug RET" and then
describe the feature(s) you'd like to see added/improved.

> Moreover I had the impression that I'm the only person who needs
> it...

That shouldn't stop you.  Many a feature in Emacs started as a request
from a single individual.

> > (It is not clear to me which part of the Unicode data you would like
> > to change; are you talking about adding characters to the list of
> > those defined by Unicode?  If you are using the PUA codepoints, it's
> > possible that you will need to update Emacs's notion of PUA as well.)
> 
> Yes, I would like the PUA codepoints to be handled analogically as the
> proper ones. What do you mean by Emacs's notion of PUA?

Emacs knows about the PUA regions of the Unicode code-space, and
treats those codepoints specially.  The features you request will
probably need to affect the PUA region as well, because the codepoints
you use should no longer be treated as PUA.
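
For reference, the private-use regions of the code space are U+E000..U+F8FF in the BMP plus planes 15 and 16 up to their last two noncharacters. The Python sketch below only illustrates how a stock character database sees such codepoints (general category Co, no name); it says nothing about how Emacs itself implements this, and the helper name is an assumption of the sketch.

```python
import unicodedata

# The three private-use ranges defined by the Unicode Standard.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch):
    """True if the character lies in one of the private-use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


# A stock database knows only the category of such a character:
# unicodedata.category("\ue010") is "Co", and unicodedata.name("\ue010", None)
# returns the default because private-use characters carry no name.
```

Any scheme that gives these codepoints names and properties has to override exactly this default view.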

Re: Private Use areas

2018-08-24 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 11:49 -0700, beckie...@gmail.com writes:
> On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień  wrote:
>
>  > I already provide this myself for my uses of the PUA as well as the
>  > CSUR and any vendor-specific agreements I can find:
>  >
>  > http://www.kreativekorp.com/charset/PUADATA/
>
>  I would prefer to see the data in a repository, so others can
>  comment and contribute.
>
> That is actually my intent for the future. Though it's not quite ready yet:
>
> https://github.com/kreativekorp/charset/tree/master/puadata

Great!

>
> That's the data in a "pre-compiled" form; it's turned into a "proper"
> PUADATA directory using this script:
>
> https://github.com/kreativekorp/charset/blob/master/bin/build-public.py
>
>  As for "any vendor-specific agreements", do MUFI and LINCUA qualify?
>
> I certainly do want to see MUFI and LINCUA provided in this form, but
> I put them in a different category along with CSUR. I basically have
> three categories of PUA agreements:
>
> Fonts - PUA assignments specific to a font family, e.g. Constructium, 
> Fairfax, Nishiki-teki, Quivira, Junicode, etc.

You are probably aware that Junicode 1.000, released in September 2017,
supports in full MUFI 4.0 (released in December 2015). I don't know
whether Junicode now contains any PUA characters which are not in MUFI.

>
> Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR,
> MUFI, LINCUA, etc.
>
> Vendors - PUA assignments meant to be used by a single vendor or
> platform, e.g. Adobe, Apple, etc. but also Linux, MirOS, etc.
>
> Thank you for those links by the way. I had tried to find charts for
> MUFI in the past but had somehow been unsuccessful.

Similar files for a different purpose have been created by Mikkel Eide
Eriksen:

https://github.com/mikkelee/mufi-latex

An earlier version of MUFI was incorporated in the ENRICH Gaiji bank:

http://v2.manuscriptorium.com/apps/gbank/

You can download the source but it doesn't seem useful.

A version of MUFI is available also as a searchable character database
created by the present single-person MUFI board, i.e. Tarrin Wills, as a
part of the beta version of a new MUFI site:

http://skaldic.abdn.ac.uk/m.php?p=mufi

Some time ago I wrote on the mufi-fonts list:

--8<---cut here---start->8---
On Sun, Dec 03 2017 at  6:55 +0100, jsb...@mimuw.edu.pl writes:

[...]

> I wanted the file quickly to get an overview of the recently released
> corpus of 16th century Polish, and it seemed to me that the simplest
> and fastest way is to convert the PDF recommendation in a semi-automatic
> way. It was more cumbersome than I expected, but thanks to this approach
> I've discovered a typo in the recommendation: letter I instead of digit
> 1 in EAFI, the code for LATIN ENLARGED LETTER SMALL LIGATURE AE (p. 93
> in the code chart order version).
>
> For the planned extension of the program I need more info on MUFI
> characters, preferably in the format of the UnicodeData.txt. This time
> however I intend to make haste slowly, so I have a question:
>
> Is it possible to make publicly available for download the database
> underlying http://skaldic.abdn.ac.uk/db.php?if=mufi&table=mufi_char?

--8<---cut here---end--->8---

Unfortunately I got no answer to the question.


>  > Of course there is no way to get software to use this information.
>
>  What kind of software do you have in mind?
>
> Unicode-related utilities, text editors to start with. You pretty much
> hit the nail on the head with uniname and emacs as examples. :)

Thanks! As for uniname by Bill Poser, I exchanged mails with him in
2011:

--8<---cut here---start->8---
On Sun, Aug 28 2011 at 12:01 +0200, jsb...@mimuw.edu.pl writes:

[...]

> A student of mine wrote an alternative program according to my
> specification. The program is GPLed and available with
>
> git clone http://students.mimuw.edu.pl/~findepi/unihistext unihistext

Now https://bitbucket.org/jsbien/unihistext

>
> The source is ready for Debian packaging.
>
> I think the program is worth better distribution, but its author is no
> longer interested in it. Would you be so kind to consider including
> either the program itself in your uniutils or extend your unidesc with
> its features?
>
> Best regards
>
> Janusz

On Sun, Aug 28 2011 at 16:03 -0700, billpos...@gmail.com writes:
> In principle, sure. I'll have a look at it.

--8<---cut here---end--->8---

Unfortunately nothing happened, and I thought I should not press the
point.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Emacs Verbose Character Entry (was Private Use Areas)

2018-08-23 Thread Richard Wordingham via Unicode

On Thu, 23 Aug 2018 21:47:03 +0200
"Janusz S. Bień via Unicode"  wrote:

> My needs are very simple, for example C-x 8 Return LATIN CAPITAL
> LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with
> the code E010. I can provide the list of names and codes.

While it should obviously yield, if anything,  or
 for 'LATIN CAPITAL LETTER A WITH MACRON AND
BREVE', it would probably be more important to recognise formal
aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo
ling (U+0EA5 LAO LETTER LO LOOT), not to be confused with the Lao
letter lo lot (a.k.a. ro rot), U+0EA3 LAO LETTER LO LING.

For , I prefer to type "A\_M_X", but then I learnt
XSAMPA. 

Richard.

Re: Private Use areas

2018-08-23 Thread Richard Wordingham via Unicode

On Thu, 23 Aug 2018 20:34:20 +0200
"Janusz S. Bień via Unicode"  wrote:

> This is a typical but IMHO obsolete perspective. Fonts are for
> *rendering*, new characters and variants are more and more often
> needed for *input* of real life old texts with sufficient precision.

If we're talking about glyphs which don't actually correspond to new
characters, then that sounds like a good case for private use variation
selectors. To quote Tully, "Abusus non tollit usum".

Richard.

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 22:17 +0300, e...@gnu.org writes:
>> Date: Thu, 23 Aug 2018 20:30:52 +0200
>> Cc: Richard Wordingham 
>> From: "Janusz S. Bień via Unicode" 
>> 
>> >> and in Emacs - to my disappointment it looks like the Unicode data are
>> >> set at the compile time, but perhaps this can be negotiated with the
>> >> developers.
>> >
>> > Can you be more specific?
>> 
>> I often search characters by name with C-x 8 Return. I would like to use
>> it also for MUFI characters, I have already the name list (the example
>> directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked
>> very closely into the problem and don't remember now the details, but my
>> impression was that it's not simple.
>
> What is "it" in the last sentence?  IOW, what is not simple about that
> with Emacs?

I'm very glad you joined the discussion.

My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER
A WITH MACRON AND BREVE [MUFI] should yield the character with the code
E010. I can provide the list of names and codes.

>
> It is true that the Unicode related data is produced at build time,
> but only some of that is actually recorded in the Emacs binary, the
> rest is loaded upon demand.  But all the data is stored in data
> structures that are mutable, given some Lisp programming.

I never was fluent in Lisp programming and by now I forgot almost
everything I knew, so it's not a task for me. I was thinking about
submitting a feature request, but I forgot also the proper procedures to
do it. Moreover I had the impression that I'm the only person who needs
it...

>
> (It is not clear to me which part of the Unicode data you would like
> to change; are you talking about adding characters to the list of
> those defined by Unicode?  If you are using the PUA codepoints, it's
> possible that you will need to update Emacs's notion of PUA as well.)

Yes, I would like the PUA codepoints to be handled analogically as the
proper ones. What do you mean by Emacs's notion of PUA?

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Eli Zaretskii via Unicode

> Date: Thu, 23 Aug 2018 20:30:52 +0200
> Cc: Richard Wordingham 
> From: "Janusz S. Bień via Unicode" 
> 
> >> and in Emacs - to my disappointment it looks like the Unicode data are
> >> set at the compile time, but perhaps this can be negotiated with the
> >> developers.
> >
> > Can you be more specific?
> 
> I often search characters by name with C-x 8 Return. I would like to use
> it also for MUFI characters, I have already the name list (the example
> directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked
> very closely into the problem and don't remember now the details, but my
> impression was that it's not simple.

What is "it" in the last sentence?  IOW, what is not simple about that
with Emacs?

It is true that the Unicode related data is produced at build time,
but only some of that is actually recorded in the Emacs binary, the
rest is loaded upon demand.  But all the data is stored in data
structures that are mutable, given some Lisp programming.

(It is not clear to me which part of the Unicode data you would like
to change; are you talking about adding characters to the list of
those defined by Unicode?  If you are using the PUA codepoints, it's
possible that you will need to update Emacs's notion of PUA as well.)

Re: Private Use areas

2018-08-23 Thread Rebecca Bettencourt via Unicode

On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień  wrote:

> > I already provide this myself for my uses of the PUA as well as the
> > CSUR and any vendor-specific agreements I can find:
> >
> > http://www.kreativekorp.com/charset/PUADATA/
>
> I would prefer to see the data in a repository, so others can
> comment and contribute.
>

That is actually my intent for the future. Though it's not quite ready yet:

https://github.com/kreativekorp/charset/tree/master/puadata

That's the data in a "pre-compiled" form; it's turned into a "proper"
PUADATA directory using this script:

https://github.com/kreativekorp/charset/blob/master/bin/build-public.py

As for "any vendor-specific agreements", do MUFI and LINCUA qualify?
>

I certainly do want to see MUFI and LINCUA provided in this form, but I put
them in a different category along with CSUR. I basically have three
categories of PUA agreements:

Fonts - PUA assignments specific to a font family, e.g. Constructium,
Fairfax, Nishiki-teki, Quivira, Junicode, etc.

Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR, MUFI,
LINCUA, etc.

Vendors - PUA assignments meant to be used by a single vendor or platform,
e.g. Adobe, Apple, etc. but also Linux, MirOS, etc.

Thank you for those links by the way. I had tried to find charts for MUFI
in the past but had somehow been unsuccessful.

> Of course there is no way to get software to use this information.
>
> What kind of software do you have in mind?
>

Unicode-related utilities, text editors to start with. You pretty much hit
the nail on the head with uniname and emacs as examples. :)

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 17:26 +0100, unicode@unicode.org writes:
> On Thu, 23 Aug 2018 17:39:15 +0200
> Philippe Verdy via Unicode  wrote:
>
>> You make a confusion: I do not propose "hacking" existing codes, but
>> instead adding new codes for private variations. It's then up to PUV
>> sequence authors to choose an appropriate base character that can
>> have the properties they want to be inherited by the private-use
>> variation sequence, or to choose a base character that will provide
>> some reasonable reading if rendered as is (by renderers or fonts
>> not implementing the private variation sequence, given that they will
>> also append a symbol for the PUV itself after the standard character).
>
> Variation sequences cannot be used to add new characters.  Most PUA
> characters are used to represent new characters.  A
> standard-conformant private variation sequence would generally achieve
> the same effect as could be achieved by a font feature (typically one
> of the cvxx, though possibly one of the ssxx),

This is a typical but IMHO obsolete perspective. Fonts are for
*rendering*, new characters and variants are more and more often needed
for *input* of real life old texts with sufficient precision.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 17:11 +0100, unicode@unicode.org writes:
> On Thu, 23 Aug 2018 14:10:35 +0200
> "Janusz S. Bień via Unicode"  wrote:
>
>> What kind of software do you have in mind?
>> 
>> I'm primarily interested in the locally developed programs
>> 
>> https://bitbucket.org/jsbien/unihistext/
>> 
>> https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/
>
> It looks as though the security certificates are awry - has someone
> forgotten to pay the protection money to the right people?  (Firefox
> objects with "The page you are trying to view cannot be shown because
> the authenticity of the received data could not be verified.")

I see no such problems with Firefox ESR 52.9.0 on Debian
testing. Moreover the program reports that the certificate is valid till
04/21/2020.

>
>> and in Emacs - to my disappointment it looks like the Unicode data are
>> set at the compile time, but perhaps this can be negotiated with the
>> developers.
>
> Can you be more specific?

I often search characters by name with C-x 8 Return. I would like to use
it also for MUFI characters, I have already the name list (the example
directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked
very closely into the problem and don't remember now the details, but my
impression was that it's not simple.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Philippe Verdy via Unicode

Le jeu. 23 août 2018 à 18:31, Richard Wordingham via Unicode <
unicode@unicode.org> a écrit :

> On Thu, 23 Aug 2018 17:39:15 +0200
> Philippe Verdy via Unicode  wrote:
>
> > You make a confusion: I do not propose "hacking" existing codes, but
> > instead adding new codes for private variations. It's then up to PUV
> > sequence authors to choose an appropriate base character that can
> > have the properties they want to be inherited by the private-use
> > variation sequence, or to choose a base character that will provide
> > some reasonable reading if rendered as is (by renderers or fonts
> > not implementing the private variation sequence, given that they will
> > also append a symbol for the PUV itself after the standard character).
>
> Variation sequences cannot be used to add new characters.


Do you remember that I did not speak about existing variation sequences? Only
about the new encoding of private-use variation sequences, which do not have
to obey the policy of existing VS, and whose purpose would be to inherit
most properties (notably direction, breaking, spacing, general category) of
another existing character.



> Most PUA
> characters are used to represent new characters.


I did not speak about PUAs either.

Re: Private Use areas

2018-08-23 Thread Richard Wordingham via Unicode

On Thu, 23 Aug 2018 17:39:15 +0200
Philippe Verdy via Unicode  wrote:

> You make a confusion: I do not propose "hacking" existing codes, but
> instead adding new codes for private variations. It's then up to PUV
> sequence authors to choose an appropriate base character that can
> have the properties they want to be inherited by the private-use
> variation sequence, or to choose a base character that will provide
> some reasonable reading if rendered as is (by renderers or fonts
> not implementing the private variation sequence, given that they will
> also append a symbol for the PUV itself after the standard character).

Variation sequences cannot be used to add new characters.  Most PUA
characters are used to represent new characters.  A
standard-conformant private variation sequence would generally achieve
the same effect as could be achieved by a font feature (typically one
of the cvxx, though possibly one of the ssxx), though using font
features would be fiddlier and have more limited support, and variation
sequences would facilitate data processing.
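
To illustrate the data-processing point, here is a hedged Python sketch that builds and splits variation sequences. The selector ranges (VS1..VS16 at U+FE00..U+FE0F, VS17..VS256 at U+E0100..U+E01EF) are those of the standard; the function names are made up for this sketch, and treating an unregistered sequence as meaningful is of course only a private convention.

```python
def variation_selector(n):
    """Return the character VARIATION SELECTOR-n for 1 <= n <= 256."""
    if 1 <= n <= 16:
        return chr(0xFE00 + n - 1)
    if 17 <= n <= 256:
        return chr(0xE0100 + n - 17)
    raise ValueError("variation selector number must be in 1..256")


def is_variation_selector(ch):
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text):
    """Return (base, selector-or-None) pairs, attaching each selector to
    the character immediately before it."""
    pairs = []
    for ch in text:
        if is_variation_selector(ch) and pairs and pairs[-1][1] is None:
            base, _ = pairs[-1]
            pairs[-1] = (base, ch)
        else:
            pairs.append((ch, None))
    return pairs
```

A process that does not know the private convention can simply drop the second element of each pair and still has the base character, which is what makes this easier to handle than a font feature.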

Richard.

Re: Private Use areas

2018-08-23 Thread Richard Wordingham via Unicode

On Thu, 23 Aug 2018 14:10:35 +0200
"Janusz S. Bień via Unicode"  wrote:

> What kind of software do you have in mind?
> 
> I'm primarily interested in the locally developed programs
> 
> https://bitbucket.org/jsbien/unihistext/
> 
> https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/

It looks as though the security certificates are awry - has someone
forgotten to pay the protection money to the right people?  (Firefox
objects with "The page you are trying to view cannot be shown because
the authenticity of the received data could not be verified.")

> and in Emacs - to my disappointment it looks like the Unicode data are
> set at the compile time, but perhaps this can be negotiated with the
> developers.

Can you be more specific?  For Indic rearrangement I had to define
syllables myself with definitions which I then added to
composition-function-table.  Unfortunately, I then hit the problem
that I had to define Indic rearrangement myself, and OpenType fonts
fall into several incompatible families, which is why I haven't
released a general solution.  My emacs kit for Tai Tham is given via
http://www.wrdingham.co.uk/lanna/toolkit.html (a probable kinsman got
the 'o'), but there are a lot of odds and ends that need sorting out.

I would expect that you would be able to override any relevant
'compiler' settings via your Emacs start up file - I expect Eli
Zaretskii will be along soon with more details.  Of course, you could
always revert to the old tradition and recompile Emacs yourself -
though it may need something like MinGW to compile for Windows.

Richard.

Re: Private Use areas

2018-08-23 Thread Philippe Verdy via Unicode

You make a confusion: I do not propose "hacking" existing codes, but
instead adding new codes for private variations. It's then up to PUV
sequence authors to choose an appropriate base character that can have
the properties they want to be inherited by the private-use variation
sequence, or to choose a base character that will provide some reasonable
reading if rendered as is (by renderers or fonts not implementing the
private variation sequence, given that they will also append a symbol for
the PUV itself after the standard character).

Also I do not want to change anything about any existing variation sequences
(using VS1 and so on) and their encoding policies, requiring a prior
registration and standardisation.

Le jeu. 23 août 2018 à 11:42, Richard Wordingham via Unicode <
unicode@unicode.org> a écrit :

> On Wed, 22 Aug 2018 11:58:58 +0200
> Philippe Verdy via Unicode  wrote:
>
> > For now there's still no way to have variant sequences unless they are
> > registered and standardized by Unicode but registration should not be
> > needed (forbidden) for sequences containing PUV.
>
> I believe this scheme is no worse than hack encodings that use Latin
> character codes for other characters.  These schemes often work.
> (Indeed, the currently best method of getting Tai Tham displayed as rich
> text that I can find is to use a transliteration-type encoding and a
> special font, though I can now get pretty close using the proper
> character codes in the order laid down in the proposals.)
>
> The major problems I can see with appropriating variation sequences
> are:
> (1) It might be restricted to base characters - I have no
> experimental evidence on whether this would happen.  Fonts can happily
> convert base characters to combining characters, though this works
> best if Latin line-breaking rules take effect.
>
> (2) The appropriated variation sequence might be assigned a meaning -
> but this is no worse than the general ambiguity of PUA characters.
>
> (3) Some base characters get special treatment.  For example, I had
> to change my transliteration scheme because hyphen-minus is treated
> specially by MS Edge - I was using it as a digraph disjunctor - and
> so clusters were not being formed.  In this case, I would have come
> unstuck as soon as line-wrapping started, so it was a bad choice anyway.
>
> Or are there significant renderers that deliberately ignore variation
> selectors in unregistered, unstandardised variation sequences?  I don't
> recall any problems from when we were discussing variation
> sequences for chess pieces.
>
> For supplementing a script, it might be best to start at
> VARIATION-SELECTOR-256, and work down if need be with specialist
> characters.
>
> Richard.
>

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Tue, Aug 21 2018 at 11:23 -0700, unicode@unicode.org writes:
> On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode 
>  wrote:
>
>  I think PUA users should provide the
>  properties of the characters used in a form analogical to the Unicode
>  itself, and the software should be able to use this additional
>  information.
>
> I already provide this myself for my uses of the PUA as well as the
> CSUR and any vendor-specific agreements I can find:
>
> http://www.kreativekorp.com/charset/PUADATA/

I would prefer to see the data in a repository, so others can
comment and contribute.

As for "any vendor-specific agreements", do MUFI and LINCUA qualify?

https://folk.uib.no/hnooh/mufi/
http://andron-typeforum.xobor.de/t10f13-Towards-a-linguistic-corporate-use-area-LINCUA.html

>
> Of course there is no way to get software to use this information.

What kind of software do you have in mind?

I'm primarily interested in the locally developed programs

https://bitbucket.org/jsbien/unihistext/

https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/

and in Emacs - to my disappointment it looks like the Unicode data are set
at compile time, but perhaps this can be negotiated with the
developers.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Richard Wordingham via Unicode

On Wed, 22 Aug 2018 11:58:58 +0200
Philippe Verdy via Unicode  wrote:

> For now there's still no way to have variant sequences unless they are
> registered and standardized by Unicode but registration should be not
> needed (forbidden) for sequences containing PUV.

I believe this scheme is no worse than hack encodings that use Latin
character codes for other characters.  These schemes often work.
(Indeed, the currently best method of getting Tai Tham displayed as rich
text that I can find is to use a transliteration-type encoding and a
special font, though I can now get pretty close using the proper
character codes in the order laid down in the proposals.)

The major problems I can see with appropriating variation sequences
are:
(1) It might be restricted to base characters - I have no
experimental evidence on whether this would happen.  Fonts can happily
convert base characters to combining characters, though this works
best if Latin line-breaking rules take effect.

(2) The appropriated variation sequence might be assigned a meaning -
but this is no worse than the general ambiguity of PUA characters.

(3) Some base characters get special treatment.  For example, I had
to change my transliteration scheme because hyphen-minus is treated
specially by MS Edge - I was using it as a digraph disjunctor - and
so clusters were not being formed.  In this case, I would have come
unstuck as soon as line-wrapping started, so it was a bad choice anyway.

Or are there significant renderers that deliberately ignore variation
selectors in unregistered, unstandardised variation sequences?  I don't
recall any problems from when we were discussing variation
sequences for chess pieces.

For supplementing a script, it might be best to start at
VARIATION-SELECTOR-256, and work down if need be with specialist
characters.
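
As a rough illustration of the suggestion to start at VARIATION SELECTOR-256 and work down, here is a minimal Python sketch. The helper names variation_selector and private_variant are hypothetical, as is the choice of base character; only the code point ranges (VS1 to VS16 at U+FE00..U+FE0F, VS17 to VS256 at U+E0100..U+E01EF) come from the standard. Whether a given renderer does anything useful with such an unregistered sequence is exactly the open question raised above.

```python
import unicodedata


def variation_selector(n: int) -> str:
    """Return the variation selector VS-n, for 1 <= n <= 256."""
    if 1 <= n <= 16:
        return chr(0xFE00 + n - 1)       # VS1..VS16
    if 17 <= n <= 256:
        return chr(0xE0100 + n - 17)     # VS17..VS256
    raise ValueError("variation selectors are numbered 1..256")


def private_variant(base: str, index: int = 0) -> str:
    """Hypothetical helper: pair a base character with an unregistered
    selector, counting down from VS-256 (index 0 -> VS-256, 1 -> VS-255)."""
    return base + variation_selector(256 - index)


seq = private_variant("a")
print([f"U+{ord(c):04X}" for c in seq])
print(unicodedata.name(seq[1]))
```

Counting down from the top keeps such private sequences as far as possible from the low-numbered selectors that are more likely to be registered later.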

Richard.

Re: Private Use areas

2018-08-22 Thread Philippe Verdy via Unicode

Maybe this debate could come to an end if there were a way to encode "private
use variants", so that we could override an existing character that has the
correct properties by creating a custom variant, which would immediately
inherit the properties of the base character on which it is encoded.

But for now there are no private use variant codes (PUV). I think that a
small block of 16 codes (maybe even fewer) would be quite enough (given
that they would be used only in pairs after any standard character). They
could be used after any base character, possibly even after a combining
character (so the default combining class for these PUV should be 0).

For now there is still no way to have variant sequences unless they are
registered and standardized by Unicode, but registration should not be
needed (and should in fact be forbidden) for sequences containing a PUV.

I think there's a usage pattern for such schemes. The default (spacing)
glyph of a PUV could be a dotted circle with a single hex digit inside. The
PUV would itself be non-joining and bidi-neutral, and it would be used only
after a base character from which it would inherit the directionality (so the
glyph would appear automatically on the correct side). Actual fonts
implementing these PUV sequences would treat them as distinct unbreakable
entities mapped to their own abstract characters, and subject to common
ligation.

On Wed, 22 Aug 2018 at 04:58, Andrew Cunningham via Unicode <
unicode@unicode.org> wrote:

>
>
> On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode <
> unicode@unicode.org> wrote:
>
>> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:
>>
>>>
>>>
>> Best we can do is shout loudly at OpenType tables and hope to cram in
>> behavior (or at least appearance, which is more likely all we can get) that
>> vaguely resembles what we're after.  And that's not SO awful, given what
>> we're dealing with.
>>
>>>
>>>
> At the moment I am looking at implementing three unencoded Arabic
> characters in  the PUA.
>
> For the foreseeable future OpenType is a non-starter, so I will look at
> implementing them in Graphite tables in a font.
>
> Andrew
>
>
>
> --
> Andrew Cunningham
> lang.supp...@gmail.com
>
>
>
>

Re: Private Use areas

2018-08-21 Thread Andrew Cunningham via Unicode

On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode <
unicode@unicode.org> wrote:

> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:
>
>>
>>
> Best we can do is shout loudly at OpenType tables and hope to cram in
> behavior (or at least appearance, which is more likely all we can get) that
> vaguely resembles what we're after.  And that's not SO awful, given what
> we're dealing with.
>
>>
>>
At the moment I am looking at implementing three unencoded Arabic
characters in the PUA.

For the foreseeable future OpenType is a non-starter, so I will look at
implementing them in Graphite tables in a font.

Andrew



-- 
Andrew Cunningham
lang.supp...@gmail.com

Re: Private Use areas

2018-08-21 Thread Mark E. Shoulson via Unicode


On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:



> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:
>
>> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>>
>>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>>>
>>>> Is there a block of RTL PUA also?
>>>
>>> No.
>>
>> Perhaps there should be?
>
> This is a periodic suggestion that never goes anywhere--for good
> reason. (You can search the email archives and see that it keeps
> coming up.)
>
> Presuming that this question was asked in good faith...


Yeah, I know there has been talk about such things, and I also knew that 
whether or not there was an RTL block (which I did not remember for 
certain), there weren't going to be any *changes* in the PUA, and we 
were going to have to make do with what there was.  There's no way to 
anticipate all the possible properties people would want in the PUA, 
though I remember thinking it was probably wrong to make the PUA 
*strongly* LTR; I know there's a not-strongly flavor too.


Best we can do is shout loudly at OpenType tables and hope to cram in 
behavior (or at least appearance, which is more likely all we can get) 
that vaguely resembles what we're after.  And that's not SO awful, given 
what we're dealing with.




> As I see it, the only feasible way for people to get specialized
> behavior for PUA ranges involves first ceasing to assume that somehow
> they can jawbone the UTC into *standardizing* some ranges for some
> particular use or another. That simply isn't going to happen. People
> who assume this is somehow easy, and that the UTC are a bunch of
> boneheads who stand in the way of obvious solutions, do not -- I
> contend -- understand the complicated interplay of character
> properties, stability guarantees, and implementation behavior baked
> into system support libraries for the Unicode Standard.


The whole point of the PUA is that it *isn't* standardized (by the 
UTC).  It might have been nice to make some more varied choices of 
things that couldn't be left unspecified, but you're still going to wind 
up with "but there aren't any PUA codepoints that are JUST what I 
need!"  And, as said, it's too late now.


~mark

Re: Private Use areas

2018-08-21 Thread Rebecca Bettencourt via Unicode

On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode 
wrote:

> Ken Whistler wrote:
>
> > The way forward for folks who want to do this kind thing is:
> >
> > 1. Define a *protocol* for reliable interchange of custom character
> > property information about PUA code points.
>
> I've often thought that would be a great idea. You can't get to steps 2
> and 3 without step 1. I'd gladly participate in such a project.
>

As would I.

Re: Private Use areas

2018-08-21 Thread Doug Ewell via Unicode

Ken Whistler wrote:

> The way forward for folks who want to do this kind thing is: 
>
> 1. Define a *protocol* for reliable interchange of custom character
> property information about PUA code points. 

I've often thought that would be a great idea. You can't get to steps 2
and 3 without step 1. I'd gladly participate in such a project. 

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Private Use areas

2018-08-21 Thread Adam Borowski via Unicode

On Tue, Aug 21, 2018 at 11:03:41AM -0700, Ken Whistler via Unicode wrote:
> 
> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:
> > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
> > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
> > > > Is there a block of RTL PUA also?
> > > No.
> > Perhaps there should be?
> 
> This is a periodic suggestion that never goes anywhere--for good reason.
> (You can search the email archives and see that it keeps coming up.)
> 
> Presuming that this question was asked in good faith...

Oif, looks like mere months of inattentive lurking are not enough (the
thread I got pointed to was from 2011).  Apologies.

> > or perhaps by allocating a new range elsewhere.
> See:
> 
> https://www.unicode.org/policies/stability_policy.html
> 
> The General_Category property value Private_Use (Co) is immutable: the set
> of code points with that value will never change.
> 
> That guarantee has been in place since 1996, and is a rule that binds the
> UTC. So nope, sorry, no more PUA ranges.

Right.

> The way forward for folks who want to do this kind thing is:
> 
> 1. Define a *protocol* for reliable interchange of custom character property
> information about PUA code points.
[...]
> And if the goal for #3 is to get some *system* implementer to support the
> protocol in widespread software, then before starting any of #1, #2, or #3,
> you had better start instead with:
> 
> 0. Create a consortium (or other ongoing organization) with a 10-year time
> horizon and participation by at least one major software implementer, to
> define, publicize, and advocate for support of the protocol.

Heh, good point.  I wonder, perhaps a long-lived consortium tasked with
assigning properties to characters already exists?

So your answer _does_ provide a way to go: any PUA use that's no longer
private, or any problem someone has with character properties, should go
through official channels here instead of inventing one's own standard.

With my existing hats on (Debian fonts team member, and someone who messes
with terminals in general) I already have two such itches to scratch.
Thus, it sounds like I should do the research, prepare a write-up, and then
come back to harass you folks with inane questions.  Inventing new solutions
that work around instead of with you is a bad idea...

Meow!
-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

Re: Private Use areas

2018-08-21 Thread Richard Wordingham via Unicode

On Tue, 21 Aug 2018 11:03:41 -0700
Ken Whistler via Unicode  wrote:

> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:

> Really? Suppose someone wants to implement a bicameral script in PUA. 
> They would need case mappings for that, and how would those be
> "better represented in the font itself"? Or how about digits? Would
> numeric values for digits be "better represented in the font itself"?
> How about implementation of punctuation? Would segmentation
> properties and behavior be "better represented in the font itself"?

The least intrusive way of defining the meaning of a graphic (sensu
lato) character is by a font, in a very wide sense that would interpret
a Unicode code chart as a font.  Without a font in this sense, normal
characters in the PUA have no meaning.  If one insists on a font to
have an interpretation, then:

(1) PUA characters in plain text are meaningless - I believe that's
pretty much the position now.

(2) Different schemes can co-exist, even within the same formatted
document, by having different formats.  This is the case now.  It then
makes sense to store the properties in the font, which needs to be
saved with or in the document for the document to continue to make
sense. 

Casing and digits are luxuries.  Are we not told that searching should
be done by collation?  We then do not need case-folding!  Interpreting
the preferred representation of Roman numerals does not use Unicode
properties beyond the approximate principle of one character, one
codepoint. 

As to segmentation, my understanding was that there were no characters
available to indicate word boundaries in scriptio continua; the closest
one has is line-breaking suggestions.  If my memory serves me right,
SIL Graphite fonts can hold line-breaking information.

Richard.

Re: Private Use areas

2018-08-21 Thread Rebecca Bettencourt via Unicode

On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode <
unicode@unicode.org> wrote:

> I think PUA users should provide the
> properties of the characters used in a form analogical to the Unicode
> itself, and the software should be able to use this additional
> information.
>

I already provide this myself for my uses of the PUA as well as the CSUR
and any vendor-specific agreements I can find:

http://www.kreativekorp.com/charset/PUADATA/

Of course there is no way to get software to use this information. I have
entertained the idea of being able to embed this information into the font
itself as OpenType tables, e.g.:

PUAB -> Blocks.txt
PUAC -> CaseFolding.txt
PUAW -> EastAsianWidth.txt
PUAL -> LineBreak.txt
PUAD -> UnicodeData.txt

I've actually invented table names for the majority of UCD files, but those
are probably the most relevant. The table names for the more obscure files
get rather... creative, e.g.:

PUA[ -> BidiBrackets.txt
PUA] -> BidiMirroring.txt

That alone may get some people to think twice about this idea. :P
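
To show what consuming such data might look like, here is a minimal Python sketch that parses records in the semicolon-separated layout of UnicodeData.txt. The sample record and the character name in it are invented for illustration and are not taken from the PUADATA files; only the field order (code point, name, General_Category, Canonical_Combining_Class, Bidi_Class) follows UnicodeData.txt. The names CharRecord and parse_unicode_data are hypothetical.

```python
from typing import NamedTuple


class CharRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str


def parse_unicode_data(text: str) -> dict[int, CharRecord]:
    """Parse UnicodeData.txt-style lines into a table keyed by code point."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(";")
        cp = int(fields[0], 16)
        records[cp] = CharRecord(cp, fields[1], fields[2],
                                 int(fields[3]), fields[4])
    return records


# Invented example record: a private-use letter with Bidi_Class=R.
SAMPLE = "E000;EXAMPLE PRIVATE LETTER A;Lo;0;R;;;;;N;;;;;\n"
table = parse_unicode_data(SAMPLE)
print(table[0xE000])
```

The same table could in principle be filled from a font table or from a separate file; the sketch takes no position on where the data lives.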

Re: Private Use areas

2018-08-21 Thread Ken Whistler via Unicode



On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:

> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>
>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>>
>>> Is there a block of RTL PUA also?
>>
>> No.
>
> Perhaps there should be?


This is a periodic suggestion that never goes anywhere--for good reason. 
(You can search the email archives and see that it keeps coming up.)


Presuming that this question was asked in good faith...



> What about designating a part of the PUA to have a specific property?


The problem with that is that assigning *any* non-default property to 
any PUA code point would break existing implementations' assumptions 
about PUA character properties and potentially create havoc with 
existing use.



> Only certain properties matter enough:


That is an un-demonstrated assertion that I don't think you have thought 
through sufficiently.



> * wide
> * RTL


RTL is not some binary counterpart of LTR. There are 23 values of 
Bidi_Class, and anyone who wanted to implement a right-to-left script in 
PUA might well have to make use of multiple values of Bidi_Class. Also, 
there are two major types of strong right-to-leftness: Bidi_Class=R and 
Bidi_Class=AL. Should a "RTL PUA" zone favor Arabic type behavior or 
non-Arabic type behavior?



> * combining


Also not a binary switch. Canonical_Combining_Class is a numeric value, 
and any value but ccc=0 for a PUA character would break normalization. 
Then for the General_Category, there are three types of "marks" that 
count as combining: gc=Mn, gc=Mc, gc=Me. Which of those would be favored 
in any PUA assignment?
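
For readers who want to see the defaults being discussed, they can be inspected with Python's standard unicodedata module, which reflects whatever version of the Unicode Character Database is bundled with the interpreter. This is only a quick look at the default values for one code point from each private-use range; it is not a way of overriding them.

```python
import unicodedata

# First BMP private-use code point, first in plane 15, last in plane 16.
for cp in (0xE000, 0xF0000, 0x10FFFD):
    ch = chr(cp)
    print(f"U+{cp:04X}",
          unicodedata.category(ch),       # General_Category
          unicodedata.bidirectional(ch),  # Bidi_Class
          unicodedata.combining(ch))      # Canonical_Combining_Class
```

Each line should show Co for the General_Category, a strong left-to-right Bidi_Class (L), and a combining class of 0, which is the baseline that existing implementations assume.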



> as most others are better represented in the font itself.


Really? Suppose someone wants to implement a bicameral script in PUA. 
They would need case mappings for that, and how would those be "better 
represented in the font itself"? Or how about digits? Would numeric 
values for digits be "better represented in the font itself"? How about 
implementation of punctuation? Would segmentation properties and 
behavior be "better represented in the font itself"?
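
A small Python sketch of the point about case mappings: built-in case conversion leaves private-use characters unchanged, so any case pairing has to be supplied by the application rather than by the font. The layout below (lowercase letters at U+E000..U+E019, uppercase at U+E020..U+E039) is a made-up assignment for illustration only, and pua_upper and pua_lower are hypothetical helper names.

```python
# Hypothetical private-use assignment for a 26-letter bicameral script.
LOWER_BASE, UPPER_BASE, LETTERS = 0xE000, 0xE020, 26

TO_UPPER = {LOWER_BASE + i: UPPER_BASE + i for i in range(LETTERS)}
TO_LOWER = {upper: lower for lower, upper in TO_UPPER.items()}


def pua_upper(text: str) -> str:
    """Standard uppercasing, then the private mapping on top."""
    return text.upper().translate(TO_UPPER)


def pua_lower(text: str) -> str:
    """Standard lowercasing, then the private mapping on top."""
    return text.lower().translate(TO_LOWER)


sample = "ab\uE000\uE001"
# The built-in method changes the Latin letters but not the PUA ones.
print(sample.upper() == "AB\uE000\uE001")
# The application-level mapping handles both.
print(pua_upper(sample) == "AB\uE020\uE021")
```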




> This could be done either by parceling one of existing PUA ranges: planes 15
> and 16 are virtually unused thus any damage would be negligible;


That is simply an assertion -- and not the kind of assertion that the 
UTC tends to accept on spec. I rather suspect that there are multiple 
participants on this email list, for example, who *do* have 
implementations making extensive use of Planes 15/16 PUA code points for 
one thing or another.



> or perhaps
> by allocating a new range elsewhere.

See:

https://www.unicode.org/policies/stability_policy.html

The General_Category property value Private_Use (Co) is immutable: the 
set of code points with that value will never change.


That guarantee has been in place since 1996, and is a rule that binds 
the UTC. So nope, sorry, no more PUA ranges.
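
Because the set is fixed, a membership test can simply hard-code it. The sketch below uses the three private-use ranges as defined in the Unicode Standard (they are not spelled out in this message): U+E000..U+F8FF in the BMP, plus U+F0000..U+FFFFD and U+100000..U+10FFFD in planes 15 and 16. The function name is_private_use is just an illustrative choice.

```python
# The three ranges with General_Category=Co (inclusive bounds).
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch: str) -> bool:
    """True if the single character ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
print(total)  # 6400 + 65534 + 65534 = 137468
```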

> Meow!


Grrr! ;-)

As I see it, the only feasible way for people to get specialized 
behavior for PUA ranges involves first ceasing to assume that somehow 
they can jawbone the UTC into *standardizing* some ranges for some 
particular use or another. That simply isn't going to happen. People who 
assume this is somehow easy, and that the UTC are a bunch of boneheads 
who stand in the way of obvious solutions, do not -- I contend -- 
understand the complicated interplay of character properties, stability 
guarantees, and implementation behavior baked into system support 
libraries for the Unicode Standard.


The way forward for folks who want to do this kind of thing is:

1. Define a *protocol* for reliable interchange of custom character 
property information about PUA code points.


2. Convince more than one party to actually *use* that protocol to 
define sets of interchangeable character property definitions.


3. Convince at least one implementer to support that protocol to create 
some relevant interchangeable *behavior* for those PUA characters.


And if the goal for #3 is to get some *system* implementer to support 
the protocol in widespread software, then before starting any of #1, #2, 
or #3, you had better start instead with:


0. Create a consortium (or other ongoing organization) with a 10-year 
time horizon and participation by at least one major software 
implementer, to define, publicize, and advocate for support of the 
protocol. (And if you expect a major software implementer to 
participate, you might need to make sure you have a business case 
defined that would warrant such a 10-year effort!)
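
Purely as a sketch of what step 1 could look like, and not as a description of any existing protocol, the Python below reads a small JSON document of per-code-point property overrides and consults it before falling back to the standard data. The JSON layout, the key names ("agreement", "characters", "gc", "bc", "ccc") and the function names are all assumptions made up for this example.

```python
import json
import unicodedata

# Hypothetical interchange document for one private-use agreement.
AGREEMENT_JSON = """
{
  "agreement": "example-private-use-agreement",
  "characters": {
    "E000": {"gc": "Lo", "bc": "R",   "ccc": 0},
    "E001": {"gc": "Mn", "bc": "NSM", "ccc": 0}
  }
}
"""


def load_agreement(text: str) -> dict[int, dict]:
    """Turn the JSON document into a table keyed by integer code point."""
    data = json.loads(text)
    return {int(cp, 16): props for cp, props in data["characters"].items()}


def bidi_class(ch: str, overrides: dict[int, dict]) -> str:
    """Bidi_Class from the agreement if present, else the UCD default."""
    props = overrides.get(ord(ch))
    if props is not None and "bc" in props:
        return props["bc"]
    return unicodedata.bidirectional(ch)


overrides = load_agreement(AGREEMENT_JSON)
print(bidi_class("\uE000", overrides))  # taken from the agreement
print(bidi_class("A", overrides))       # taken from the standard data
```

The sample deliberately keeps ccc at 0 for every entry, in line with the normalization concern raised earlier in this message. Steps 2 and 3 are the hard part: the sketch only shows that the data format itself is not the obstacle.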


--Ken

Re: Private Use areas

2018-08-21 Thread Steven R. Loomis via Unicode

2011 Thread:
https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0124.html

Please read in particular these two:

- https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0174.html
- https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0212.html

(tl;dr: 1. the PUA set is fixed, 2. being private, the properties may be
overridable by conformant implementations.)

On Mon, Aug 20, 2018 at 5:17 PM Ken Whistler via Unicode <
unicode@unicode.org> wrote:

>
>
> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
> > Is there a block of RTL PUA also?
>
> No.
>
> --Ken
>

Re: Private Use areas

2018-08-21 Thread Janusz S. Bień via Unicode

On Tue, Aug 21 2018 at 16:56 +0200, unicode@unicode.org writes:
> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>> > Is there a block of RTL PUA also?
>> 
>> No.
>
> Perhaps there should be?
>
> What about designating a part of the PUA to have a specific property?  Only
> certain properties matter enough:
> * wide
> * RTL
> * combining
> as most others are better represented in the font itself.
>
> This could be done either by parceling one of existing PUA ranges: planes 15
> and 16 are virtually unused thus any damage would be negligible; or perhaps
> by allocating a new range elsewhere.

I don't think it's a good idea. I think PUA users should provide the
properties of the characters used in a form analogical to the Unicode
itself, and the software should be able to use this additional
information.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-21 Thread Adam Borowski via Unicode

On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
> > Is there a block of RTL PUA also?
> 
> No.

Perhaps there should be?

What about designating a part of the PUA to have a specific property?  Only
certain properties matter enough:
* wide
* RTL
* combining
as most others are better represented in the font itself.

This could be done either by parceling one of existing PUA ranges: planes 15
and 16 are virtually unused thus any damage would be negligible; or perhaps
by allocating a new range elsewhere.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]

Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-21 Thread Richard Wordingham via Unicode

On Tue, 21 Aug 2018 08:53:18 +0800
via Unicode  wrote:

> On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote:

> > Still, maybe it
> > doesn't really matter much: your special-purpose font can treat any
> > codepoint any way it likes, right?

> Not all properties come from the font. For example a Zhuang character 
> PUA font, which supplements CJK ideographs, does not rotate
> characters 90 degrees, when change from RTL to vertical display of
> text.

Isn't that supposed to be treated by an OpenType feature such as
'vert'?  Or does the rendering stack get in the way?

However, one might need reflowing text to be about 40% WJ.

Richard.

Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread William_J_G Overington via Unicode

Doug Ewell wrote:

> Yes, you run the risk of someone else's PUA implementation colliding with 
> yours. That's why you create a Private Use Agreement, and make sure it's 
> prominently available to people who want to use your solution. It's not like 
> there are hundreds of PUA schemes anyway.

Yes, that is generally true. However, a situation where that does not matter is 
if one just wishes to include some specially designed glyphs of one's own 
design in a PDF (Portable Document Format) document and one uses a Private Use 
Area encoding simply so that the PDF document with a subset of the glyphs of 
the font embedded in the PDF can be produced using a desktop publishing 
program. That is, one makes the font, one installs the font, one uses the font 
within the desktop publishing package.

I have used that technique and the technique worked very well as the Windows 
operating system treated my font the same way as it did other fonts. With the 
desktop publishing package that I am using (Serif PagePlus version X7) that is 
only using the plane zero Private Use Area.

Thus the providing of information to anyone reading the PDF document is as 
displayed glyphs rather than as code points.

The availability of the Private Use Area allowed me to make such code point 
assignments for the glyphs that I had designed and then use those code points 
in a manner entirely compatible with The Unicode Standard.

William Overington

Monday 20 August 2018

Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread via Unicode


On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote:

> On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote:
>
>>> ... some people who would call a PUA solution either batty or
>>> crazy.
>>
>> I don't think it is either batty or crazy. People can certainly use
>> the PUA to interchange text (assuming that they have downloaded
>> fonts and keyboards or some other input method beforehand), and
>> it can definitely serve as a proof of concept. Plain symbols — with
>> no interactions between them (like changing shape with complex
>> scripts), no combining/non-spacing marks, no case mappings, and so
>> on — are the best possible case for PUA.
>
> It is kind of a bummer, though, that you can't experiment (easily?  or
> at all?) in the PUA with scripts that have complex behavior, or even
> not-so-complex behavior like accents & combining marks, or RTL
> direction (here, also, am I speaking true?  Is there a block of RTL
> PUA also?  I guess there's always RLO, but meh.)  Still, maybe it
> doesn't really matter much: your special-purpose font can treat any
> codepoint any way it likes, right?



Not all properties come from the font. For example, a Zhuang character
PUA font, which supplements CJK ideographs, does not rotate characters
90 degrees when changing from RTL to vertical display of text.


John Knightley


~mark

Re: Private Use areas

2018-08-20 Thread Ken Whistler via Unicode





On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
> Is there a block of RTL PUA also?


No.

--Ken

Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread Mark E. Shoulson via Unicode

On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote:

> ... some people who would call a PUA solution either batty
> or crazy.

I don't think it is either batty or crazy. People can certainly use
the PUA to interchange text (assuming that they have downloaded fonts
and keyboards or some other input method beforehand), and it can
definitely serve as a proof of concept. Plain symbols — with no
interactions between them (like changing shape with complex scripts),
no combining/non-spacing marks, no case mappings, and so on — are the
best possible case for PUA.

It is kind of a bummer, though, that you can't experiment (easily? or at 
all?) in the PUA with scripts that have complex behavior, or even 
not-so-complex behavior like accents & combining marks, or RTL direction 
(here, also, am I speaking true?  Is there a block of RTL PUA also?  I 
guess there's always RLO, but meh.)  Still, maybe it doesn't really 
matter much: your special-purpose font can treat any codepoint any way 
it likes, right?
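
For what it's worth, the RLO workaround amounts to something like the following Python sketch: wrap the private-use run in U+202E RIGHT-TO-LEFT OVERRIDE and close it with U+202C POP DIRECTIONAL FORMATTING. The helper name force_rtl and the sample code points are only illustrative, and the sketch says nothing about how well any given renderer honours the override.

```python
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def force_rtl(text: str) -> str:
    """Wrap text so that a bidi-aware renderer lays it out right to left."""
    return f"{RLO}{text}{PDF}"


word = "\uE000\uE001\uE002"   # three arbitrary private-use characters
wrapped = force_rtl(word)
print([f"U+{ord(c):04X}" for c in wrapped])
```

The stored order of the characters is unchanged; only the display direction is overridden, which is part of why this feels like a hack rather than real RTL support.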

~mark

RE: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread Doug Ewell via Unicode

Mark Davis wrote:

> The only caution I would give is that people shouldn't expect general
> purpose software to do anything with PUA text that depends on
> character properties.

Very true, and a good point. People with creative PUA ideas do sometimes
expect this to magically work.

I have anecdotes, if anyone is interested off-list.

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread Mark Davis ☕️ via Unicode

> ... some people who would call a PUA solution either batty
> or crazy.

I don't think it is either batty or crazy. People can certainly use the PUA
to interchange text (assuming that they have downloaded fonts and keyboards
or some other input method beforehand), and it can definitely serve as a
proof of concept. Plain symbols — with no interactions between them (like
changing shape with complex scripts), no combining/non-spacing marks, no
case mappings, and so on — are the best possible case for PUA.

The only caution I would give is that people shouldn't expect general
purpose software to do anything with PUA text that depends on character
properties.

Mark

On Mon, Aug 20, 2018 at 8:52 PM Doug Ewell via Unicode 
wrote:

> James Kass wrote:
>
> > As a caveat, some Unicode cognoscenti express disdain for the PUA, so
> > there would be some people who would call a PUA solution either batty
> > or crazy.
>
> I'm concerned that the constant "health warnings" about avoiding the PUA
> may have scared everyone away from this primary use case.
>
> Yes, you run the risk of someone else's PUA implementation colliding
> with yours. That's why you create a Private Use Agreement, and make sure
> it's prominently available to people who want to use your solution. It's
> not like there are hundreds of PUA schemes anyway.
>
> Yes, you will have to convert any existing data if the solution ever
> gets encoded in Unicode. That happened for Deseret and Shavian, and
> maybe others, and the sky didn't fall.
>
> People forget that it was the PUA in Shift-JIS, by Japanese mobile
> providers, that provided the platform for emoji to take off to such an
> extent that... well, we know the rest. If private-use is good enough for
> a legacy encoding, it ought to be good enough for Unicode.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>

Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread Doug Ewell via Unicode

James Kass wrote:

> As a caveat, some Unicode cognoscenti express disdain for the PUA, so
> there would be some people who would call a PUA solution either batty
> or crazy.

I'm concerned that the constant "health warnings" about avoiding the PUA
may have scared everyone away from this primary use case.

Yes, you run the risk of someone else's PUA implementation colliding
with yours. That's why you create a Private Use Agreement, and make sure
it's prominently available to people who want to use your solution. It's
not like there are hundreds of PUA schemes anyway.

Yes, you will have to convert any existing data if the solution ever
gets encoded in Unicode. That happened for Deseret and Shavian, and
maybe others, and the sky didn't fall.
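
The conversion step is usually mechanical. As a hedged sketch in Python: assume, purely for illustration, that a private agreement had placed the 48 Shavian letters at U+E000 onward in the same order as the encoded Shavian block (U+10450..U+1047F). The private-use starting point and the order-preserving layout are assumptions made for this example, not a description of any actual agreement.

```python
PUA_BASE = 0xE000        # hypothetical private-use starting point
SHAVIAN_BASE = 0x10450   # start of the encoded Shavian block
SHAVIAN_COUNT = 48       # U+10450..U+1047F

# One-to-one table from private-use code points to encoded ones.
MIGRATION = {PUA_BASE + i: SHAVIAN_BASE + i for i in range(SHAVIAN_COUNT)}


def migrate(text: str) -> str:
    """Replace private-use code points with their encoded equivalents."""
    return text.translate(MIGRATION)


old = "x\uE000\uE02F"
new = migrate(old)
print([f"U+{ord(c):04X}" for c in new])
```

Characters outside the table pass through untouched, so the same function can be run safely over mixed text.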

People forget that it was the PUA in Shift-JIS, by Japanese mobile
providers, that provided the platform for emoji to take off to such an
extent that... well, we know the rest. If private-use is good enough for
a legacy encoding, it ought to be good enough for Unicode.

--
Doug Ewell | Thornton, CO, US | ewellic.org
