RE: PUA (BMP) planned characters HTML tables
On August 11, I replied to Robert Wheelock:

>> I remember that a website that has tables for certain PUA precomposed
>> accented characters that aren’t yet in Unicode (things like:
>> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
>> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).
>
> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.

I missed the possible significance of the Latvian comma below vs. Marshallese cedilla, which captured most of the ensuing discussion and morphed into a discussion about different user communities and group identity.

I'd like to restate, since I think the point may have been lost, that for the OTHER characters Robert mentioned:

> H/h-acute, capital T-dieresis, capital H-underbar, acute accented
> Cyrillic vowels, Cyrillic ER/er-caron, ...

there does not appear to be any conflicting usage between different user communities, and no particular difficulty in rendering or otherwise processing these as combining sequences, using up-to-date fonts and rendering engines. I suppose Philippe's example of Võro might factor into whether different groups prefer different appearances for h́, but otherwise these user-perceived characters seem to be non-controversial.

So to reiterate, these characters appear vanishingly unlikely to be atomically encoded, "yet" or ever, for good reason.

--
Doug Ewell | Thornton, CO, US | ewellic.org
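The principle Doug restates is easy to see with Python's `unicodedata` (a small sketch, not part of the original mail): a combining sequence with no precomposed counterpart survives NFC unchanged, whereas a pair that was encoded precomposed before the policy took hold composes.

```python
import unicodedata

# "h with acute" has no precomposed code point, so NFC leaves the
# combining sequence as two code points.
h_acute = "h\u0301"  # h + COMBINING ACUTE ACCENT
assert unicodedata.normalize("NFC", h_acute) == h_acute
assert len(unicodedata.normalize("NFC", h_acute)) == 2

# By contrast, a pair that *was* encoded precomposed (pre-dating the
# policy) composes under NFC:
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"  # é
```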
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 7:49 PM, James Kass via Unicode wrote:

> On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:
>> Empirically, it has been observed that some distinctions that are
>> claimed by users, standards developers or implementers were de-facto
>> not honored by type developers (and users selecting fonts) as long as
>> the native text doesn't contain minimal pairs.
>
> Quickly checked a couple of older on-line PDFs and both used the comma
> below unabashedly. Quoting from this page (which appears to be more
> modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm
>
> "Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon
> booj jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo
> iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa
> in ṃōṃkaj kar ..."
>
> It seems that users are happy to employ a dot below in lieu of either
> a comma or cedilla. This newer web page is from a book published in
> 1978. There's a scan of the original book cover. Although the book
> title is all caps hand printing it appears that commas were used. The
> Marshallese orthography which uses commas/cedillas is fairly recent,
> replacing an older scheme devised by missionaries. Perhaps the actual
> users have already resolved this dilemma by simply using dots below.

That may be the case for Marshallese. But wouldn't surprise me.

My comments were based on a different case of the same kinds of diacritics below (other languages), and at the time we consulted typographic samples, including newsprint, that were using pre-Unicode technologies. In that sense a cleaner case, because there was no influence by what Unicode did or didn't do.

Now, having said that, I do get it that some materials, like text books, online class materials etc. need to be prepared / printed using the normative style for the given orthography. But it's a far cry from claiming that all text in a given language is invariably done only one way.

A./
Re: PUA (BMP) planned characters HTML tables
On Wed, 14 Aug 2019 23:32:37 + James Kass via Unicode wrote:

> U+0149 has a compatibility decomposition. It has been deprecated and
> is not rendered identically on my system.
> 'n ʼn
> ( ’n )

Compatibility decompositions are quite a mix, but are generally expected to render differently. If they were expected to render the same, they would normally be canonical decompositions. U+0149 and its decomposition naturally render very differently with a monospaced font. The same goes for the Roman numerals that the Far East gave us.

Richard.
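Richard's distinction can be checked directly with Python's `unicodedata` (an illustrative sketch): canonical normalization leaves U+0149 alone, while compatibility normalization decomposes it.

```python
import unicodedata

n_apostrophe = "\u0149"  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

# U+0149 has only a *compatibility* decomposition, so canonical
# normalization (NFD/NFC) leaves it unchanged...
assert unicodedata.normalize("NFD", n_apostrophe) == n_apostrophe

# ...while compatibility normalization (NFKD/NFKC) splits it into
# U+02BC MODIFIER LETTER APOSTROPHE + n.
assert unicodedata.normalize("NFKD", n_apostrophe) == "\u02bcn"
```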
Re: PUA (BMP) planned characters HTML tables
On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:

> Empirically, it has been observed that some distinctions that are
> claimed by users, standards developers or implementers were de-facto
> not honored by type developers (and users selecting fonts) as long as
> the native text doesn't contain minimal pairs.

Quickly checked a couple of older on-line PDFs and both used the comma below unabashedly. Quoting from this page (which appears to be more modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm

"Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon booj jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa in ṃōṃkaj kar ..."

It seems that users are happy to employ a dot below in lieu of either a comma or cedilla. This newer web page is from a book published in 1978. There's a scan of the original book cover. Although the book title is all caps hand printing, it appears that commas were used. The Marshallese orthography which uses commas/cedillas is fairly recent, replacing an older scheme devised by missionaries. Perhaps the actual users have already resolved this dilemma by simply using dots below.
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 2:05 AM, James Kass via Unicode wrote:

> This presumes that the premise of user communities feeling strongly
> about the unacceptable aspect of the variants is valid. Since it has
> been reported and nothing seems to be happening, perhaps the casual
> users aren't terribly concerned. It's also possible that the various
> user communities have already set up their systems to handle things
> acceptably by installing appropriate fonts.

This is always a good question.

Empirically, it has been observed that some distinctions that are claimed by users, standards developers or implementers were de-facto not honored by type developers (and users selecting fonts) as long as the native text doesn't contain minimal pairs.

For example, some Latin fonts drop the dot on the lowercase i for stylistic reasons (or designers use dotless i in highly designed texts, like book covers, logos, etc.). That's usually not a problem for ordinary users for monolingual texts in, say, English; even though everyone agrees that the lowercase i is normally dotted, the absence isn't noticed by most, and is tolerated even by those who do notice it.

However, as soon as a user community sees a particular variant as signalling their group identity, they will be very vocal about it - even, interestingly enough, in cases where de-facto use (e.g. via font selection, and not forced by implementation defaults) doesn't match that preference. As I said, we've seen this in the past for some features in some languages.

Now, which features become strongly identified with group identity is something that is subject to change over time; this makes it impossible to guarantee both absolute stability and perfect compatibility, especially if a combining mark that is used in decompositions needs to be disunified because the range of shapes changes from being stylistic to normative.
Before Unicode, with character sets limited to local use, you couldn't create minimal pairs (except if the variation was part of your language, like Turkish i with/without dot). So, if a font deviated and pushed the stylistic envelope, the non-preferred form, if used, would still necessarily refer to the local character; there was no way it could mean anything else.

With Unicode, that's changed, and instead of user communities treating this as a typographic issue (exclusive use of a preferred font), which is decentralized to document authors (and perhaps font vendors), it becomes a character coding issue that is highly visible and centralized. That in turn can lead to the issue becoming politicized, not unlike some grammar issues, where the supposedly "correct" form is far from universally agreed on in practice.

A./
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 4:32 PM, James Kass via Unicode wrote:

> If a character gets deprecated, can its decomposition type be changed
> from canonical to compatibility?

Simple answer: No.

--Ken
Re: PUA (BMP) planned characters HTML tables
On 2019-08-14 7:50 PM, Richard Wordingham via Unicode wrote:

> I think you'd also have to change the reference glyph of LATIN LOWER
> CASE I WITH HEART to show a heart. That's valid because the UCD
> trumps the code charts, and no Unicode-compliant process may
> deliberately render differently from LATIN LOWER CASE I WITH HEART.

U+0149 has a compatibility decomposition. It has been deprecated and is not rendered identically on my system.

'n ʼn
( ’n )

If a character gets deprecated, can its decomposition type be changed from canonical to compatibility?
Re: PUA (BMP) planned characters HTML tables
On Wed, 14 Aug 2019 09:05:02 + James Kass via Unicode wrote:

> The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's
> only in there because of legacy. Its presence guarantees
> round-tripping with legacy data but it isn't needed for modern data
> or display. Urge Groups One and Two to encode their data with the
> desired combiner and educate font engine developers about the
> deprecation. As the rendering engines get updated, the system
> substitution of the wrongly named precomposed glyph will go away.

I think you'd also have to change the reference glyph of LATIN LOWER CASE I WITH HEART to show a heart. That's valid because the UCD trumps the code charts, and no Unicode-compliant process may deliberately render differently from LATIN LOWER CASE I WITH HEART.

Richard.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-12 8:30 AM, Andrew West wrote:

> This issue was discussed at WG2 in 2013
> (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf),
> when there was a recommendation to encode precomposed letters L and N
> with cedilla *with no decomposition*, but that solution does not seem
> to have been taken up by the UTC.

Group One dots their lowercase "i" letters with little flowers and Group Two dots theirs with little hearts. Group Two considers flowers unacceptable and Group One rejects hearts.

Because of legacy character sets there's a precomposed character encoded called "LATIN LOWER CASE I WITH HEART", but it was misnamed and is normally drawn with a flower instead. Group Two tries to encode "LATIN LOWER CASE I" plus "COMBINING HEART" to get the thing to display properly. But because there's a decomposition involved, the font engine substitutes the glyph mapped to "LATIN LOWER CASE I WITH HEART" in the display for the string "LATIN LOWER CASE I" plus "COMBINING HEART". This thwarts Group Two because they still get the flower.

The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's only in there because of legacy. Its presence guarantees round-tripping with legacy data but it isn't needed for modern data or display. Urge Groups One and Two to encode their data with the desired combiner and educate font engine developers about the deprecation. As the rendering engines get updated, the system substitution of the wrongly named precomposed glyph will go away.

This presumes that the premise of user communities feeling strongly about the unacceptable aspect of the variants is valid. Since it has been reported and nothing seems to be happening, perhaps the casual users aren't terribly concerned. It's also possible that the various user communities have already set up their systems to handle things acceptably by installing appropriate fonts.
Re: PUA (BMP) planned characters HTML tables
On Mon, 12 Aug 2019 at 02:27, James Kass via Unicode wrote:
>
> On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:
> > If you are thinking of these as potential future additions to the
> > standard, keep in mind that accented letters that can already be
> > represented by a combination of letter + accent will not ever be
> > encoded. This is one of the longest-standing principles Unicode has.

People seem to be ignoring the fact that Marshallese and Latvian both use L and N with cedilla, but with completely different glyph shapes:

> In January 2013, the Unicode Technical Committee discussed issues for
> the representation of Marshallese orthography. In particular,
> Marshallese uses the Latin script and requires the letters l, m, n,
> and o with cedilla. Latvian orthography uses the Latin script and
> requires the letters g, k, l, n, and r with comma below. For
> Marshallese, it is unacceptable to display cedillas as commas below.
> Conversely, for Latvian, it is unacceptable to display commas below
> as cedillas.

However, as fonts have been following Latvian practice for these letters (cedilla is displayed as a comma below) since before Unicode, Marshallese users cannot get their desired outcome using standard Unicode combining diacritical marks unless they apply a font specially designed for Marshallese -- which you can never guarantee if you are writing an email or posting on Twitter, etc.

This issue was discussed at WG2 in 2013 (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf), when there was a recommendation to encode precomposed letters L and N with cedilla *with no decomposition*, but that solution does not seem to have been taken up by the UTC.

Andrew
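The encoding asymmetry Andrew describes is visible in the UCD itself; a quick check with Python's `unicodedata` (an illustrative sketch): the Latvian letters canonically decompose to U+0327 COMBINING CEDILLA even though Latvian renders the mark as a comma below, while the Marshallese letters were never encoded precomposed.

```python
import unicodedata

# Latvian ļ and ņ are encoded precomposed and canonically decompose to
# base letter + U+0327 COMBINING CEDILLA:
assert unicodedata.normalize("NFD", "\u013C") == "l\u0327"  # ļ
assert unicodedata.normalize("NFD", "\u0146") == "n\u0327"  # ņ

# Marshallese m̧ and o̧ have no precomposed code points, so the
# combining sequence is the only encoding, and NFC leaves it as-is:
assert unicodedata.normalize("NFC", "m\u0327") == "m\u0327"
assert unicodedata.normalize("NFC", "o\u0327") == "o\u0327"
```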
Re: PUA (BMP) planned characters HTML tables
On Mon, 12 Aug 2019 01:21:42 + James Kass via Unicode wrote:

> There was a time when populating the PUA with precomposed glyphs was
> necessary for printing or display, but that time has passed.

There is still the issue that in pure X one can't put sequences of characters on a key; if the application doesn't invoke an input method one is stuck. Useful 20-year-old proprietary code may be totally unable to use modern font capabilities. Don't forget the Cobol Y10k joke.

On Ubuntu at least, there was a period when Emacs couldn't access X-based input methods from an English locale. The work-around: use a Japanese locale plus the vanilla lack of internationalisation in the interface, or Emacs's very convenient alternative keyboard capability for text input as opposed to commands. The bug turned out to be in the definition of the locales, i.e. in privileged data beyond the purview of Emacs.

As to the need for the PUA, writing fonts to cope with Tai Tham rendering engines is not easy, and it's no surprise that the PUA is used on line for a newspaper that uses the Tai Tham script. The USE is too user-hostile for it to have helped if it had been available earlier. (It just ignored the regular expression published in 2007; it's in L2/07-007R in the UTC document register, ISO/IEC JTC1/SC2/WG2/N3207R on ISO land.) Indeed, perhaps I should be researching the PUA encoding for Tai Tham. (My Tai Tham font Da Lekh started as proof of principle, for there is already an unpleasant amount of glyph sequence changing, some style-dependent. I couldn't see how to get rendering engine support even when it might be added. I was pleasantly surprised at how far from impossible Tai Tham layout was until the USE came along and made everything harder. I now have to work out which glyph instances have already been Indicly rearranged when I repair the clustering.)

Oh, and I seem to need some PUA codepoints for vowels that get stranded when line-breaks occur between the columns of an akshara.
The proposals show this phenomenon in old(?) Pali text. Or is there any chance of getting them encoded?

Richard.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:

> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.

Good point. There was a time when populating the PUA with precomposed glyphs was necessary for printing or display, but that time has passed. Hopefully anyone seeking charts is transcoding older data into proper Unicode.

This can be illustrated with the Marshallese combos mentioned earlier.

PUA:
Standard: ĻļM̧m̧ŅņO̧o̧

Well, that didn't work out as well as expected. But the standard Unicode is supported (more or less) by some of the core fonts installed here. Nothing installed here displays anything useful for the PUA characters. A decent OpenType font designed with Marshallese in mind should work just fine with the combiners.

The fact is that the standard characters will survive and can be universally exchanged. And there's plenty of web page charts showing the standard characters.
RE: PUA (BMP) planned characters HTML tables
Robert Wheelock wrote:

> I remember that a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).

If you are thinking of these as potential future additions to the standard, keep in mind that accented letters that can already be represented by a combination of letter + accent will not ever be encoded. This is one of the longest-standing principles Unicode has.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 4:07 AM, Robert Wheelock via Unicode wrote:

> Hello! I remember that a website that has tables for certain PUA
> precomposed accented characters that aren’t yet in Unicode (things
> like: Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron,
> ...). Where was it at?! I still want to get the information. Thank
> You!

It sounds familiar but I can't place it. I tried the SIL pages first, as did Richard Wordingham apparently.

https://blogfonts.com/dehuti.font

This font has material in the PUA including Marshallese glyphs with cedillas: L (E382 & E394), M (E3A6 & E3BB), N (E3CE & E3DE), O (E429 & E465). These appear to be PUA characters which the font developer has mapped in addition to the SIL PUA mappings.
Re: PUA (BMP) planned characters HTML tables
On Sun, 11 Aug 2019 00:07:05 -0400 Robert Wheelock via Unicode wrote:

> I remember that a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron,
> ...). Where was it at?! I still want to get the information. Thank
> You!

You may mean https://www.eki.ee/letter. Once there, you'll want to make a query by Unicode range, e.g. e000-f8ff. It doesn't seem to refer to the relevant agreement. You could start hunting for agreements at https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA

Most of the characters you mention are scheduled to be assigned their own codepoint on the Greek kalends. They are precluded by policy because they would need to be composition exclusions to avoid making text in NFC cease to be in NFC.

I first thought of the SIL PUA at https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi=PUA_home , but they knew better than to include most of them.

Richard.
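Richard's point about NFC stability can be made concrete with Python's `unicodedata` (a sketch; the future precomposed character is of course hypothetical). Text like m + COMBINING CEDILLA is in NFC today; if a precomposed m-cedilla were later encoded with a canonical decomposition and not composition-excluded, NFC would start composing the pair, and previously normalized text would silently cease to be in NFC.

```python
import unicodedata

# Marshallese m + COMBINING CEDILLA has no precomposed form, so the
# two-code-point sequence is itself already in NFC:
seq = "m\u0327"
assert unicodedata.is_normalized("NFC", seq)
assert unicodedata.normalize("NFC", seq) == seq

# Contrast with a pair that *does* have a precomposed form: the
# decomposed sequence is NOT in NFC, because NFC composes it.
assert not unicodedata.is_normalized("NFC", "e\u0301")
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```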
RE: PUA (BMP) planned characters HTML tables
Hello! I remember that a website that has tables for certain PUA precomposed accented characters that aren’t yet in Unicode (things like: Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...). Where was it at?! I still want to get the information. Thank You!

Robert Lloyd Wheelock
Re: PUA as the Wild West [was: SSP default ignorable characters]
A propos of the separate thread on the directionality of Arabic digits...

> At some point it can indeed become unrealistic, snobbish,
> self-serving, and even lazy to just casually toss out the
> do-it-yourself crumb.

Thank you, Dean, for casting Persians in your Western. ;-)

> Currently, I view the PUA as practically a wasteland, unusable even
> for the most basic research work.

A wise decision, all in all.

> Is it simply out of the question, to review PUA policies and
> implementation in Unicode? Could not the PUA, or possibly multiple
> PUA's, retain their almost wild west independence and entrepreneurial
> spirit, and still have a few sheriffs hanging around here and there
> to impose some minimal expectation of law and order?

But who would you cast in the role of sheriff? James Garner is getting a little old for that kind of thing, and I didn't see anyone with the right acting resume among the Persians.

--Ken
Re: PUA properties, default or otherwise (was: Re: What is the principle?)
From: Doug Ewell
To: Unicode Mailing List
Cc: Kenneth Whistler
Sent: Wednesday, March 31, 2004 8:38 AM
Subject: PUA properties, default or otherwise (was: Re: What is the principle?)

> This discussion has focused pretty tightly on the *default* properties
> of PUA code points, without really addressing the issue of specifying
> new properties to override those defaults, and I think that's a
> mistake.

Exactly what I was saying. But you had more arguments for my remark.

> But Ken and Rick are absolutely right that very few companies are
> going to see a business opportunity in this. Even SC UniPad, which
> has implemented many comparatively arcane features of Unicode, has
> never done anything with the PUA, though it has been on their future
> versions list for 6 years now.

One of the main reasons may be that they are exactly limited by the lack of accurate properties for PUAs. But I see no reason why there could not exist an interoperable format to send these properties. I proposed to include that information in fonts (notably OpenType), but it may also be sent separately (in a font without the glyphs?).

Of course we can argue that some of the missing features may in some cases be encoded directly within the main text (for example by using RLO/PDF controls in the plain text to override the BiDi properties).

I also don't think that such an application is only for idiosyncratic characters. There are LOTS of scripts on earth that will probably never go through the scrutiny of Unicode, but that users may wish to start studying in an interoperable way, with common reusable technical solutions to create the documents they need. You may think that using some rich text format (Word DOC, Acrobat PDF, HTML+SVG...) would palliate the lack of standardization. But I do think that there is still some place for plain texts.
Re: PUA properties (was: What is the principle?)
From: Dominikus Scherkl (MGW)

>>> They do not. A user of PUA characters is free to define the whole
>>> range of PUA characters as consisting of strong R-to-L characters
>>> and implementing accordingly. ...
>>
>> This is not true! Users can define only those properties which the
>> software that they are using allows them to define.
>
> I would expect any application to allow _all_ properties to be
> defined by the user for each and any PUA character. If not so, it's a
> bug in the application!

Certainly NOT a bug, a limitation possibly, but how would you define the user properties associated with a font that contains all its glyphs in PUAs?

Can an OpenType font specify a table of character properties to enable correct rendering behavior of plain-text files containing PUAs that are said to be rendered with a specific font redefining these default PUA properties? I looked into the OpenType specs, and there's apparently no standard table format defined that would allow describing those PUAs. It's not a bug, but clearly a limitation too.

Are there some alternate font formats (other than OpenType/TrueType) where such property tables can be defined, notably the BiDi behavior, and the line breaking opportunities, or (why not?) some case foldings (if one wants for example to render some styles like small caps with the same PUA font)?
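The limitation under discussion starts from the defaults: absent any override mechanism, an implementation only has the UCD's blanket properties for the PUA. A quick check of those defaults with Python's `unicodedata` (a sketch for illustration):

```python
import unicodedata

pua = "\ue000"  # first code point of the BMP Private Use Area

# Default properties for PUA code points in the UCD:
assert unicodedata.category(pua) == "Co"      # Other, Private Use
assert unicodedata.bidirectional(pua) == "L"  # strong left-to-right
assert unicodedata.combining(pua) == 0        # not a combining mark
```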
Re: PUA
Marco Cimarosti writes:

> Now, my PuaInterpretation variable contains the following information:
> Foobar.ttf
> And my string contains the following text: (U+E017 U+E009)
> Now, what's the next step? What am I supposed to do to find out
> whether, according to the PUA interpretation called Foobar.ttf,
> U+E017 and U+E009 are letters or not?

Effectively, I don't like the idea of tagging PUA text with font name tags. I'd rather prefer tagging the PUA text with script name tags (I mean extended user-defined script codes like x-klingon, followed by a base codepoint indicator and a codespace length, like x-klingon;b=E000;l=80):

- this gives a real interpretation to PUAs, evaluated in their context,
- it allows remapping them locally to other ranges in case of conflict between multiple PUA conventions in use,
- the script indicator name can be mapped locally to a character properties database, indexed at the relative codepoint in the PUA convention codespace,
- any number of fonts can be designed to work with PUAs even if they are sharing conflicting codespaces,
- any language can use this system,
- no more need for extra planes,
- experimentation with new scripts still not standardized is possible, including for character properties, breaking behavior, layout, grapheme clustering, ...
- emulation of new standardized scripts becomes possible on previous implementations that lack support for new characters or scripts...
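As an illustration only (the tag syntax is Philippe's proposal, never standardized, and all names here are hypothetical), a parser for such a script-name tag and the local remapping it enables might look like:

```python
# Toy sketch of the proposed tag "x-klingon;b=E000;l=80": it declares
# that 0x80 code points starting at U+E000 belong to the user-defined
# script "x-klingon".

def parse_pua_tag(tag: str):
    """Parse a hypothetical 'name;b=HEX;l=HEX' PUA convention tag."""
    name, *fields = tag.split(";")
    params = dict(f.split("=", 1) for f in fields)
    return name, int(params["b"], 16), int(params["l"], 16)

def script_of(cp: int, conventions):
    """Map a PUA code point to its declared convention, if any."""
    for name, base, length in conventions:
        if base <= cp < base + length:
            return name, cp - base  # script name + relative code point
    return None

conv = [parse_pua_tag("x-klingon;b=E000;l=80")]
assert script_of(0xE017, conv) == ("x-klingon", 0x17)
assert script_of(0xF000, conv) is None  # outside the declared range
```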
Re: PUA
Chris Jacobs chris dot jacobs at freeler dot nl wrote:

> As I understand the position of the designers of Unicode they
> definitely don't want to be in charge of this and want to let the
> users of the PUA fight it out among themselves.

"Come to a mutual agreement" is probably more in the spirit. I doubt the original designers of Unicode expected much competition among PUA mappings.

> Nevertheless I think if Unicode don't want to decide how the PUA is
> to be interpreted it should at the very least provide a mechanism by
> which a user of the PUA can specify which specification he prefers.

I'm pretty sure UTC wants to stay as far away as possible from something like this that could be misunderstood as running a PUA registry.

> I plan to propose such a mechanism:
> I want to propose a char with the following properties:
> Scalar Value: U+E0002
> This starts a PUA interpretation selector tag.
> The content of the tag is a Font family name.
> For all PUA chars between this tag and the corresponding Cancel tag
> the copyright holder of the font is the sole authority about how the
> PUA should be interpreted.
> Any comments?

Plenty.

You're assuming a one-to-one relationship between font and PUA mapping, and especially between font maker and PUA registration authority, that doesn't necessarily exist. Code2000, for instance, is not the only font that covers some of the ConScript ranges, particularly Tengwar and Klingon. For the PUA mappings established by Microsoft and Apple, there are numerous fonts distributed not only by those companies, but by others.

Ideally, PUA characters should also have complete (or nearly complete) information on Unicode properties, such as directionality and combining class. This isn't necessarily the kind of information you could get by asking the font vendor or examining a font file. Font files don't even have Unicode character names, just short identifiers like "aacute".
Despite the wording "For all PUA chars...", there is no real guarantee that an implementation would respect this font tag for PUA characters only, and I think there'd have to be.

Finally, there is not a great sentiment within the UTC for expanding the role of Plane 14 tags in general. In my November 2002 paper "In defense of Plane 14 language tags" (L2/02-396R), I wrote that deprecating those tags (which was under discussion at the time) would implicitly deprecate the entire concept of Plane 14 tagging, and discourage the introduction of new, non-language-related Plane 14 tags like the one you describe. As it turns out, there are those who feel that would be a good thing.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
RE: PUA
Chris Jacobs wrote:

> [...] Nevertheless I think if Unicode don't want to decide how the
> PUA is to be interpreted

Please take notice of this "interpreted": I'll come back to this soon.

> it should at the very least provide a mechanism by which a user of
> the PUA can specify which specification he prefers.
> I plan to propose such a mechanism:
> I want to propose a char with the following properties:
> Scalar Value: U+E0002
> This starts a PUA interpretation

Again, please take notice of this "interpretation".

> selector tag.
> The content of the tag is a Font family name.
> For all PUA chars between this tag and the corresponding Cancel tag
> the copyright holder of the font is the sole authority about how the
> PUA should be interpreted.

Again, "interpreted"...

> Any comments?

Yes. A font tells me how a certain run of text should be *displayed* in rich text, not how it should be *interpreted* in plain text.

Imagine that I have been asked to write a function AreTheseLetters() which gets a string argument (i.e., a piece of plain text) and returns a Boolean value indicating whether all the characters in it are letters. For non-PUA characters, I already implemented this using Unicode's General Category property: I decided that all characters whose General Category is L* are letters. My default assumption about PUA characters is that they are not letters. So far so good.

Now I want to use your PUA Plane 14 tags, if present, to override the above assumption about PUA characters. E.g., imagine that my string contains this:

(U+0E U+0E0002 U+0E0046 U+0E006F U+0E004F U+0E0062 U+0E0061 U+0E0072 U+0E002E U+0E0074 U+0E0074 U+0E0066 U+0E007F U+E017 U+E009)

This is what I am going to do:

1) I parse the tags at the beginning of the string and save the relevant information in a temporary variable which we will call PuaInterpretation;

2) I remove the tags.
Now, my PuaInterpretation variable contains the following information:

Foobar.ttf

And my string contains the following text:

(U+E017 U+E009)

Now, what's the next step? What am I supposed to do to find out whether, according to the PUA interpretation called Foobar.ttf, U+E017 and U+E009 are letters or not?

Marco
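Marco's hypothetical AreTheseLetters() is straightforward to sketch in Python (the function name and behavior follow his description; this is an illustration, not his actual code). With only the default properties, PUA code points have General Category Co and never qualify as letters:

```python
import unicodedata

def are_these_letters(s: str) -> bool:
    """True if every character is a letter (General Category L*).

    PUA code points have General Category Co, so with no override
    mechanism they are never counted as letters.
    """
    return all(unicodedata.category(c).startswith("L") for c in s)

assert are_these_letters("abc\u00c9\u00df")    # É and ß are letters
assert not are_these_letters("\ue017\ue009")   # PUA: category Co
assert not are_these_letters("abc.")           # '.' is Po
```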
RE: PUA
> ... For non-PUA characters, I already implemented this using
> Unicode's General Category property: I decided that all characters
> whose General Category is L* are letters.

Nit: That isn't quite true (but I'm not doubting your choice). The HANGUL * FILLER characters aren't letters, even though they are of GC Lo. Indeed, they are even invisible (but the Jamo ones are needed for representing isolated letters using Jamos in the adopted architecture for Hangul in Unicode; the non-Jamo Hangul fillers are there just for compatibility with an older standard, nothing lettery about them). Nor are LAO ELLIPSIS and THAI CHARACTER PAIYANNOI letters, though Lo. They are really punctuation.

> My default assumption about PUA characters is that they are not
> letters.

Hmm. A common default seems to be to treat them as CJK. Non-PUA CJK is Lo... (Except for radicals, which are So.) Granted, I'm not too fond of that default myself. The situation is a bit similar for Braille, where the glyphs are given, but nothing much else.

/kent k
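Kent's nit is easy to verify against the UCD with Python's `unicodedata` (a small check, added for illustration): all four of his examples carry General Category Lo despite not being letters in any functional sense.

```python
import unicodedata

# Kent's examples: code points with General Category Lo that are not
# really letters.
for cp, name in [
    (0x115F, "HANGUL CHOSEONG FILLER"),
    (0x3164, "HANGUL FILLER"),
    (0x0EAF, "LAO ELLIPSIS"),
    (0x0E2F, "THAI CHARACTER PAIYANNOI"),
]:
    assert unicodedata.name(chr(cp)) == name
    assert unicodedata.category(chr(cp)) == "Lo"
```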
RE: PUA
Marco Cimarosti wrote,

> So far so good. Now I want to use your PUA Plane 14 tags, if present,
> to override the above assumption about PUA characters. E.g., imagine
> that my string contains this:
> FoObar.ttf ? (U+0E U+0E0002 U+0E0046 U+0E006F U+0E004F U+0E0062
> U+0E0061 U+0E0072 U+0E002E U+0E0074 U+0E0074 U+0E0066 U+0E007F
> U+E017 U+E009)
> This is what I am going to do:
> 1) I parse the tags at the beginning of the string and save the
> relevant information in a temporary variable which we will call
> PuaInterpretation;
> 2) I remove the tags.
> Now, my PuaInterpretation variable contains the following
> information: Foobar.ttf
> And my string contains the following text: (U+E017 U+E009)
> Now, what's the next step? What am I supposed to do to find out
> whether, according to the PUA interpretation called Foobar.ttf,
> U+E017 and U+E009 are letters or not?

Hmmm, the UTF-8 non-BMP string apparently got munged.

Anyway, the next step is for your function to load the file Foobar.puapropertiesclass. This file is a plain-text file following the same format as UNIDATA. It's extensible -- if the font vendor doesn't include it with the font download, then the savvy end-user can simply construct it with a plain-text editor. Now your function has all the necessary information and can determine whether the PUA code points are letters, or not.

Best regards,

James Kass
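A sketch of what loading such a file could look like; the file name, its contents, and the "Foobar" convention are all hypothetical (from James's suggestion), and the format simply mirrors UnicodeData.txt's semicolon-separated fields (code point; name; General Category; ...):

```python
def load_pua_properties(lines):
    """Parse UnicodeData-style lines into {code point: General Category}."""
    props = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # allow comments/blanks
        if not line:
            continue
        fields = line.split(";")
        props[int(fields[0], 16)] = fields[2]
    return props

# Hypothetical contents of Foobar.puapropertiesclass:
foobar = load_pua_properties([
    "E017;FOOBAR LETTER EXAMPLE ONE;Lo;0;L;;;;;N;;;;;",
    "E009;FOOBAR LETTER EXAMPLE TWO;Lo;0;L;;;;;N;;;;;",
])

def is_letter(cp: int, pua_props) -> bool:
    # PUA code points consult the override table; default stays Co.
    return pua_props.get(cp, "Co").startswith("L")

assert is_letter(0xE017, foobar)
assert not is_letter(0xE000, foobar)  # not in the table: stays Co
```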
Re: PUA
Why does this have to be in 'plain text'?

Plain text can be streams or strings. For streams, such a mechanism might make sense, if you could identify a compelling case that's not better handled by HTML, XML, etc. For strings, embedding font names in front of characters just violates some implicit assumptions, e.g. that the average string is 'short', that the number of bytes is a small and at least probabilistically determinable multiple of the number of characters, etc. etc. Not to forget that strings are often assumed to be the plainest of plain text.

A lot of architectures will break if you violate these implicit assumptions by hosting a mini-markup inside a string. And for at least half of them (my scientific estimate) performance will prevent them from doing anything about it, so you are stuck.

The language tagging scheme was designed for use with a string-based protocol, but one where the protocol contained the rules for interpreting any tagging. What you are proposing is something that's supposed to just infect any run of characters without warning. Who's going to implement this, why, where, and when?

A./

At 04:34 AM 10/20/03 +0200, Chris Jacobs wrote:

> ----- Original Message -----
> From: Doug Ewell [EMAIL PROTECTED]
> To: Unicode Mailing List [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; Tom Gewecke [EMAIL PROTECTED]
> Sent: Sunday, October 19, 2003 8:32 PM
> Subject: Re: Klingons and their allies - Beyond 17 planes
>
>> jameskass at att dot net wrote:
>>
>>> In addition to the problem of the OS substituting improper glyphs
>>> from inappropriate fonts unexpectedly, there's often a problem with
>>> line breaking. Since the PUA has no properties, some applications
>>> seem to ignore the space character and break lines arbitrarily,
>>> splitting words in the middle.
>>
>> That's exactly what happens in my sample pages. I didn't think it was
>> because the PUA had no properties so much as default properties, which
>> (as Thomas Chan indicated) might be Han-based or Han-influenced.
>> You can always switch to a font that will display glyphs for your PUA
>> characters, but it's harder to adapt a rendering engine to observe PUA
>> character properties.
>
> One problem is that there seems to be no way in plain-text Unicode to
> specify who is in charge of a particular interpretation of the PUA.
>
> As I understand the position of the designers of Unicode, they
> definitely don't want to be in charge of this, and want to let the
> users of the PUA fight it out among themselves.
>
> Nevertheless, I think that if Unicode doesn't want to decide how the
> PUA is to be interpreted, it should at the very least provide a
> mechanism by which a user of the PUA can specify which interpretation
> he prefers. I plan to propose such a mechanism: I want to propose a
> character with the following properties:
>
>     Scalar Value: U+E0002
>
>     This starts a PUA interpretation selector tag. The content of the
>     tag is a font family name. For all PUA characters between this tag
>     and the corresponding Cancel tag, the copyright holder of the font
>     is the sole authority on how the PUA should be interpreted.
>
> Any comments?

>> In any case, I am absolutely certain :-) :-) that the arbitrary
>> mid-word line breaking is what has discouraged would-be readers from
>> pointing out the typo (since fixed) in my transcription of a Dorothy
>> Parker poem:
>>
>> http://users.adelphia.net/~dewell/sopp-ew.html
>>
>> -Doug Ewell
>> Fullerton, California
>> http://users.adelphia.net/~dewell/
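Chris's proposed selector can be sketched as an encoder. Note that U+E0002 is only his proposed scalar value (it is an unassigned code point, not part of any standard), and I assume here that the name is spelled in the existing Plane-14 tag characters and closed with U+E007F CANCEL TAG, by analogy with the language tagging scheme.

```python
# Proposed (hypothetical) PUA interpretation selector, per Chris Jacobs:
# U+E0002 opens the tag, the font family name follows in tag characters
# (U+E0020..U+E007E), and U+E007F (CANCEL TAG) ends its scope.
PUA_SELECTOR = "\U000E0002"
CANCEL_TAG = "\U000E007F"

def tag_pua_run(font_family: str, pua_text: str) -> str:
    """Wrap pua_text in a selector naming the font that interprets it.
    font_family must be printable ASCII to map onto tag characters."""
    name = "".join(chr(0xE0000 + ord(c)) for c in font_family)
    return PUA_SELECTOR + name + pua_text + CANCEL_TAG
```

A consumer would do the reverse: strip the run, look up the named font's PUA conventions, and apply them to the enclosed characters, which is exactly where Asmus's objections about strings and implicit assumptions bite.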
Re: PUA
on 2003-10-19 19:34 Chris Jacobs wrote:

> One problem is that there seems to be no way in plaintext unicode to
> specify who is in charge of a particular interpretation of the PUA.

At last! Another use for Plane 14! :-)

--
Curtis Clark http://www.csupomona.edu/~jcclark/
Mockingbird Font Works http://www.mockfont.com/