RE: PUA (BMP) planned characters HTML tables
On August 11, I replied to Robert Wheelock:

>> I remember that a website that has tables for certain PUA precomposed
>> accented characters that aren’t yet in Unicode (things like:
>> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
>> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).
>
> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.

I missed the possible significance of the Latvian comma below vs. Marshallese cedilla, which captured most of the ensuing discussion and morphed into a discussion about different user communities and group identity.

I'd like to restate, since I think the point may have been lost, that for the OTHER characters Robert mentioned:

> H/h-acute, capital T-dieresis, capital H-underbar, acute accented
> Cyrillic vowels, Cyrillic ER/er-caron, ...

there does not appear to be any conflicting usage between different user communities, and no particular difficulty in rendering or otherwise processing these as combining sequences, using up-to-date fonts and rendering engines. I suppose Philippe's example of Võro might factor into whether different groups prefer different appearances for h́, but otherwise these user-perceived characters seem to be non-controversial.

So to reiterate, these characters appear vanishingly unlikely to be atomically encoded, "yet" or ever, for good reason.

--
Doug Ewell | Thornton, CO, US | ewellic.org
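The principle Doug restates is easy to see with Python's `unicodedata` (a small sketch, not part of the original mail): a combining sequence with no precomposed counterpart survives NFC unchanged, whereas a pair that was encoded precomposed before the policy took hold composes.

```python
import unicodedata

# "h with acute" has no precomposed code point, so NFC leaves the
# combining sequence as two code points.
h_acute = "h\u0301"  # h + COMBINING ACUTE ACCENT
assert unicodedata.normalize("NFC", h_acute) == h_acute
assert len(unicodedata.normalize("NFC", h_acute)) == 2

# By contrast, a pair that *was* encoded precomposed (pre-dating the
# policy) composes under NFC:
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"  # é
```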
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 7:49 PM, James Kass via Unicode wrote:

> On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:
>> Empirically, it has been observed that some distinctions that are
>> claimed by users, standards developers or implementers were de-facto
>> not honored by type developers (and users selecting fonts) as long as
>> the native text doesn't contain minimal pairs.
>
> Quickly checked a couple of older on-line PDFs and both used the comma
> below unabashedly. Quoting from this page (which appears to be more
> modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm
>
> "Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon
> booj jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo
> iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa
> in ṃōṃkaj kar ..."
>
> It seems that users are happy to employ a dot below in lieu of either
> a comma or cedilla. This newer web page is from a book published in
> 1978. There's a scan of the original book cover. Although the book
> title is all caps hand printing it appears that commas were used. The
> Marshallese orthography which uses commas/cedillas is fairly recent,
> replacing an older scheme devised by missionaries. Perhaps the actual
> users have already resolved this dilemma by simply using dots below.

That may be the case for Marshallese. But wouldn't surprise me.

My comments were based on a different case of the same kinds of diacritics below (other languages), and at the time we consulted typographic samples, including newsprint, that were using pre-Unicode technologies. In that sense a cleaner case, because there was no influence by what Unicode did or didn't do.

Now, having said that, I do get it that some materials, like text books, online class materials etc. need to be prepared / printed using the normative style for the given orthography. But it's a far cry from claiming that all text in a given language is invariably done only one way.

A./
Re: PUA (BMP) planned characters HTML tables
On Wed, 14 Aug 2019 23:32:37 + James Kass via Unicode wrote:

> U+0149 has a compatibility decomposition. It has been deprecated and
> is not rendered identically on my system.
> 'n ʼn
> ( ’n )

Compatibility decompositions are quite a mix, but are generally expected to render differently. If they were expected to render the same, they would normally be canonical decompositions. U+0149 and its decomposition naturally render very differently with a monospaced font. The same goes for the Roman numerals that the Far East gave us.

Richard.
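Richard's distinction can be checked directly with Python's `unicodedata` (an illustrative sketch): canonical normalization leaves U+0149 alone, while compatibility normalization decomposes it.

```python
import unicodedata

n_apostrophe = "\u0149"  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

# U+0149 has only a *compatibility* decomposition, so canonical
# normalization (NFD/NFC) leaves it unchanged...
assert unicodedata.normalize("NFD", n_apostrophe) == n_apostrophe

# ...while compatibility normalization (NFKD/NFKC) splits it into
# U+02BC MODIFIER LETTER APOSTROPHE + n.
assert unicodedata.normalize("NFKD", n_apostrophe) == "\u02bcn"
```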
Re: PUA (BMP) planned characters HTML tables
On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:

> Empirically, it has been observed that some distinctions that are
> claimed by users, standards developers or implementers were de-facto
> not honored by type developers (and users selecting fonts) as long as
> the native text doesn't contain minimal pairs.

Quickly checked a couple of older on-line PDFs and both used the comma below unabashedly. Quoting from this page (which appears to be more modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm

"Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon booj jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa in ṃōṃkaj kar ..."

It seems that users are happy to employ a dot below in lieu of either a comma or cedilla. This newer web page is from a book published in 1978. There's a scan of the original book cover. Although the book title is all caps hand printing, it appears that commas were used. The Marshallese orthography which uses commas/cedillas is fairly recent, replacing an older scheme devised by missionaries. Perhaps the actual users have already resolved this dilemma by simply using dots below.
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 2:05 AM, James Kass via Unicode wrote:

> This presumes that the premise of user communities feeling strongly
> about the unacceptable aspect of the variants is valid. Since it has
> been reported and nothing seems to be happening, perhaps the casual
> users aren't terribly concerned. It's also possible that the various
> user communities have already set up their systems to handle things
> acceptably by installing appropriate fonts.

This is always a good question.

Empirically, it has been observed that some distinctions that are claimed by users, standards developers or implementers were de-facto not honored by type developers (and users selecting fonts) as long as the native text doesn't contain minimal pairs.

For example, some Latin fonts drop the dot on the lowercase i for stylistic reasons (or designers use dotless i in highly designed texts, like book covers, logos, etc.). That's usually not a problem for ordinary users for monolingual texts in, say, English; even though everyone agrees that the lowercase i is normally dotted, the absence isn't noticed by most, and is tolerated even by those who do notice it.

However, as soon as a user community sees a particular variant as signalling their group identity, they will be very vocal about it - even, interestingly enough, in cases where de-facto use (e.g. via font selection, and not forced by implementation defaults) doesn't match that preference. As I said, we've seen this in the past for some features in some languages.

Now, which features become strongly identified with group identity is something that is subject to change over time; this makes it impossible to guarantee both absolute stability and perfect compatibility, especially if a combining mark that is used in decompositions needs to be disunified because the range of shapes changes from being stylistic to normative.
Before Unicode, with character sets limited to local use, you couldn't create minimal pairs (except if the variation was part of your language, like Turkish i with/without dot). So, if a font deviated and pushed the stylistic envelope, the non-preferred form, if used, would still necessarily refer to the local character; there was no way it could mean anything else.

With Unicode, that's changed, and instead of user communities treating this as a typographic issue (exclusive use of a preferred font), which is decentralized to document authors (and perhaps font vendors), it becomes a character coding issue that is highly visible and centralized. That in turn can lead to the issue becoming politicized, not unlike some grammar issues, where the supposedly "correct" form is far from universally agreed on in practice.

A./
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 4:32 PM, James Kass via Unicode wrote:

> If a character gets deprecated, can its decomposition type be changed
> from canonical to compatibility?

Simple answer: No.

--Ken
Re: PUA (BMP) planned characters HTML tables
On 2019-08-14 7:50 PM, Richard Wordingham via Unicode wrote:

> I think you'd also have to change the reference glyph of LATIN LOWER
> CASE I WITH HEART to show a heart. That's valid because the UCD
> trumps the code charts, and no Unicode-compliant process may
> deliberately render differently from LATIN LOWER CASE I WITH HEART.

U+0149 has a compatibility decomposition. It has been deprecated and is not rendered identically on my system.

'n ʼn
( ’n )

If a character gets deprecated, can its decomposition type be changed from canonical to compatibility?
Re: PUA (BMP) planned characters HTML tables
On Wed, 14 Aug 2019 09:05:02 + James Kass via Unicode wrote:

> The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's
> only in there because of legacy. Its presence guarantees
> round-tripping with legacy data but it isn't needed for modern data
> or display. Urge Groups One and Two to encode their data with the
> desired combiner and educate font engine developers about the
> deprecation. As the rendering engines get updated, the system
> substitution of the wrongly named precomposed glyph will go away.

I think you'd also have to change the reference glyph of LATIN LOWER CASE I WITH HEART to show a heart. That's valid because the UCD trumps the code charts, and no Unicode-compliant process may deliberately render differently from LATIN LOWER CASE I WITH HEART.

Richard.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-12 8:30 AM, Andrew West wrote:

> This issue was discussed at WG2 in 2013
> (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf),
> when there was a recommendation to encode precomposed letters L and N
> with cedilla *with no decomposition*, but that solution does not seem
> to have been taken up by the UTC.

Group One dots their lowercase "i" letters with little flowers and Group Two dots theirs with little hearts. Group Two considers flowers unacceptable and Group One rejects hearts.

Because of legacy character sets there's a precomposed character encoded called "LATIN LOWER CASE I WITH HEART", but it was misnamed and is normally drawn with a flower instead. Group Two tries to encode "LATIN LOWER CASE I" plus "COMBINING HEART" to get the thing to display properly. But because there's a decomposition involved, the font engine substitutes the glyph mapped to "LATIN LOWER CASE I WITH HEART" in the display for the string "LATIN LOWER CASE I" plus "COMBINING HEART". This thwarts Group Two because they still get the flower.

The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's only in there because of legacy. Its presence guarantees round-tripping with legacy data but it isn't needed for modern data or display. Urge Groups One and Two to encode their data with the desired combiner and educate font engine developers about the deprecation. As the rendering engines get updated, the system substitution of the wrongly named precomposed glyph will go away.

This presumes that the premise of user communities feeling strongly about the unacceptable aspect of the variants is valid. Since it has been reported and nothing seems to be happening, perhaps the casual users aren't terribly concerned. It's also possible that the various user communities have already set up their systems to handle things acceptably by installing appropriate fonts.
Re: PUA (BMP) planned characters HTML tables
On Mon, 12 Aug 2019 at 02:27, James Kass via Unicode wrote:
>
> On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:
> > If you are thinking of these as potential future additions to the
> > standard, keep in mind that accented letters that can already be
> > represented by a combination of letter + accent will not ever be
> > encoded. This is one of the longest-standing principles Unicode has.

People seem to be ignoring the fact that Marshallese and Latvian both use L and N with cedilla, but with completely different glyph shapes:

> In January 2013, the Unicode Technical Committee discussed issues for
> the representation of Marshallese orthography. In particular,
> Marshallese uses the Latin script and requires the letters l, m, n,
> and o with cedilla. Latvian orthography uses the Latin script and
> requires the letters g, k, l, n, and r with comma below. For
> Marshallese, it is unacceptable to display cedillas as commas below.
> Conversely, for Latvian, it is unacceptable to display commas below
> as cedillas.

However, as fonts have been following Latvian practice for these letters (cedilla is displayed as a comma below) since before Unicode, Marshallese users cannot get their desired outcome using standard Unicode combining diacritical marks unless they apply a font specially designed for Marshallese -- which you can never guarantee if you are writing an email or posting on Twitter, etc.

This issue was discussed at WG2 in 2013 (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf), when there was a recommendation to encode precomposed letters L and N with cedilla *with no decomposition*, but that solution does not seem to have been taken up by the UTC.

Andrew
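The encoding asymmetry Andrew describes is visible in the UCD itself; a quick check with Python's `unicodedata` (an illustrative sketch): the Latvian letters canonically decompose to U+0327 COMBINING CEDILLA even though Latvian renders the mark as a comma below, while the Marshallese letters were never encoded precomposed.

```python
import unicodedata

# Latvian ļ and ņ are encoded precomposed and canonically decompose to
# base letter + U+0327 COMBINING CEDILLA:
assert unicodedata.normalize("NFD", "\u013C") == "l\u0327"  # ļ
assert unicodedata.normalize("NFD", "\u0146") == "n\u0327"  # ņ

# Marshallese m̧ and o̧ have no precomposed code points, so the
# combining sequence is the only encoding, and NFC leaves it as-is:
assert unicodedata.normalize("NFC", "m\u0327") == "m\u0327"
assert unicodedata.normalize("NFC", "o\u0327") == "o\u0327"
```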
Re: PUA (BMP) planned characters HTML tables
On Mon, 12 Aug 2019 01:21:42 + James Kass via Unicode wrote:

> There was a time when populating the PUA with precomposed glyphs was
> necessary for printing or display, but that time has passed.

There is still the issue that in pure X one can't put sequences of characters on a key; if the application doesn't invoke an input method one is stuck. Useful 20-year-old proprietary code may be totally unable to use modern font capabilities. Don't forget the Cobol Y10k joke.

On Ubuntu at least, there was a period when Emacs couldn't access X-based input methods from an English locale. The work-around: use a Japanese locale plus the vanilla lack of internationalisation in the interface, or Emacs's very convenient alternative keyboard capability for text input as opposed to commands. The bug turned out to be in the definition of the locales, i.e. in privileged data beyond the purview of Emacs.

As to the need for the PUA, writing fonts to cope with Tai Tham rendering engines is not easy, and it's no surprise that the PUA is used on line for a newspaper that uses the Tai Tham script. The USE is too user-hostile for it to have helped if it had been available earlier. (It just ignored the regular expression published in 2007; it's in L2/07-007R in the UTC document register, ISO/IEC JTC1/SC2/WG2/N3207R on ISO land.) Indeed, perhaps I should be researching the PUA encoding for Tai Tham. (My Tai Tham font Da Lekh started as proof of principle, for there is already an unpleasant amount of glyph sequence changing, some style-dependent. I couldn't see how to get rendering engine support even when it might be added. I was pleasantly surprised at how far from impossible Tai Tham layout was until the USE came along and made everything harder. I now have to work out which glyph instances have already been Indicly rearranged when I repair the clustering.)

Oh, and I seem to need some PUA codepoints for vowels that get stranded when line-breaks occur between the columns of an akshara.
The proposals show this phenomenon in old(?) Pali text. Or is there any chance of getting them encoded?

Richard.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:

> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.

Good point. There was a time when populating the PUA with precomposed glyphs was necessary for printing or display, but that time has passed. Hopefully anyone seeking charts is transcoding older data into proper Unicode.

This can be illustrated with the Marshallese combos mentioned earlier.

PUA:
Standard: ĻļM̧m̧ŅņO̧o̧

Well, that didn't work out as well as expected. But the standard Unicode is supported (more or less) by some of the core fonts installed here. Nothing installed here displays anything useful for the PUA characters. A decent OpenType font designed with Marshallese in mind should work just fine with the combiners.

The fact is that the standard characters will survive and can be universally exchanged. And there's plenty of web page charts showing the standard characters.
RE: PUA (BMP) planned characters HTML tables
Robert Wheelock wrote:

> I remember that a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).

If you are thinking of these as potential future additions to the standard, keep in mind that accented letters that can already be represented by a combination of letter + accent will not ever be encoded. This is one of the longest-standing principles Unicode has.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 4:07 AM, Robert Wheelock via Unicode wrote:

> Hello! I remember that a website that has tables for certain PUA
> precomposed accented characters that aren’t yet in Unicode (things
> like: Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron,
> ...). Where was it at?! I still want to get the information. Thank
> You!

It sounds familiar but I can't place it. I tried the SIL pages first, as did Richard Wordingham apparently.

https://blogfonts.com/dehuti.font

This font has material in the PUA including Marshallese glyphs with cedillas: L (E382 & E394), M (E3A6 & E3BB), N (E3CE & E3DE), O (E429 & E465). These appear to be PUA characters which the font developer has mapped in addition to the SIL PUA mappings.
Re: PUA (BMP) planned characters HTML tables
On Sun, 11 Aug 2019 00:07:05 -0400 Robert Wheelock via Unicode wrote:

> I remember that a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron,
> ...). Where was it at?! I still want to get the information. Thank
> You!

You may mean https://www.eki.ee/letter. Once there, you'll want to make a query by Unicode range, e.g. e000-f8ff. It doesn't seem to refer to the relevant agreement. You could start hunting for agreements at https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA

Most of the characters you mention are scheduled to be assigned their own codepoint on the Greek kalends. They are precluded by policy because they would need to be composition exclusions to avoid making text in NFC cease to be in NFC.

I first thought of the SIL PUA at https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi=PUA_home , but they knew better than to include most of them.

Richard.
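Richard's point about NFC stability can be made concrete with Python's `unicodedata` (a sketch; the future precomposed character is of course hypothetical). Text like m + COMBINING CEDILLA is in NFC today; if a precomposed m-cedilla were later encoded with a canonical decomposition and not composition-excluded, NFC would start composing the pair, and previously normalized text would silently cease to be in NFC.

```python
import unicodedata

# Marshallese m + COMBINING CEDILLA has no precomposed form, so the
# two-code-point sequence is itself already in NFC:
seq = "m\u0327"
assert unicodedata.is_normalized("NFC", seq)
assert unicodedata.normalize("NFC", seq) == seq

# Contrast with a pair that *does* have a precomposed form: the
# decomposed sequence is NOT in NFC, because NFC composes it.
assert not unicodedata.is_normalized("NFC", "e\u0301")
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```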
RE: PUA (BMP) planned characters HTML tables
Hello! I remember that a website that has tables for certain PUA precomposed accented characters that aren’t yet in Unicode (things like: Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...). Where was it at?! I still want to get the information. Thank You!

Robert Lloyd Wheelock
Re: PUA as the Wild West [was: SSP default ignorable characters]
A propos of the separate thread on the directionality of Arabic digits...

> At some point it can indeed become unrealistic, snobbish,
> self-serving, and even lazy to just casually toss out the
> do-it-yourself crumb.

Thank you, Dean, for casting Persians in your Western. ;-)

> Currently, I view the PUA as practically a wasteland, unusable even
> for the most basic research work.

A wise decision, all in all.

> Is it simply out of the question, to review PUA policies and
> implementation in Unicode? Could not the PUA, or possibly multiple
> PUA's, retain their almost wild west independence and entrepreneurial
> spirit, and still have a few sheriffs hanging around here and there
> to impose some minimal expectation of law and order?

But who would you cast in the role of sheriff? James Garner is getting a little old for that kind of thing, and I didn't see anyone with the right acting resume among the Persians.

--Ken
Re: PUA properties, default or otherwise (was: Re: What is the principle?)
From: Doug Ewell
To: Unicode Mailing List
Cc: Kenneth Whistler
Sent: Wednesday, March 31, 2004 8:38 AM
Subject: PUA properties, default or otherwise (was: Re: What is the principle?)

> This discussion has focused pretty tightly on the *default* properties
> of PUA code points, without really addressing the issue of specifying
> new properties to override those defaults, and I think that's a
> mistake.

Exactly what I was saying. But you had more arguments for my remark.

> But Ken and Rick are absolutely right that very few companies are
> going to see a business opportunity in this. Even SC UniPad, which
> has implemented many comparatively arcane features of Unicode, has
> never done anything with the PUA, though it has been on their future
> versions list for 6 years now.

One of the main reasons may be that they are exactly limited by the lack of accurate properties for PUAs. But I see no reason why there could not exist an interoperable format to send these properties. I proposed to include that information in fonts (notably OpenType), but it may also be sent separately (in a font without the glyphs?).

Of course we can argue that some of the missing features may in some cases be encoded directly within the main text (for example by using RLO/PDF controls in the plain text to override the BiDi properties).

I also don't think that such an application is only for idiosyncratic characters. There are LOTS of scripts on earth that will probably never go through the scrutiny of Unicode, but that users may wish to start studying in an interoperable way, with common reusable technical solutions to create the documents they need. You may think that using some rich text format (Word DOC, Acrobat PDF, HTML+SVG...) would palliate the lack of standardization. But I do think that there is still some place for plain texts.
Re: PUA properties (was: What is the principle?)
From: Dominikus Scherkl (MGW)

>>> They do not. A user of PUA characters is free to define the whole
>>> range of PUA characters as consisting of strong R-to-L characters
>>> and implementing accordingly. ...
>>
>> This is not true! Users can define only those properties which the
>> software that they are using allows them to define.
>
> I would expect any application to allow _all_ properties to be
> defined by the user for each and any PUA character. If not so, it's a
> bug in the application!

Certainly NOT a bug, a limitation possibly, but how would you define the user properties associated with a font that contains all its glyphs in PUAs?

Can an OpenType font specify a table of character properties to enable correct rendering behavior of plain-text files containing PUAs that are said to be rendered with a specific font redefining these default PUA properties? I looked into the OpenType specs, and there's apparently no standard table format defined that would allow describing those PUAs. It's not a bug, but clearly a limitation too.

Are there some alternate font formats (other than OpenType/TrueType) where such property tables can be defined, notably the BiDi behavior, and the line breaking opportunities, or (why not?) some case foldings (if one wants for example to render some styles like small caps with the same PUA font)?
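The limitation under discussion starts from the defaults: absent any override mechanism, an implementation only has the UCD's blanket properties for the PUA. A quick check of those defaults with Python's `unicodedata` (a sketch for illustration):

```python
import unicodedata

pua = "\ue000"  # first code point of the BMP Private Use Area

# Default properties for PUA code points in the UCD:
assert unicodedata.category(pua) == "Co"      # Other, Private Use
assert unicodedata.bidirectional(pua) == "L"  # strong left-to-right
assert unicodedata.combining(pua) == 0        # not a combining mark
```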
Re: PUA
Marco Cimarosti writes:

> Now, my PuaInterpretation variable contains the following information:
> Foobar.ttf
> And my string contains the following text: (U+E017 U+E009)
> Now, what's the next step? What am I supposed to do to find out
> whether, according to the PUA interpretation called Foobar.ttf,
> U+E017 and U+E009 are letters or not?

Effectively, I don't like the idea of tagging PUA text with font name tags. I'd rather prefer tagging the PUA text with script name tags (I mean extended user-defined script codes like x-klingon, followed by a base codepoint indicator and a codespace length, like x-klingon;b=E000;l=80):

- this gives a real interpretation to PUAs, evaluated in their context,
- it allows remapping them locally to other ranges in case of conflict between multiple PUA conventions in use,
- the script indicator name can be mapped locally to a character properties database, indexed at the relative codepoint in the PUA convention codespace,
- any number of fonts can be designed to work with PUAs even if they are sharing conflicting codespaces,
- any language can use this system,
- no more need for extra planes,
- experimentation with new scripts still not standardized is possible, including for character properties, breaking behavior, layout, grapheme clustering, ...
- emulation of new standardized scripts becomes possible on previous implementations that lack support for new characters or scripts...
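As an illustration only (the tag syntax is Philippe's proposal, never standardized, and all names here are hypothetical), a parser for such a script-name tag and the local remapping it enables might look like:

```python
# Toy sketch of the proposed tag "x-klingon;b=E000;l=80": it declares
# that 0x80 code points starting at U+E000 belong to the user-defined
# script "x-klingon".

def parse_pua_tag(tag: str):
    """Parse a hypothetical 'name;b=HEX;l=HEX' PUA convention tag."""
    name, *fields = tag.split(";")
    params = dict(f.split("=", 1) for f in fields)
    return name, int(params["b"], 16), int(params["l"], 16)

def script_of(cp: int, conventions):
    """Map a PUA code point to its declared convention, if any."""
    for name, base, length in conventions:
        if base <= cp < base + length:
            return name, cp - base  # script name + relative code point
    return None

conv = [parse_pua_tag("x-klingon;b=E000;l=80")]
assert script_of(0xE017, conv) == ("x-klingon", 0x17)
assert script_of(0xF000, conv) is None  # outside the declared range
```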
Re: PUA
Chris Jacobs chris dot jacobs at freeler dot nl wrote:

> As I understand the position of the designers of Unicode they
> definitely don't want to be in charge of this and want to let the
> users of the PUA fight it out among themselves.

"Come to a mutual agreement" is probably more in the spirit. I doubt the original designers of Unicode expected much competition among PUA mappings.

> Nevertheless I think if Unicode don't want to decide how the PUA is
> to be interpreted it should at the very least provide a mechanism by
> which a user of the PUA can specify which specification he prefers.

I'm pretty sure UTC wants to stay as far away as possible from something like this that could be misunderstood as running a PUA registry.

> I plan to propose such a mechanism:
> I want to propose a char with the following properties:
> Scalar Value: U+E0002
> This starts a PUA interpretation selector tag.
> The content of the tag is a Font family name.
> For all PUA chars between this tag and the corresponding Cancel tag
> the copyright holder of the font is the sole authority about how the
> PUA should be interpreted.
> Any comments?

Plenty.

You're assuming a one-to-one relationship between font and PUA mapping, and especially between font maker and PUA registration authority, that doesn't necessarily exist. Code2000, for instance, is not the only font that covers some of the ConScript ranges, particularly Tengwar and Klingon. For the PUA mappings established by Microsoft and Apple, there are numerous fonts distributed not only by those companies, but by others.

Ideally, PUA characters should also have complete (or nearly complete) information on Unicode properties, such as directionality and combining class. This isn't necessarily the kind of information you could get by asking the font vendor or examining a font file. Font files don't even have Unicode character names, just short identifiers like "aacute".
Despite the wording "For all PUA chars...", there is no real guarantee that an implementation would respect this font tag for PUA characters only, and I think there'd have to be.

Finally, there is not a great sentiment within the UTC for expanding the role of Plane 14 tags in general. In my November 2002 paper "In defense of Plane 14 language tags" (L2/02-396R), I wrote that deprecating those tags (which was under discussion at the time) would implicitly deprecate the entire concept of Plane 14 tagging, and discourage the introduction of new, non-language-related Plane 14 tags like the one you describe. As it turns out, there are those who feel that would be a good thing.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
RE: PUA
Chris Jacobs wrote:

> [...] Nevertheless I think if Unicode don't want to decide how the
> PUA is to be interpreted

Please take notice of this "interpreted": I'll come back to this soon.

> it should at the very least provide a mechanism by which a user of
> the PUA can specify which specification he prefers.
> I plan to propose such a mechanism:
> I want to propose a char with the following properties:
> Scalar Value: U+E0002
> This starts a PUA interpretation

Again, please take notice of this "interpretation".

> selector tag.
> The content of the tag is a Font family name.
> For all PUA chars between this tag and the corresponding Cancel tag
> the copyright holder of the font is the sole authority about how the
> PUA should be interpreted.

Again, "interpreted"...

> Any comments?

Yes. A font tells me how a certain run of text should be *displayed* in rich text, not how it should be *interpreted* in plain text.

Imagine that I have been asked to write a function AreTheseLetters() which gets a string argument (i.e., a piece of plain text) and returns a Boolean value indicating whether all the characters in it are letters. For non-PUA characters, I already implemented this using Unicode's General Category property: I decided that all characters whose General Category is L* are letters. My default assumption about PUA characters is that they are not letters. So far so good.

Now I want to use your PUA Plane 14 tags, if present, to override the above assumption about PUA characters. E.g., imagine that my string contains this:

(U+0E U+0E0002 U+0E0046 U+0E006F U+0E004F U+0E0062 U+0E0061 U+0E0072 U+0E002E U+0E0074 U+0E0074 U+0E0066 U+0E007F U+E017 U+E009)

This is what I am going to do:

1) I parse the tags at the beginning of the string and save the relevant information in a temporary variable which we will call PuaInterpretation;

2) I remove the tags.
Now, my PuaInterpretation variable contains the following information:

Foobar.ttf

And my string contains the following text:

(U+E017 U+E009)

Now, what's the next step? What am I supposed to do to find out whether, according to the PUA interpretation called Foobar.ttf, U+E017 and U+E009 are letters or not?

Marco
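Marco's hypothetical AreTheseLetters() is straightforward to sketch in Python (the function name and behavior follow his description; this is an illustration, not his actual code). With only the default properties, PUA code points have General Category Co and never qualify as letters:

```python
import unicodedata

def are_these_letters(s: str) -> bool:
    """True if every character is a letter (General Category L*).

    PUA code points have General Category Co, so with no override
    mechanism they are never counted as letters.
    """
    return all(unicodedata.category(c).startswith("L") for c in s)

assert are_these_letters("abc\u00c9\u00df")    # É and ß are letters
assert not are_these_letters("\ue017\ue009")   # PUA: category Co
assert not are_these_letters("abc.")           # '.' is Po
```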
RE: PUA
> ... For non-PUA characters, I already implemented this using
> Unicode's General Category property: I decided that all characters
> whose General Category is L* are letters.

Nit: That isn't quite true (but I'm not doubting your choice). The HANGUL * FILLER characters aren't letters, even though they are of GC Lo. Indeed, they are even invisible (but the Jamo ones are needed for representing isolated letters using Jamos in the adopted architecture for Hangul in Unicode; the non-Jamo Hangul fillers are there just for compatibility with an older standard, nothing lettery about them). Nor are LAO ELLIPSIS and THAI CHARACTER PAIYANNOI letters, though Lo. They are really punctuation.

> My default assumption about PUA characters is that they are not
> letters.

Hmm. A common default seems to be to treat them as CJK. Non-PUA CJK is Lo... (Except for radicals, which are So.) Granted, I'm not too fond of that default myself. The situation is a bit similar for Braille, where the glyphs are given, but nothing much else.

/kent k
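Kent's nit is easy to verify against the UCD with Python's `unicodedata` (a small check, added for illustration): all four of his examples carry General Category Lo despite not being letters in any functional sense.

```python
import unicodedata

# Kent's examples: code points with General Category Lo that are not
# really letters.
for cp, name in [
    (0x115F, "HANGUL CHOSEONG FILLER"),
    (0x3164, "HANGUL FILLER"),
    (0x0EAF, "LAO ELLIPSIS"),
    (0x0E2F, "THAI CHARACTER PAIYANNOI"),
]:
    assert unicodedata.name(chr(cp)) == name
    assert unicodedata.category(chr(cp)) == "Lo"
```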
RE: PUA
Marco Cimarosti wrote,

> So far so good. Now I want to use your PUA Plane 14 tags, if present,
> to override the above assumption about PUA characters. E.g., imagine
> that my string contains this:
> FoObar.ttf ? (U+0E U+0E0002 U+0E0046 U+0E006F U+0E004F U+0E0062
> U+0E0061 U+0E0072 U+0E002E U+0E0074 U+0E0074 U+0E0066 U+0E007F
> U+E017 U+E009)
> This is what I am going to do:
> 1) I parse the tags at the beginning of the string and save the
> relevant information in a temporary variable which we will call
> PuaInterpretation;
> 2) I remove the tags.
> Now, my PuaInterpretation variable contains the following
> information: Foobar.ttf
> And my string contains the following text: (U+E017 U+E009)
> Now, what's the next step? What am I supposed to do to find out
> whether, according to the PUA interpretation called Foobar.ttf,
> U+E017 and U+E009 are letters or not?

Hmmm, the UTF-8 non-BMP string apparently got munged.

Anyway, the next step is for your function to load the file Foobar.puapropertiesclass. This file is a plain-text file following the same format as UNIDATA. It's extensible -- if the font vendor doesn't include it with the font download, then the savvy end-user can simply construct it with a plain-text editor. Now your function has all the necessary information and can determine whether the PUA code points are letters, or not.

Best regards,

James Kass
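A sketch of what loading such a file could look like; the file name, its contents, and the "Foobar" convention are all hypothetical (from James's suggestion), and the format simply mirrors UnicodeData.txt's semicolon-separated fields (code point; name; General Category; ...):

```python
def load_pua_properties(lines):
    """Parse UnicodeData-style lines into {code point: General Category}."""
    props = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # allow comments/blanks
        if not line:
            continue
        fields = line.split(";")
        props[int(fields[0], 16)] = fields[2]
    return props

# Hypothetical contents of Foobar.puapropertiesclass:
foobar = load_pua_properties([
    "E017;FOOBAR LETTER EXAMPLE ONE;Lo;0;L;;;;;N;;;;;",
    "E009;FOOBAR LETTER EXAMPLE TWO;Lo;0;L;;;;;N;;;;;",
])

def is_letter(cp: int, pua_props) -> bool:
    # PUA code points consult the override table; default stays Co.
    return pua_props.get(cp, "Co").startswith("L")

assert is_letter(0xE017, foobar)
assert not is_letter(0xE000, foobar)  # not in the table: stays Co
```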
Re: PUA
Why does this have to be in 'plain text'?

Plain text can be streams or strings. For streams, such a mechanism might make sense, if you could identify a compelling case that's not better handled by HTML, XML, etc. For strings, embedding font names in front of characters just violates some implicit assumptions, e.g. that the average string is 'short', that the number of bytes is a small and at least probabilistically determinable multiple of the number of characters, etc. etc. Not to forget that strings are often assumed to be the plainest of plain text.

A lot of architectures will break if you violate these implicit assumptions by hosting a mini-markup inside a string. And for at least half of them (my scientific estimate) performance will prevent them from doing anything about it, so you are stuck.

The language tagging scheme was designed for use with a string-based protocol, but one where the protocol contained the rules for interpreting any tagging. What you are proposing is something that's supposed to just infect any run of characters without warning. Who's going to implement this, why, where, and when?

A./

At 04:34 AM 10/20/03 +0200, Chris Jacobs wrote:

> ----- Original Message -----
> From: Doug Ewell [EMAIL PROTECTED]
> To: Unicode Mailing List [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; Tom Gewecke [EMAIL PROTECTED]
> Sent: Sunday, October 19, 2003 8:32 PM
> Subject: Re: Klingons and their allies - Beyond 17 planes
>
>> jameskass at att dot net wrote:
>>
>>> In addition to the problem of the OS substituting improper glyphs
>>> from inappropriate fonts unexpectedly, there's often a problem with
>>> line breaking. Since the PUA has no properties, some applications
>>> seem to ignore the space character and break lines arbitrarily,
>>> splitting words in the middle.
>>
>> That's exactly what happens in my sample pages. I didn't think it was
>> because the PUA had no properties so much as default properties, which
>> (as Thomas Chan indicated) might be Han-based or Han-influenced.
>> You can always switch to a font that will display glyphs for your PUA
>> characters, but it's harder to adapt a rendering engine to observe PUA
>> character properties.
>
> One problem is that there seems to be no way in plain-text Unicode to
> specify who is in charge of a particular interpretation of the PUA.
>
> As I understand the position of the designers of Unicode, they
> definitely don't want to be in charge of this, and want to let the
> users of the PUA fight it out among themselves.
>
> Nevertheless, I think that if Unicode doesn't want to decide how the
> PUA is to be interpreted, it should at the very least provide a
> mechanism by which a user of the PUA can specify which interpretation
> he prefers. I plan to propose such a mechanism: I want to propose a
> character with the following properties:
>
>     Scalar Value: U+E0002
>
>     This starts a PUA interpretation selector tag. The content of the
>     tag is a font family name. For all PUA characters between this tag
>     and the corresponding Cancel tag, the copyright holder of the font
>     is the sole authority on how the PUA should be interpreted.
>
> Any comments?

>> In any case, I am absolutely certain :-) :-) that the arbitrary
>> mid-word line breaking is what has discouraged would-be readers from
>> pointing out the typo (since fixed) in my transcription of a Dorothy
>> Parker poem:
>>
>> http://users.adelphia.net/~dewell/sopp-ew.html
>>
>> -Doug Ewell
>> Fullerton, California
>> http://users.adelphia.net/~dewell/
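Chris's proposed selector can be sketched as an encoder. Note that U+E0002 is only his proposed scalar value (it is an unassigned code point, not part of any standard), and I assume here that the name is spelled in the existing Plane-14 tag characters and closed with U+E007F CANCEL TAG, by analogy with the language tagging scheme.

```python
# Proposed (hypothetical) PUA interpretation selector, per Chris Jacobs:
# U+E0002 opens the tag, the font family name follows in tag characters
# (U+E0020..U+E007E), and U+E007F (CANCEL TAG) ends its scope.
PUA_SELECTOR = "\U000E0002"
CANCEL_TAG = "\U000E007F"

def tag_pua_run(font_family: str, pua_text: str) -> str:
    """Wrap pua_text in a selector naming the font that interprets it.
    font_family must be printable ASCII to map onto tag characters."""
    name = "".join(chr(0xE0000 + ord(c)) for c in font_family)
    return PUA_SELECTOR + name + pua_text + CANCEL_TAG
```

A consumer would do the reverse: strip the run, look up the named font's PUA conventions, and apply them to the enclosed characters, which is exactly where Asmus's objections about strings and implicit assumptions bite.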
Re: PUA
on 2003-10-19 19:34 Chris Jacobs wrote:

> One problem is that there seems to be no way in plaintext unicode to
> specify who is in charge of a particular interpretation of the PUA.

At last! Another use for Plane 14! :-)

--
Curtis Clark http://www.csupomona.edu/~jcclark/
Mockingbird Font Works http://www.mockfont.com/