RE: PUA (BMP) planned characters HTML tables
On August 11, I replied to Robert Wheelock:

>> I remember a website that has tables for certain PUA precomposed
>> accented characters that aren’t yet in Unicode (things like:
>> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
>> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).
>
> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.

I missed the possible significance of the Latvian comma below vs. Marshallese cedilla, which captured most of the ensuing discussion and morphed into a discussion about different user communities and group identity.

I'd like to restate, since I think the point may have been lost, that for the OTHER characters Robert mentioned:

> H/h-acute, capital T-dieresis, capital H-underbar, acute accented
> Cyrillic vowels, Cyrillic ER/er-caron, ...

there does not appear to be any conflicting usage between different user communities, and no particular difficulty in rendering or otherwise processing these as combining sequences, using up-to-date fonts and rendering engines.

I suppose Philippe's example of Võro might factor into whether different groups prefer different appearances for h́, but otherwise these user-perceived characters seem to be non-controversial.

So to reiterate, these characters appear vanishingly unlikely to be atomically encoded, "yet" or ever, for good reason.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 7:49 PM, James Kass via Unicode wrote:

> On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:
> > Empirically, it has been observed that some distinctions that are
> > claimed by users, standards developers or implementers were de-facto
> > not honored by type developers (and users selecting fonts) as long as
> > the native text doesn't contain minimal pairs.
>
> Quickly checked a couple of older on-line PDFs and both used the comma below unabashedly.
>
> Quoting from this page (which appears to be more modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm
>
> "Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon booj jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa in ṃōṃkaj kar ..."
>
> It seems that users are happy to employ a dot below in lieu of either a comma or cedilla. This newer web page is from a book published in 1978. There's a scan of the original book cover. Although the book title is all caps hand printing it appears that commas were used.
>
> The Marshallese orthography which uses commas/cedillas is fairly recent, replacing an older scheme devised by missionaries. Perhaps the actual users have already resolved this dilemma by simply using dots below.

That may be the case for Marshallese. It wouldn't surprise me.

My comments were based on a different case of the same kinds of diacritics below (other languages), and at the time we consulted typographic samples, including newsprint, that were using pre-Unicode technologies. In that sense it was a cleaner case, because there was no influence from what Unicode did or didn't do.

Now, having said that, I do get it that some materials, like textbooks, online class materials, etc., need to be prepared and printed using the normative style for the given orthography. But that is a far cry from claiming that all text in a given language is invariably done only one way.

A./
Re: PUA (BMP) planned characters HTML tables
On Wed, 14 Aug 2019 23:32:37 + James Kass via Unicode wrote:

> U+0149 has a compatibility decomposition. It has been deprecated and
> is not rendered identically on my system.
> 'n ʼn
> ( ’n )

Compatibility decompositions are quite a mix, but are generally expected to render differently. If they were expected to render the same, they would normally be canonical decompositions. U+0149 and its decomposition naturally render very differently with a monospaced font. The same goes for the Roman numerals that the Far East gave us.

Richard.
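The canonical/compatibility distinction discussed above can be inspected directly in the character database. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `decomposition_kind` is made up for illustration and is not part of any Unicode API.

```python
import unicodedata

def decomposition_kind(ch: str) -> str:
    """Classify a character's decomposition mapping (hypothetical helper).

    unicodedata.decomposition() returns "" when there is no mapping,
    a bare list of code points for a canonical mapping, and a
    "<tag> ..." prefixed list for a compatibility mapping.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    return "compatibility" if mapping.startswith("<") else "canonical"

N_PRECEDED_BY_APOSTROPHE = "\u0149"  # the deprecated character from the thread
L_WITH_CEDILLA = "\u013B"            # a canonically decomposable letter
ROMAN_NUMERAL_ONE = "\u2160"         # one of the Roman numerals

for ch in (N_PRECEDED_BY_APOSTROPHE, L_WITH_CEDILLA, ROMAN_NUMERAL_ONE):
    print(f"U+{ord(ch):04X}", decomposition_kind(ch), unicodedata.decomposition(ch))

# Canonical normalization (NFD) leaves U+0149 alone; only the
# compatibility form (NFKD) replaces it with U+02BC U+006E.
print(unicodedata.normalize("NFD", N_PRECEDED_BY_APOSTROPHE) == N_PRECEDED_BY_APOSTROPHE)
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", N_PRECEDED_BY_APOSTROPHE)])
```

Because only canonical mappings take part in NFC/NFD, a renderer has no obligation to draw U+0149 and its compatibility decomposition alike, which fits the observation that the two look different.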
Re: PUA (BMP) planned characters HTML tables
On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:

> Empirically, it has been observed that some distinctions that are
> claimed by users, standards developers or implementers were de-facto
> not honored by type developers (and users selecting fonts) as long as
> the native text doesn't contain minimal pairs.

Quickly checked a couple of older on-line PDFs and both used the comma below unabashedly.

Quoting from this page (which appears to be more modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm

"Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon booj jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa in ṃōṃkaj kar ..."

It seems that users are happy to employ a dot below in lieu of either a comma or cedilla. This newer web page is from a book published in 1978. There's a scan of the original book cover. Although the book title is all caps hand printing it appears that commas were used.

The Marshallese orthography which uses commas/cedillas is fairly recent, replacing an older scheme devised by missionaries. Perhaps the actual users have already resolved this dilemma by simply using dots below.
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 2:05 AM, James Kass via Unicode wrote:

> This presumes that the premise of user communities feeling strongly
> about the unacceptable aspect of the variants is valid. Since it has
> been reported and nothing seems to be happening, perhaps the casual
> users aren't terribly concerned. It's also possible that the various
> user communities have already set up their systems to handle things
> acceptably by installing appropriate fonts.

This is always a good question.

Empirically, it has been observed that some distinctions that are claimed by users, standards developers or implementers were de-facto not honored by type developers (and users selecting fonts) as long as the native text doesn't contain minimal pairs.

For example, some Latin fonts drop the dot on the lowercase i for stylistic reasons (or designers use dotless i in highly designed texts, like book covers, logos, etc.). That's usually not a problem for ordinary users for monolingual texts in, say, English; even though everyone agrees that the lowercase i is normally dotted, the absence isn't noticed by most, and is tolerated even by those who do notice it.

However, as soon as a user community sees a particular variant as signalling their group identity, they will be very vocal about it, even, interestingly enough, in cases where de-facto use (e.g. via font selection, and not forced by implementation defaults) doesn't match that preference.

As I said, we've seen this in the past for some features in some languages.

Now, which features become strongly identified with group identity is something that is subject to change over time; this makes it impossible to guarantee both absolute stability and perfect compatibility, especially if a combining mark that is used in decompositions needs to be disunified because the range of shapes changes from being stylistic to normative.

Before Unicode, with character sets limited to local use, you couldn't create minimal pairs (except if the variation was part of your language, like Turkish i with/without dot). So, if a font deviated and pushed the stylistic envelope, the non-preferred form, if used, would still necessarily refer to the local character; there was no way it could mean anything else.

With Unicode, that's changed, and instead of user communities treating this as a typographic issue (exclusive use of a preferred font) which is decentralized to document authors (and perhaps font vendors), it becomes a character coding issue that is highly visible and centralized.

That in turn can lead to the issue becoming politicized, not unlike some grammar issues where the supposedly "correct" form is far from universally agreed on in practice.

A./
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 4:32 PM, James Kass via Unicode wrote:

> If a character gets deprecated, can its decomposition type be changed
> from canonical to compatibility?

Simple answer: No.

--Ken
Re: PUA (BMP) planned characters HTML tables
On 2019-08-14 7:50 PM, Richard Wordingham via Unicode wrote:

> I think you'd also have to change the reference glyph of LATIN LOWER
> CASE I WITH HEART to show a heart. That's valid because the UCD trumps
> the code charts, and no Unicode-compliant process may deliberately
> render differently from LATIN LOWER CASE I WITH HEART.

U+0149 has a compatibility decomposition. It has been deprecated and is not rendered identically on my system.

'n ʼn
( ’n )

If a character gets deprecated, can its decomposition type be changed from canonical to compatibility?
Re: PUA (BMP) planned characters HTML tables
On Wed, 14 Aug 2019 09:05:02 + James Kass via Unicode wrote:

> The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's
> only in there because of legacy. Its presence guarantees
> round-tripping with legacy data but it isn't needed for modern data
> or display. Urge Groups One and Two to encode their data with the
> desired combiner and educate font engine developers about the
> deprecation. As the rendering engines get updated, the system
> substitution of the wrongly named precomposed glyph will go away.

I think you'd also have to change the reference glyph of LATIN LOWER CASE I WITH HEART to show a heart. That's valid because the UCD trumps the code charts, and no Unicode-compliant process may deliberately render differently from LATIN LOWER CASE I WITH HEART.

Richard.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-12 8:30 AM, Andrew West wrote:

> This issue was discussed at WG2 in 2013
> (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf),
> when there was a recommendation to encode precomposed letters L and N
> with cedilla *with no decomposition*, but that solution does not seem
> to have been taken up by the UTC.

Group One dots their lowercase "i" letters with little flowers and Group Two dots theirs with little hearts. Group Two considers flowers unacceptable and Group One rejects hearts.

Because of legacy character sets there's a precomposed character encoded called "LATIN LOWER CASE I WITH HEART", but it was misnamed and is normally drawn with a flower instead.

Group Two tries to encode "LATIN LOWER CASE I" plus "COMBINING HEART" to get the thing to display properly. But because there's a decomposition involved, the font engine substitutes the glyph mapped to "LATIN LOWER CASE I WITH HEART" in the display for the string "LATIN LOWER CASE I" plus "COMBINING HEART". This thwarts Group Two because they still get the flower.

The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's only in there because of legacy. Its presence guarantees round-tripping with legacy data but it isn't needed for modern data or display. Urge Groups One and Two to encode their data with the desired combiner and educate font engine developers about the deprecation. As the rendering engines get updated, the system substitution of the wrongly named precomposed glyph will go away.

This presumes that the premise of user communities feeling strongly about the unacceptable aspect of the variants is valid. Since it has been reported and nothing seems to be happening, perhaps the casual users aren't terribly concerned. It's also possible that the various user communities have already set up their systems to handle things acceptably by installing appropriate fonts.
Re: PUA (BMP) planned characters HTML tables
On Mon, 12 Aug 2019 at 02:27, James Kass via Unicode wrote:

>
> On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:
> > If you are thinking of these as potential future additions to the standard,
> > keep in mind that accented letters that can already be represented by a
> > combination of letter + accent will not ever be encoded. This is one of the
> > longest-standing principles Unicode has.

People seem to be ignoring the fact that Marshallese and Latvian both use L and N with cedilla, but with completely different glyph shapes:

> In January 2013, the Unicode Technical Committee discussed issues for the
> representation of Marshallese orthography. In particular, Marshallese uses
> the Latin script and requires the letters l, m, n, and o with cedilla.
> Latvian orthography uses the Latin script and requires the letters g, k, l,
> n, and r with comma below. For Marshallese, it is unacceptable to display
> cedillas as commas below. Conversely, for Latvian, it is unacceptable to
> display commas below as cedillas.

However, as fonts have been following Latvian practice for these letters (cedilla is displayed as a comma below) since before Unicode, Marshallese users cannot get their desired outcome using standard Unicode combining diacritical marks unless they apply a font specially designed for Marshallese -- which you can never guarantee if you are writing an email or posting on twitter, etc.

This issue was discussed at WG2 in 2013 (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf), when there was a recommendation to encode precomposed letters L and N with cedilla *with no decomposition*, but that solution does not seem to have been taken up by the UTC.

Andrew
Re: PUA (BMP) planned characters HTML tables
On Mon, 12 Aug 2019 01:21:42 + James Kass via Unicode wrote:

> There was a time when populating the PUA with precomposed glyphs was
> necessary for printing or display, but that time has passed.

There is still the issue that in pure X one can't put sequences of characters on a key; if the application doesn't invoke an input method one is stuck. Useful 20-year-old proprietary code may be totally unable to use modern font capabilities. Don't forget the Cobol Y10k joke.

On Ubuntu at least, there was a period when Emacs couldn't access X-based input methods from an English locale. The work-around: use a Japanese locale plus the vanilla lack of internationalisation in the interface, or Emacs's very convenient alternative keyboard capability for text input as opposed to commands. The bug turned out to be in the definition of the locales, i.e. in privileged data beyond the purview of Emacs.

As to the need for the PUA, writing fonts to cope with Tai Tham rendering engines is not easy, and it's no surprise that the PUA is used online for a newspaper that uses the Tai Tham script. The USE is too user-hostile for it to have helped if it had been available earlier. (It just ignored the regular expression published in 2007, which is in L2/07-007R in the UTC document register and ISO/IEC JTC1/SC2/WG2/N3207R in ISO land.) Indeed, perhaps I should be researching the PUA encoding for Tai Tham. (My Tai Tham font Da Lekh started as proof of principle, for there is already an unpleasant amount of glyph sequence changing, some style-dependent. I couldn't see how to get rendering engine support even when it might be added. I was pleasantly surprised at how far from impossible Tai Tham layout was until the USE came along and made everything harder. I now have to work out which glyph instances have already been Indicly rearranged when I repair the clustering.)

Oh, and I seem to need some PUA codepoints for vowels that get stranded when line-breaks occur between the columns of an akshara. The proposals show this phenomenon in old(?) Pali text. Or is there any chance of getting them encoded?

Richard.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:

> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.

Good point. There was a time when populating the PUA with precomposed glyphs was necessary for printing or display, but that time has passed. Hopefully anyone seeking charts is transcoding older data into proper Unicode. This can be illustrated with the Marshallese combos mentioned earlier.

PUA:
Standard: ĻļM̧m̧ŅņO̧o̧

Well, that didn't work out as well as expected. But the standard Unicode is supported (more or less) by some of the core fonts installed here. Nothing installed here displays anything useful for the PUA characters. A decent OpenType font designed with Marshallese in mind should work just fine with the combiners.

The fact is that the standard characters will survive and can be universally exchanged. And there are plenty of web page charts showing the standard characters.
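To make the "standard" row above concrete, here is a small sketch of how the eight Marshallese letters behave when written as base letter + U+0327 COMBINING CEDILLA and then normalized to NFC. It assumes Python's standard `unicodedata` module; the helper name `nfc_code_points` is hypothetical.

```python
import unicodedata

COMBINING_CEDILLA = "\u0327"
MARSHALLESE_BASES = "LlMmNnOo"  # the letters from the example above

def nfc_code_points(base: str) -> list[str]:
    """Return the code points of NFC(base + COMBINING CEDILLA) (hypothetical helper)."""
    composed = unicodedata.normalize("NFC", base + COMBINING_CEDILLA)
    return [f"U+{ord(c):04X}" for c in composed]

for base in MARSHALLESE_BASES:
    print(base, nfc_code_points(base))

# L, l, N, n compose to the legacy precomposed letters (U+013B, U+013C,
# U+0145, U+0146); M, m, O, o have no precomposed form and stay as a
# two-code-point combining sequence.
```

Either way the text is ordinary, exchangeable Unicode. How the cedilla is actually drawn under each letter remains a matter for the font.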
RE: PUA (BMP) planned characters HTML tables
Robert Wheelock wrote:

> I remember a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).

If you are thinking of these as potential future additions to the standard, keep in mind that accented letters that can already be represented by a combination of letter + accent will not ever be encoded. This is one of the longest-standing principles Unicode has.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 4:07 AM, Robert Wheelock via Unicode wrote:

> Hello!
>
> I remember a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).
> Where was it at?! I still want to get the information.
>
> Thank You!

It sounds familiar but I can't place it. I tried the SIL pages first, as did Richard Wordingham apparently.

https://blogfonts.com/dehuti.font

This font has material in the PUA including Marshallese glyphs with cedillas:

L (E382 & E394), M (E3A6 & E3BB), N (E3CE & E3DE), O (E429 & E465)

These appear to be PUA characters which the font developer has mapped in addition to the SIL PUA mappings.
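For anyone checking older data for such font-specific code points before transcoding, a rough sketch of a scan for BMP Private Use Area characters (U+E000..U+F8FF) follows. The function name `find_bmp_pua` and the sample string are invented for illustration; the only code point taken from the message above is E382.

```python
import unicodedata

BMP_PUA = range(0xE000, 0xF900)  # U+E000..U+F8FF inclusive

def find_bmp_pua(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every BMP private-use character in text.

    Hypothetical helper: it only locates the characters. What they mean
    depends on a private agreement (for example, a particular font).
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in BMP_PUA
    ]

# Made-up sample: plain letters around U+E382, one of the PUA code
# points listed for the font above.
sample = "Ma" + "\uE382" + "o"
print(find_bmp_pua(sample))

# The general category of every private-use character is "Co".
print(unicodedata.category("\uE382"))
```

Nothing in the text itself says which letter a PUA code point stands for, so any mapping back to standard sequences has to come from the font's own documentation.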
Re: PUA (BMP) planned characters HTML tables
On Sun, 11 Aug 2019 00:07:05 -0400 Robert Wheelock via Unicode wrote:

> I remember a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic
> ER/er-caron, ...). Where was it at?! I still want to get the
> information. Thank You!

You may mean https://www.eki.ee/letter. Once there, you'll want to make a query by Unicode range, e.g. e000-f8ff. It doesn't seem to refer to the relevant agreement. You could start hunting for agreements at https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA

Most of the characters you mention are scheduled to be assigned their own codepoint on the Greek kalends. They are precluded by policy because they would need to be composition exclusions to avoid making text in NFC cease to be in NFC.

I first thought of the SIL PUA at https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=PUA_home , but they knew better than to include most of them.

Richard.
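The NFC argument above can be illustrated with a short sketch using Python's standard `unicodedata` module. The helper `is_nfc` is hypothetical, and U+0958 DEVANAGARI LETTER QA is used purely as an example of an existing composition exclusion; it is not one of the characters under discussion.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if text is unchanged by NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", text) == text

# h + COMBINING ACUTE ACCENT has no precomposed form, so the sequence
# is already in NFC today and must remain so under the stability policy.
H_ACUTE = "h\u0301"
print(is_nfc(H_ACUTE))

# An existing composition exclusion: U+0958 has a canonical
# decomposition, but NFC never recomposes it, so the precomposed
# character itself is not NFC.
DEVANAGARI_QA = "\u0958"
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", DEVANAGARI_QA)])
print(is_nfc(DEVANAGARI_QA))
```

A newly encoded precomposed h-acute would have to behave like U+0958: present in the standard, but replaced by the combining sequence under every normalization form. That leaves little reason to encode it.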
RE: PUA (BMP) planned characters HTML tables
Hello!

I remember a website that has tables for certain PUA precomposed accented characters that aren’t yet in Unicode (things like: Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...). Where was it at?! I still want to get the information.

Thank You!

Robert Lloyd Wheelock