Re: Private Use areas
Hi

I have now found the following document.

http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

William Overington

Friday 31 August 2018

Original message
From: wjgo_10...@btinternet.com
Date: 2018/08/31 - 21:43 (GMTDT)
To: m...@kli.org, unicode@unicode.org
Subject: Re: Private Use areas

Hi

Thank you for your posts from earlier today.

Actually I learned about JSON yesterday and I am thinking that using JSON could well be a good idea. I found a helpful page with diagrams.

http://www.json.org/

Although I hope that a format of recording information about the properties of particular uses of Private Use Area characters will become implemented as a practicality, and that that format can be applied in practice where desired, and indeed I would be happy to participate in a group project, I do not know enough about Unicode properties to play a major role or to lead such a project.

William Overington

Friday 31 August 2018
Re: Private Use areas
Hi Thank you for your posts from earlier today. Actually I learned about JSON yesterday and I am thinking that using JSON could well be a good idea. I found a helpful page with diagrams. http://www.json.org/ Although I hope that a format of recording information about the properties of particular uses of Private Use Area characters will become implemented as a practicality, and that that format can be applied in practice where desired, and indeed I would be happy to participate in a group project, I do not know enough about Unicode properties to play a major role or to lead such a project. William Overington Friday 31 August 2018
Re: Private Use areas
On 08/28/2018 04:26 AM, William_J_G Overington via Unicode wrote:

> Hi
>
> Mark E. Shoulson wrote:
>
> > I'm not sure what the advantage is of using circled characters instead of plain old ascii.
>
> My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters.

What if circled characters are used in the text encoded in the file? They're characters too, people use them and all. Whenever you designate some characters to be used in a way outside their normal meaning, you have the problem of how to use them *with* their normal meaning. So there are various escaping schemes and all. So in XML, all characters have their normal meanings, except <, >, and &, which mean something special and change the interpretations of other nearby characters (so "bold" is a word in English that appears in the text, but "<b>" is part of an instruction to the renderer that doesn't appear in the text). And the price is that those three characters have to be expressed differently (&lt; &gt; &amp;).

I don't really see what you gain by branding some large swath of Unicode ("circled characters") as "special" and not meaning their usual selves, and for that matter making these hard-to-type characters *necessary* for using your scheme, when you could do something like what XML does, and say "everything between < and > is to be interpreted specially, and there, these characters have the following meanings" and then have some other way of expressing those two reserved characters. (Not saying you need to do it XML's way, but something like that: reserve a small number of characters that have to be escaped, not some huge chunk.)

> My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format.

That's another way of saying that this is a markup format which accepts a large variety of plain texts. Because you ARE talking about making a "particular markup format," just a different and new one.

I guess there's not even any reason for me to argue the point, though, since it is up to you how to design your markup language, and you can take advice (or not) from anyone you like. Draw up some design, find some interested people, start a discussion, and work it out. (But not here; this list is for discussing Unicode.)

~mark
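
The escaping approach described in this message, reserving a handful of characters and giving each one an escaped spelling, can be sketched in a few lines of Python. This is only an illustration and not part of any proposal on the list: the names `escape_text` and `unescape_text` are made up for the example, and the escaped spellings are simply the ones XML uses.

```python
# Reserved characters and their escaped spellings, borrowed from XML.
# "&" is listed first so that escaping it does not touch the "&" that
# the later replacements introduce.
ESCAPES = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;")]


def escape_text(text):
    """Make plain text safe to embed next to <...> markup."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape_text(text):
    """Recover the original plain text from its escaped form."""
    # Undo in reverse order so "&amp;lt;" becomes "&lt;", not "<".
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text
```

Only three characters ever need escaping; every other character, circled or not, keeps its ordinary meaning in the text.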
Re: Private Use areas
On 08/28/2018 11:58 AM, William_J_G Overington via Unicode wrote: Asmus Freytag wrote: There are situations where an ad-hoc markup language seems to fulfill a need that is not well served by the existing full-fledged markup languages. You find them in internet "bulletin boards" or services like GitHub, where pure plain text is too restrictive but the required text styles purposefully limited - which makes the syntactic overhead of a full-featured mark-up language burdensome. I am thinking of such an ad-hoc special purpose markup language. I am thinking of something like a special purpose version of the FORTH computer language being used but with no user definitions, no comparison operations and no loops and no compiler. Just a straight run through as if someone were typing commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces between commands. For example, circled R might mean use Right-to-left text display. That starts to sound no longer "ad-hoc", but that is not a well-defined term anyway. You're essentially describing a special-purpose markup language or protocol, or perhaps even programming language. Which is quite reasonable; you should (find some other interested people and) work out some of the details and start writing up parsers and such I am thinking that there could be three stacks, one for code points and one for numbers and one for external reference strings such as for accessing a web page or a PDF (Portable Document Format) document or listing an International Standard Book Number and so on. Code points could be entered by circled H followed by circled hexadecimal characters followed by a circled character to indicate Push onto the code point stack. Numbers could be entered in base 10, followed by a circled character to mean Push onto the number stack. 
A later circled character could mean to take a certain number of code points (maybe just 1, or maybe 0) from the character stack and a certain number of numbers (maybe just 1, or maybe just 0) from the number stack and use them to set some property. It could all be very lightweight software-wise, just reading the characters of the sequence of circled characters and obeying them one by one just one time only on a single run through, with just a few, such as the circled digits, each having its meaning dependent upon a state variable such as, for a circled digit, whether data entry is currently hexadecimal or base 10. I still don't see why you're fixated on using circled characters. You're already dealing with a markup-language type setup, why not do what other markup schemes do? You reserve three or four characters and use them to designate when other characters are not being used in their normal sense but are being used as markup. In XML, when characters are inside '<>' tags, they are not "plain text" of the document, but they mean other things—perhaps things like "right-to-left" or "reference this web page" and so forth, which are exactly the kinds of things you're talking about here. If you don't want to use plain ascii characters because then you couldn't express plain ascii in your text, you're left with exactly the same problem with circled characters: you can't express circled characters in your text. While that is a smaller problem, it can be eliminated altogether by various schemes used by XML or RTF or lightweight markup languages. Reserve a few special characters to give meanings to the others, and arrange for ways to escape your handful of reserved characters so you can express them. More straightforward to say "you have to escape <, >, and & characters" than to say "you have to escape all circled characters." 
Anyway, this is clearly a whole new high-level protocol you need (or want) to work out, which would *use* Unicode (just like XML and JSON do), but doesn't really affect or involve it (Unicode is all about the "plain text"). This is kind of getting off-topic, so get some people interested and start a mailing list to discuss it. Good luck! ~mark
Re: CLDR (was: Private Use areas)
On 31/08/18 07:27 Janusz S. Bień via Unicode wrote: […] > > Given NamesList.txt / Code Charts comments are kept minimal by design, > > one couldn’t simply pop them into XML or whatever, as the result would be > > disappointing and call for completion in the aftermath. Yet another task > > competing with CLDR survey. > > Please elaborate. It's not clear for me what do you mean. These comments are designed for the Code Charts and as such must not be disproportionate in exhaustivity. Eg we have lists of related languages ending in an ellipsis. Once this is popped into XML, ie extracted from NamesList.txt to be fed in an extensible and unconstrained format (without any constraint as of available space, number and length of comments, and so on), any lack is felt as a discriminating neglect, and there will be a huge rush adding data. Yet Unicode hasn’t set up products where that data could be published, ie not in the Code Charts (for the abovementioned reason), not in ICU so far as the additional information involved does not match a known demand on user side (localizing software does not mean providing scholarly exhaustive information about supported characters). The use will be in character pickers providing every available information about a given character. That is why Unicode is to prioritize CLDR for CLDR users, rather than extra information for the web. > > > Reviewing CLDR data is IMO top priority. > > There are many flaws to be fixed in many languages including in English. > > A lot of useful digest charts are extracted from XML there, > > Which XML? where? More precisely it is LDML, the CLDR-specific XML. What I called “digest charts” are the charts found here: http://www.unicode.org/cldr/charts/34/ The access is via this page: http://cldr.unicode.org/index/downloads where the charts are in the Charts column, while the raw data is under SVN Tag. > > > and we really > > need to go through the data and correct the many many errors, please. 
> > Some time ago I tried to have a close look at the Polish locale and > found the CLDR site prohibitively confusing. I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive for the access to the XML data (except when knowing about SubVersioN). Polish data is found here: https://www.unicode.org/cldr/charts/34/summary/pl.html The access is via the top of the "Summary" index page (showing root data): https://www.unicode.org/cldr/charts/34/summary/root.html You may wish to particularly check the By-Type charts: https://www.unicode.org/cldr/charts/34/by_type/index.html Here I’d suggest to first focus on alphabetic information and on punctuation. https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html Under Latin (table caption, without anchor) we find out what punctuation Polish has compared to other locales using the same script. The exact character appears when hovering the header row. Eg U+2011 NON-BREAKING HYPHEN is systematically missing, which is an error in almost every locale using hyphen. TC is about to correct that. Further you will see that while Polish is using apostrophe https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish CLDR does not have the correct apostrophe for Polish, as opposed eg to French. You may wish to note that from now on, both U+0027 APOSTROPHE and U+0022 QUOTATION MARK are ruled out in almost all locales, given the preferred characters in publishing are U+2019 and, for Polish, the U+201E and U+201D that are already found in CLDR pl. Note however that according to the information provided by English Wikipedia: https://en.wikipedia.org/wiki/Quotation_mark#Polish Polish also uses single quotes, that by contrast are still missing in CLDR. Now you might understand what I meant when pointing that there are still many errors in many languages in CLDR, including in English. Best regards, Marcel > > Best regards > > Janusz > > -- > , > Janusz S. 
Bien > emeryt (emeritus) > https://sites.google.com/view/jsbien > >
Re: CLDR (was: Private Use areas)
The XML files in these folders: https://unicode.org/repos/cldr/tags/latest/common/

But I agree. I spent an extreme amount of time getting somewhat used to cldr.unicode.org and the data repo, and I still have no clue where to find a concrete piece of information without digging into the site.

On Fri, 31 Aug 2018 at 07:22, Janusz S. Bień via Unicode wrote:

> On Thu, Aug 30 2018 at 2:27 +0200, unicode@unicode.org writes:
>
> [...]
>
> > Given NamesList.txt / Code Charts comments are kept minimal by design,
> > one couldn’t simply pop them into XML or whatever, as the result would be
> > disappointing and call for completion in the aftermath. Yet another task
> > competing with CLDR survey.
>
> Please elaborate. It's not clear for me what do you mean.
>
> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
>
> Which XML? where?
>
> > and we really
> > need to go through the data and correct the many many errors, please.
>
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.
>
> Best regards
>
> Janusz
>
> --
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
CLDR (was: Private Use areas)
On Thu, Aug 30 2018 at 2:27 +0200, unicode@unicode.org writes: [...] > Given NamesList.txt / Code Charts comments are kept minimal by design, > one couldn’t simply pop them into XML or whatever, as the result would be > disappointing and call for completion in the aftermath. Yet another task > competing with CLDR survey. Please elaborate. It's not clear for me what do you mean. > Reviewing CLDR data is IMO top priority. > There are many flaws to be fixed in many languages including in English. > A lot of useful digest charts are extracted from XML there, Which XML? where? > and we really > need to go through the data and correct the many many errors, please. Some time ago I tried to have a close look at the Polish locale and found the CLDR site prohibitively confusing. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Re: Private Use areas
On 29 August 2018 at 06:47 "Janusz S. Bień via Unicode" wrote:

> > Storing this information in a font, by hook or crook, would lock users
> > of those PUA characters into that font. At that rate, you might as well
> > use ASCII-hacked fonts, as we did 25 years ago.

I don't see that at all. The obvious way in the sfnt format, used by OpenType, is as a table consisting entirely of the XML file. It is quite easy to add a table to an unsigned sfnt font, and even easier to extract a table consisting entirely of UTF-8 text, though ASCII would be even easier, from a font file.

> Storing the information in a font is inappropriate not only for the
> technical reasons, as I wrote recently (on Thu, Aug 23 2018)
>
> > Fonts are for *rendering*, new characters and variants are more and
> > more often needed for *input* of real life old texts with sufficient
> > precision.

1. There are existing methods of associating a font with a text. Not using a font needs a new scheme for associating a set of PUA properties with a portion of a file. The font also serves as a code chart. It can also hold information on how characters combine, which is notoriously beyond the capability of code charts.

2. Registries can vanish.

3. In practice, a file needs to retain an association with a specialist font. Preserving the font should preserve its content, but there are pruning techniques (e.g. WOFF2) that may remove this content.

Richard.
Re: Private Use areas
On 29/08/18 07:55, Janusz S. Bień via Unicode wrote: > > On Tue, Aug 28 2018 at 9:43 -0700, unicode@unicode.org writes: > > On August 23, 2011, Asmus Freytag wrote: > > > >> On 8/23/2011 7:22 AM, Doug Ewell wrote: > >>> Of all applications, a word processor or DTP application would want > >>> to know more about the properties of characters than just whether > >>> they are RTL. Line breaking, word breaking, and case mapping come to > >>> mind. > >>> > >>> I would think the format used by standard UCD files, or the XML > >>> equivalent, would be preferable to making one up: […] > >> > >> The right answer would follow the XML format of the UCD. > >> > >> That's the only format that allows all necessary information contained > >> in one file, > > For me necessary are also comments and crossreferences contained in > NamesList.txt. Do I understand correctly that only "ISO Comment > properties" are included in the file? Even that comment field is obsoleted. But it’s unclear to me what exactly it was providing from ISO. > > >> and it would leverage of any effort that users of the > >> main UCD have made in parsing the XML format. > >> > >> An XML format shold also be flexible in that you can add/remove not > >> just characters, but properties as needed. > >> > >> The worst thing do do, other than designing something from scratch, > >> would be to replicate the UnicodeData.txt layout with its random, but > >> fixed collection of properties and insanely many semi-colons. None of > >> the existing UCD txt files carries all the needed data in a single > >> file. Curiously, UnicodeData.txt is lacking the header line. That makes it unflexible. I never wondered why the header line is missing, probably because compared to the other UCD files, the file looks really odd without a file header showing at least the version number and datestamp. It’s like the file was made up for dumb parsers unable to handle comment delimiters, and never to be upgraded to do so. 
But I like the format, and that’s why at some point I submitted feedback asking for an extension. Indeed we could use more information than what is yielded by the UCD minus NamesList.txt (which we may not parse, as per its file header).

Given NamesList.txt / Code Charts comments are kept minimal by design, one couldn’t simply pop them into XML or whatever, as the result would be disappointing and call for completion in the aftermath. Yet another task competing with CLDR survey.

Reviewing CLDR data is IMO top priority. There are many flaws to be fixed in many languages including in English. A lot of useful digest charts are extracted from XML there, and we really need to go through the data and correct the many many errors, please.

Unlike XML, human readability of CSV may not be immediate. Yes, you simply cannot always count the semicolons and remember the property name from the value position if it isn’t obvious by itself. But we use spreadsheets. At least some people do. That’s where the magic works. Looking up things in a spreadsheet is a good way to find out about wrong property values. Looks like handling files only programmatically gets everything screwed up.

Marcel
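
As an aside on counting semicolons: because UnicodeData.txt has no header line, a script has to supply the field names itself. The sketch below shows one way to do that in Python. The labels in `UNICODE_DATA_FIELDS` are descriptive names chosen for this example, following the field order documented in UAX #44, and `parse_unicode_data_line` is a made-up helper, not an existing tool.

```python
# Descriptive labels for the 15 semicolon-separated fields of one
# UnicodeData.txt line, in file order. The labels are this example's own.
UNICODE_DATA_FIELDS = [
    "Code_Point", "Name", "General_Category", "Canonical_Combining_Class",
    "Bidi_Class", "Decomposition", "Decimal_Value", "Digit_Value",
    "Numeric_Value", "Bidi_Mirrored", "Unicode_1_Name", "ISO_Comment",
    "Simple_Uppercase_Mapping", "Simple_Lowercase_Mapping",
    "Simple_Titlecase_Mapping",
]


def parse_unicode_data_line(line):
    """Turn one UnicodeData.txt line into a dict keyed by field label."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(UNICODE_DATA_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(UNICODE_DATA_FIELDS), len(values)))
    return dict(zip(UNICODE_DATA_FIELDS, values))


record = parse_unicode_data_line(
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record["General_Category"], record["Simple_Lowercase_Mapping"])
```

Loading the same file into a spreadsheet with these labels as the header row gives the kind of by-eye checking described above.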
Re: Private Use areas - Vertical Text
On 29 August 2018 at 13:05 Andrew West via Unicode wrote:

> I tested with Word 2007, and normal PUA characters from my font were
> displayed with vertical orientation in a vertical text box, but Plane
> 15 PUA characters were rotated.

And then the original question is whether a font can suppress this rotation. For example, it is entirely possible that the rotation could be eliminated by the vrt2 OpenType feature mapping a Zhuang PUA glyph to an identical glyph.

Richard.
Re: Private Use areas - Vertical Text
On Wed, 29 Aug 2018 at 11:18, wrote: > > I was using a change horizontal to vertical text feature in office, the > PUA characters being from plane 15. I tested with Word 2007, and normal PUA characters from my font were displayed with vertical orientation in a vertical text box, but Plane 15 PUA characters were rotated. I also tested with Word 2016, and both normal PUA characters and Plane 15 PUA characters were displayed with vertical orientation in a vertical text box, as you want, although there were vertical spacing issues with the Plane 15 PUA characters which suggest that the vertical metrics tables (vhea and vmtx) in the font are not being applied for Plane 15 characters (or it could be a problem with my font). Andrew
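
For anyone repeating this kind of test, it helps to record which Private Use range each test character falls in, since the BMP and Plane 15 ranges behaved differently here. The following Python sketch does that; `pua_range` and `is_private_use` are hypothetical helper names invented for the example, and the code says nothing about how Word itself decides orientation.

```python
import unicodedata

# The three Private Use ranges defined by Unicode.
PUA_RANGES = [
    ("BMP", 0xE000, 0xF8FF),
    ("Plane 15", 0xF0000, 0xFFFFD),
    ("Plane 16", 0x100000, 0x10FFFD),
]


def pua_range(char):
    """Return the label of the Private Use range containing char, or None."""
    cp = ord(char)
    for label, first, last in PUA_RANGES:
        if first <= cp <= last:
            return label
    return None


def is_private_use(char):
    """True if the character's General_Category is Co (Private Use)."""
    return unicodedata.category(char) == "Co"
```

Both `"\uE000"` and `"\U000F0000"` are General_Category Co, so any difference in their vertical behaviour comes from the application or font handling rather than from that property.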
Re: Private Use areas - Vertical Text
Dear Andrew, I was using a change horizontal to vertical text feature in office, the PUA characters being from plane 15. Regards John On 2018-08-29 16:32, Andrew West via Unicode wrote: On Wed, 29 Aug 2018 at 05:07, via Unicode wrote: Yes, as Richard says when CJK Zhuang text is displayed vertically whilst the Zhuang characters in Unicode remain upright, but those with PUA codepoints are rotated 90°. John, you did not explain by what mechanism you were trying to display vertical PUA Zhuang text. I can display vertically-oriented PUA-encoded CJKVZ ideographs in vertical layout in web pages using CSS, as demonstrated in this test page: http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html The PUA characters display with correct orientation under Windows 10 on the Edge, Chrome and Firefox browsers. The test page only fails under IE, but we are not meant to use IE anymore anyway. Andrew
Re: Private Use areas - Vertical Text
On Wed, 29 Aug 2018 at 05:07, via Unicode wrote: > > Yes, as Richard says when CJK Zhuang text is displayed vertically whilst > the Zhuang characters in Unicode remain upright, but those with PUA > codepoints are rotated 90°. John, you did not explain by what mechanism you were trying to display vertical PUA Zhuang text. I can display vertically-oriented PUA-encoded CJKVZ ideographs in vertical layout in web pages using CSS, as demonstrated in this test page: http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html The PUA characters display with correct orientation under Windows 10 on the Edge, Chrome and Firefox browsers. The test page only fails under IE, but we are not meant to use IE anymore anyway. Andrew
Re: Private Use areas - Vertical Text
On Tue, 28 Aug 2018 at 18:15, WORDINGHAM RICHARD via Unicode wrote: > > Unicode is doing what it can in this matter: > > (a) Zhuang PUA characters are being made individually obsolete. Not by a nebulous entity called "Unicode", or even by the Unicode Consortium per se, but by the hard work over many years by individual experts such as John Knightley. Andrew
Re: Private Use areas - Vertical Text
John Knightley wrote, > Yes, as Richard says when CJK Zhuang text is displayed > vertically whilst the Zhuang characters in Unicode remain > upright, but those with PUA codepoints are rotated 90°. > This is because the PUA characters are treated like English > text, which are correctly rotated 90°. ... > > ... > ... the need for PUA Zhuang characters remains, and will > so for decades to come. A possible work-around would be to have two fonts for PUA Zhuang, one for horizontal text and one for vertical. The one for the vertical text would have the glyphs in the font pre-rotated 90° anti-clockwise. This would require font switching when switching from horizontal to vertical layout, of course.
Re: Private Use areas
On Tue, Aug 28 2018 at 9:43 -0700, unicode@unicode.org writes: > On August 23, 2011, Asmus Freytag wrote: > >> On 8/23/2011 7:22 AM, Doug Ewell wrote: >>> Of all applications, a word processor or DTP application would want >>> to know more about the properties of characters than just whether >>> they are RTL. Line breaking, word breaking, and case mapping come to >>> mind. >>> >>> I would think the format used by standard UCD files, or the XML >>> equivalent, would be preferable to making one up: Right. I was not so quick to state this so early, but 2 years ago I wrote to the MUFI list: --8<---cut here---start->8--- On Sat, Jan 02 2016 at 12:35 CET, odd.hau...@uib.no writes: [...] > Note the permanent URI at the University Library in Bergen. This will > in all likelihood be the last recommendation of its kind (and > certainly the last edited by the undersigned), so please look out for > new solutions (databases or the like) on the MUFI web site! I think that one of the forms, perhaps even the primary one, should follow the original Unicode Character Database and the output of Unibook (http://www.unicode.org/unibook/). The idea can be tested by converting the present recommendation to this form. Unfortunately I'm unable to contribute myself to this task. One of the advantages would be that the various character browsers can be adapted relatively easily to provide info about the MUFI characters. A simpler variant of this idea is to use Unibook-like format to document fonts. A quick-and-dirty tools for this purpose has been prepared by a student of mine: https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ https://bitbucket.org/jsbien/unicode-ucd-parser A sample output of the tools is available at https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf (the font is also quick-and-dirty and unfinished work). --8<---cut here---end--->8--- Unfortunately there was no reaction. >> >> The right answer would follow the XML format of the UCD. 
>> >> That's the only format that allows all necessary information contained >> in one file, For me necessary are also comments and crossreferences contained in NamesList.txt. Do I understand correctly that only "ISO Comment properties" are included in the file? >> and it would leverage of any effort that users of the >> main UCD have made in parsing the XML format. >> >> An XML format shold also be flexible in that you can add/remove not >> just characters, but properties as needed. >> >> The worst thing do do, other than designing something from scratch, >> would be to replicate the UnicodeData.txt layout with its random, but >> fixed collection of properties and insanely many semi-colons. None of >> the existing UCD txt files carries all the needed data in a single >> file. > > I don't know if or how I responded 7 years ago, but at least today, I > think this is an excellent suggestion. > > If the goal is to encourage vendors to support PUA assignments, using an > exceedingly well-defined format (UAX #42) sitting atop one of the most > widely used base formats ever (XML), with all property information in a > single repository (per PUA scheme), would be great encouragement. I think we need also the data in the format acceptable by UniBook. > I've devised lots of novel file formats and I think this is one use > case where that would be a real hindrance. > Storing this information in a font, by hook or crook, would lock users > of those PUA characters into that font. At that rate, you might as well > use ASCII-hacked fonts, as we did 25 years ago. Storing the information in a font is inappropriate not only for the technical reasons, as I wrote recently (on Thu, Aug 23 2018) > Fonts are for *rendering*, new characters and variants are more and > more often needed for *input* of real life old texts with sufficient > precision. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
RE: Private Use areas - Vertical Text
Dear Richard and Peter,

apologies for the lack of clarity. Let me try to explain below.

On 2018-08-29 01:13, WORDINGHAM RICHARD via Unicode wrote:

> On 27 August 2018 at 15:22 Peter Constable via Unicode wrote:
>
> > Layout engines that support CJK vertical layout do not rely on the 'vert' feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° and switch to using vertical glyph metrics. The 'vert' feature is used to substitute vertical alternate glyphs as needed, such as for punctuation that isn't automatically rotated (and would probably need a differently-positioned alternate in any case).
> >
> > Cf. UAX 50.
>
> There have been some pretty confused statements. I believe the observed problem is that PUA characters for Zhuang CJK ideographs get rotated when displayed vertically rather than left-to-right.

Yes, as Richard says, when CJK Zhuang text is displayed vertically, the Zhuang characters in Unicode remain upright, but those with PUA codepoints are rotated 90°. This is because the PUA characters are treated like English text, which is correctly rotated 90°. The orientation of the CJK characters in this case appears to depend on which block they belong to. As Peter points out, this does not seem to match UAX 50.

> Unicode is doing what it can in this matter:
>
> (a) Zhuang PUA characters are being made individually obsolete.

Yes and no. Whilst a thousand Zhuang characters have been encoded and two thousand have been submitted via IRG, the number of PUA Zhuang characters is about the same or increasing. In 2006, at the start, just under 6k PUA points were used; presently there are over 8k, over 6k of which have not been submitted, and the earliest any future submissions can be encoded is 2026. That being said, the number of more common Zhuang characters needing PUA support is coming down. So whilst individual characters are being resolved, the need for PUA Zhuang characters remains, and will do so for decades to come.

> (b) By default, PUA characters have the value of Vertical_orientation=upright as do CJK ideographs.

Noted above.

Regards
John

> For CJK ideographs, it is not clear to me when the vert feature (if present) would be applied. Is it only for some codepoints (vo=tu), or is it for all that the engine expects to be displayed 'upright' in vertical text?
>
> The vrtr feature (if present) would be applied when glyphs are to be rotated. Is it for all such glyphs, or only those for which rotation is expected to be inadequate (vo=tr)?
>
> It seems that feature vrt2 is to be applied to all glyphs; perhaps rotation is the default behaviour when there is no look-up value for a glyph that the engine expects to be rotated. The truly difficult case would be when there is no attempt to apply a look-up - possibly vrtr would not apply to \p{vo=r}.
>
> I would expect that defining the lookup vrt2 or vrtr to map Zhuang glyphs to themselves (or something prerotated) would cure the problem. This would not work for sequences of Zhuang ideographs treated as RTL text - but that is unlikely to happen.
>
> Richard.
RE: Private Use areas - Vertical Text
On 27 August 2018 at 15:22 Peter Constable via Unicode wrote:

> Layout engines that support CJK vertical layout do not rely on the 'vert'
> feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90°
> and switch to using vertical glyph metrics. The 'vert' feature is used to
> substitute vertical alternate glyphs as needed, such as for punctuation that
> isn't automatically rotated (and would probably need a differently-positioned
> alternate in any case).
>
> Cf. UAX 50.

There have been some pretty confused statements. I believe the observed problem is that PUA characters for Zhuang CJK ideographs get rotated when displayed vertically rather than left-to-right.

Unicode is doing what it can in this matter:

(a) Zhuang PUA characters are being made individually obsolete.

(b) By default, PUA characters have the value of Vertical_orientation=upright as do CJK ideographs.

For CJK ideographs, it is not clear to me when the vert feature (if present) would be applied. Is it only for some codepoints (vo=tu), or is it for all that the engine expects to be displayed ‘upright’ in vertical text?

The vrtr feature (if present) would be applied when glyphs are to be rotated. Is it for all such glyphs, or only those for which rotation is expected to be inadequate (vo=tr)?

It seems that feature vrt2 is to be applied to all glyphs; perhaps rotation is the default behaviour when there is no look-up value for a glyph that the engine expects to be rotated. The truly difficult case would be when there is no attempt to apply a look-up - possibly vrtr would not apply to \p{vo=r}.

I would expect that defining the lookup vrt2 or vrtr to map Zhuang glyphs to themselves (or something prerotated) would cure the problem. This would not work for sequences of Zhuang ideographs treated as RTL text - but that is unlikely to happen.

Richard.
Re: Private Use areas
On August 23, 2011, Asmus Freytag wrote: > On 8/23/2011 7:22 AM, Doug Ewell wrote: >> Of all applications, a word processor or DTP application would want >> to know more about the properties of characters than just whether >> they are RTL. Line breaking, word breaking, and case mapping come to >> mind. >> >> I would think the format used by standard UCD files, or the XML >> equivalent, would be preferable to making one up: > > The right answer would follow the XML format of the UCD. > > That's the only format that allows all necessary information to be contained > in one file, and it would leverage any effort that users of the > main UCD have made in parsing the XML format. > > An XML format should also be flexible in that you can add/remove not > just characters, but properties as needed. > > The worst thing to do, other than designing something from scratch, > would be to replicate the UnicodeData.txt layout with its random, but > fixed collection of properties and insanely many semi-colons. None of > the existing UCD txt files carries all the needed data in a single > file. I don't know if or how I responded 7 years ago, but at least today, I think this is an excellent suggestion. If the goal is to encourage vendors to support PUA assignments, using an exceedingly well-defined format (UAX #42) sitting atop one of the most widely used base formats ever (XML), with all property information in a single repository (per PUA scheme), would be great encouragement. I've devised lots of novel file formats and I think this is one use case where that would be a real hindrance. Storing this information in a font, by hook or crook, would lock users of those PUA characters into that font. At that rate, you might as well use ASCII-hacked fonts, as we did 25 years ago. -- Doug Ewell | Thornton, CO, US | ewellic.org
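To make the suggestion concrete, here is a minimal Python sketch of reading PUA properties from a file in the UAX #42 XML layout. The namespace and the abbreviated attribute names (cp, na, gc, bc, lb) follow UAX #42; the sample record itself, including the property values given for U+E010, is invented for illustration, and load_pua_properties is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Invented sample: one PUA assignment described in the UAX #42 layout.
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <description>Hypothetical PUA scheme</description>
  <repertoire>
    <char cp="E010" na="LATIN CAPITAL LETTER A WITH MACRON AND BREVE"
          gc="Lu" bc="L" lb="AL"/>
  </repertoire>
</ucd>"""


def load_pua_properties(xml_text):
    """Return {code point: {property alias: value}} for single-char records."""
    root = ET.fromstring(xml_text)
    table = {}
    for char in root.iter("{%s}char" % UCD_NS):
        cp = char.get("cp")
        if cp is None:
            continue  # range records (first-cp/last-cp) are ignored here
        props = dict(char.attrib)
        props.pop("cp")
        table[int(cp, 16)] = props
    return table


properties = load_pua_properties(SAMPLE)
```

Because properties are plain attributes, a scheme can add or omit any of them per character, which is the flexibility described in the quoted message.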
Re: Private Use areas
Asmus Freytag wrote: > There are situations where an ad-hoc markup language seems to fulfill a need > that is not well served by the existing full-fledged markup languages. You > find them in internet "bulletin boards" or services like GitHub, where pure > plain text is too restrictive but the required text styles purposefully > limited - which makes the syntactic overhead of a full-featured mark-up > language burdensome. I am thinking of such an ad-hoc special purpose markup language. I am thinking of something like a special purpose version of the FORTH computer language being used but with no user definitions, no comparison operations and no loops and no compiler. Just a straight run through as if someone were typing commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces between commands. For example, circled R might mean use Right-to-left text display. I am thinking that there could be three stacks, one for code points and one for numbers and one for external reference strings such as for accessing a web page or a PDF (Portable Document Format) document or listing an International Standard Book Number and so on. Code points could be entered by circled H followed by circled hexadecimal characters followed by a circled character to indicate Push onto the code point stack. Numbers could be entered in base 10, followed by a circled character to mean Push onto the number stack. A later circled character could mean to take a certain number of code points (maybe just 1, or maybe 0) from the character stack and a certain number of numbers (maybe just 1, or maybe just 0) from the number stack and use them to set some property. 
It could all be very lightweight software-wise, just reading the characters of the sequence of circled characters and obeying them one by one just one time only on a single run through, with just a few, such as the circled digits, each having its meaning dependent upon a state variable such as, for a circled digit, whether data entry is currently hexadecimal or base 10. I am wondering how many PUA property variables there would need to be set for the system to be useful. The sequence could start with all of those PUA property values set at their default values so only those that needed changing need be explicitly set, though others could be explicitly set to the default values if a record were desired. William Overington Tuesday 28 August 2018
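The single-pass interpreter described above could be sketched in Python roughly as follows. Only circled H (start hexadecimal entry) and circled R (right-to-left display) come from the message; using circled P to push a code point and circled N to push a number are assumptions made purely for this illustration, as are all function and variable names. The third stack, for external reference strings, is omitted.

```python
def decircle(ch):
    """Map a circled letter or digit to plain ASCII, or return None."""
    cp = ord(ch)
    if 0x24B6 <= cp <= 0x24CF:            # circled capital letters A..Z
        return chr(ord("A") + cp - 0x24B6)
    if 0x24D0 <= cp <= 0x24E9:            # circled small letters a..z
        return chr(ord("a") + cp - 0x24D0)
    if cp == 0x24EA:                      # circled digit zero
        return "0"
    if 0x2460 <= cp <= 0x2468:            # circled digits 1..9
        return chr(ord("1") + cp - 0x2460)
    return None


def run(sequence):
    """Obey the commands once, left to right; return (properties, stacks)."""
    code_points, numbers, props = [], [], {}
    mode, buf = None, ""                  # state variable: None, "hex" or "dec"
    for ch in sequence:
        c = decircle(ch)
        if c is None:
            break                         # first non-circled character ends the run
        if mode == "hex" and c in "0123456789ABCDEF":
            buf += c
        elif mode == "hex" and c == "P":  # assumed: push onto code point stack
            code_points.append(int(buf, 16))
            mode, buf = None, ""
        elif mode == "dec" and c.isdigit():
            buf += c
        elif mode == "dec" and c == "N":  # assumed: push onto number stack
            numbers.append(int(buf))
            mode, buf = None, ""
        elif mode is None and c == "H":   # from the message: hexadecimal entry
            mode, buf = "hex", ""
        elif mode is None and c.isdigit():
            mode, buf = "dec", c
        elif mode is None and c == "R":   # from the message: right-to-left display
            props.setdefault(code_points.pop(), {})["direction"] = "RTL"
        else:
            raise ValueError("unexpected command %r" % c)
    return props, (code_points, numbers)
```

For example, the sequence circled H, E, 0, 0, 0, P, R would push U+E000 and then mark it as right-to-left. Note how circled digits and circled A to F change meaning with the state variable, as the message anticipates.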
Re: Private Use areas
James Kass wrote: > Non-conformant? Well, it's probably overkill anyway. A simpler method of > identifying which PUA convention is being used for a file would be to either have the first line of the file being something like [PUA1] or to have the file name be something like MYFILE.TXTPUA1. Where "PUA1" equals the CSUR. Other numbers (PUA2, PUA3, etc.) for other PUA conventions. The problem that then arises is that a registry is needed for what those numbers mean, such as PUA01728. So what if someone writes explaining his designs for glyphs for the language of the people who live in the northern part of the fifth planet from the sun in the science fiction novel he is writing? Is registration granted instantly upon request or is there a threshold of some sort? What if lots of people do that, including some people wanting a registry code number for the various emoji that they want? If there is a threshold of proving usage and so on, or of showing that the designs have been produced AT a business or AT a college or whatever, then the system will only work for some users of the Private Use Areas. My opinion is that the system needs to be free-standing, with each usage possibly self-contained or with an external reference to a document that is available. Care would need to be taken to send a copy of any such document to deposit libraries such as The British Library so as to ensure long-term conservation. William Overington Tuesday 28 August 2018
Re: Private Use areas
Hi Mark E. Shoulson wrote: > I'm not sure what the advantage is of using circled characters instead of > plain old ascii. My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters. My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format. William Overington Tuesday 28 August 2018
Re: Private Use areas
On 8/27/2018 2:20 PM, Rebecca Bettencourt via Unicode wrote: > That sounds like a non-conformant use of characters in the U+24xx block. Well, you are an expert on these things and I do not understand as to with what it would be non-conformant. A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ and not as a signal to process what follows as anything other than plain text. Not correct. If that was literally true, then all HTML, XML, CSS, C, C#, Java, Python source code files and their compilers would be non-conformant. It's more like, "if a process treats a sequence of bytes as Unicode plain text, then the bytes corresponding to the codes assigned to ⓅⓊⒶⒹⒶⓉⒶ just stand for ⓅⓊⒶⒹⒶⓉⒶ. Any meaning is imparted by the (human) reader." However, if the process treats the file as a source file in a markup language, there's nothing that prevents it from assigning particular interpretations to ⓅⓊⒶⒹⒶⓉⒶ, including, but not limited to not displaying these code points as characters. The interpretation of the remainder of the file may well be conformant to the Unicode Standard, just as the display of the contents of many HTML elements is usually conformant to the Unicode Standard. What you are proposing is a higher-level protocol, whether you realize it or not. Correct, the rub here is that all these schemes that treat characters as both syntax and text depending on context amount to mark-up languages and are therefore ipso-facto no longer plain text (except if displayed as source code, but already applying syntax coloring would no longer be purely treating the data as plain text). In-band markup has thus a dual nature as plain text and rich text, depending on how it is processed. Unfortunately your higher-level protocol has a serious flaw in that it cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ". That could probably be remedied by the usual techniques. Also, seeing a bunch of circled alphanumeric characters in a document ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ. 
:) There are plenty of already-existing higher-level protocols (you mentioned one: XML) that could be used to provide information about PUA characters, and they are all much better suited to that purpose than what you are proposing. There are situations where an ad-hoc markup language seems to fulfill a need that is not well served by the existing full-fledged markup languages. You find them in internet "bulletin boards" or services like GitHub, where pure plain text is too restrictive but the required text styles purposefully limited - which makes the syntactic overhead of a full-featured mark-up language burdensome. Too bad that there's been no "winner" among these, and therefore no universally accepted one. If so, it might have presented an obvious target for a PUA extension. A./
Re: Private Use areas
But there's nothing wrong with proposing a higher-level protocol; indeed, that's what Ken Whistler was saying: you need a protocol to transmit this information. It's metadata, so it will perforce be a higher-level protocol of some kind, whether transmitting actually out-of-band or reserving a piece of the file for metadata. That's fine. I'm not sure what the advantage is of using circled characters instead of plain old ascii. You have to set off your reserved area somehow, and I don't think using circled chars is the least obtrusive way to do it. You could use XML; that would be pretty well-suited to the task, but maybe it's overkill. If all you need is to reference some "standard" PUA interpretation (per James Kass' take on this, not William Overington's), then just a header like "[PUA1]" would work just fine. (Compare emacs with things like "-*- encoding: utf-8 -*-" or whatever.) For larger chunks of meta-info, XML might be a good choice, but even then, it could be an XML *header* to an otherwise ordinary text file. Yes, you'd have to delimit it somehow, and probably have a top header (a "magic number") to signal the protocol, but that's doable. For applications not supporting this protocol, such a setup is probably easier for the eye to skip past (even if it's long) than a bunch of circled letters. A protocol like that is outside of Unicode's scope (just like XML is), but it's certainly something you could write up and try to standardize and get used, with or without the support of ISO. People are coming up with file formats all the time (and if you really want to use circled characters, go ahead. That's something for you to consider in the design phase of the project). ~mark On 08/27/2018 05:20 PM, Rebecca Bettencourt via Unicode wrote: > That sounds like a non-conformant use of characters in the U+24xx block. Well, you are an expert on these things and I do not understand as to with what it would be non-conformant. 
A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ and not as a signal to process what follows as anything other than plain text. What you are proposing is a higher-level protocol, whether you realize it or not. Unfortunately your higher-level protocol has a serious flaw in that it cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ". Also, seeing a bunch of circled alphanumeric characters in a document ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ. There are plenty of already-existing higher-level protocols (you mentioned one: XML) that could be used to provide information about PUA characters, and they are all much better suited to that purpose than what you are proposing.
Re: Private Use areas
On 08/27/2018 05:18 PM, James Kass via Unicode wrote: William Overington wrote, On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington wrote: Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. Use U+2473 as if it were a circled space. ⓌⒽⓎ◯ⓃⓄⓉ◯ⓊⓈⒺ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ◯ ⒻⓄⓇ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ? And what's wrong with the ASCII digits? ~mark
Re: Private Use areas
James Kass wrote: > If a user has thousands of files using PUA characters, and all the files are > using the same PUA convention, why would each file need to contain metadata > for each PUA character used within? (Rhetorical) Because each such file would then be self-contained and free-standing. Such metadata need not necessarily be a huge quantity of data. William Overington Monday 27 August 2018
Re: Private Use areas
> > > That sounds like a non-conformant use of characters in the U+24xx block. > > Well, you are an expert on these things and I do not understand as to with > what it would be non-conformant. > > A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ and not as a signal to process what follows as anything other than plain text. What you are proposing is a higher-level protocol, whether you realize it or not. Unfortunately your higher-level protocol has a serious flaw in that it cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ". Also, seeing a bunch of circled alphanumeric characters in a document ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ. There are plenty of already-existing higher-level protocols (you mentioned one: XML) that could be used to provide information about PUA characters, and they are all much better suited to that purpose than what you are proposing.
Re: Private Use areas
William Overington wrote, On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington wrote: > Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters > U+24B6 .. U+24E9. > > Use U+2473 as if it were a circled space. ⓌⒽⓎ◯ⓃⓄⓉ◯ⓊⓈⒺ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ◯ ⒻⓄⓇ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ?
Re: Private Use areas
Here is the reply that I sent to Peter Constable and to the other people to whom he wrote. Unlike for Mr Constable and for many other people, all of my posts have to be passed by the moderator, and I know why that is the situation. Though that situation was not imposed by a named official of Unicode Inc. acting in a stated official capacity. So my opportunities to defend my ideas are conditional. William Overington Monday 27 August 2018 Original message From : wjgo_10...@btinternet.com Date : 2018/08/27 - 21:18 (GMTDT) To : beckie...@gmail.com, verd...@wanadoo.fr, peter...@microsoft.com, wjgo_10...@btinternet.com, m...@kli.org, kenwhist...@att.net, richard.wording...@ntlworld.com, jameskass...@gmail.com Subject : Re: Private Use areas Well, it is a pity that you did not send your reply to the Unicode mailing list. > That sounds like a non-conformant use of characters in the U+24xx block. Well, you are an expert on these things and I do not understand as to with what it would be non-conformant. It seems to me that for many years some people have wanted a way to convey information about the meaning of Private Use Area characters used in a document in an unobtrusive way within the document. The format that I am suggesting could be the basis of a way to do that. I really do not understand the problem. Ken Whistler wrote: >>> > 1. Define a *protocol* for reliable interchange of custom character >>> > property information about PUA code points. Some people use XML for things where two characters are used in a different manner. A quick downbeat quip comment about my ideas with no explanation is not helpful and might because of your standing cause some people not to consider the idea even-handedly for concern of offending you. I am reminded of a British film of 1955 called The Colditz Story. It used to be one of the regular films on the television years ago. I do not know whether it was ever shown in America, maybe, or maybe it is just a British thing. 
https://www.youtube.com/results?search_query=The+Colditz+story https://en.wikipedia.org/wiki/The_Colditz_Story The reason why I am reminded of that film is that one of the British prisoners devises a plan for a group of British prisoners to escape from Colditz disguised as German officers and just walk out of the gate. This is ridiculed as impossible because it has been tried before at various prisoner of war camps and the people have always been detected as British prisoners. The man suggesting the scheme then points out that the detection is because there is clearly something questionable about the direction from which the disguised prisoners arrive, such as from a prisoners' hut, that is the problem, not the quality of the disguises or the basic soundness of the idea. The man then suggests that they walk out of the German Officers' mess building. Please bear in mind that walking out of the door of the mess building does not mean actually being in the mess, it is a matter of going down the flight of stairs from a storage area, (the stairs having been accessed from under the stage of the castle theatre) walking past the entrance to the dining room and then out of the door, supposedly on their way back, after dinner, to their billets in the village. This done while a concert put on by some others of the prisoners, and attended by the senior German officers, is going on in the castle theatre. So, it is the bit about an idea coming from the wrong direction that reminds me of the film. 
https://www.youtube.com/watch?v=0eeSYvxVFUw https://www.youtube.com/watch?v=iY8jMkIbwDM https://www.youtube.com/watch?v=QxHsElyFsTI William Overington Monday 27 August 2018 Original message >From : peter...@microsoft.com Date : 2018/08/27 - 20:33 (GMTDT) To : wjgo_10...@btinternet.com, jameskass...@gmail.com, richard.wording...@ntlworld.com, m...@kli.org, beckie...@gmail.com, verd...@wanadoo.fr Subject : RE: Private Use areas That sounds like a non-conformant use of characters in the U+24xx block. Peter From: Unicode On Behalf Of William_J_G Overington via Unicode Sent: Monday, August 27, 2018 2:00 AM To: jameskass...@gmail.com; richard.wording...@ntlworld.com; m...@kli.org; beckie...@gmail.com; verd...@wanadoo.fr Cc: unicode@unicode.org Subject: Re: Private Use areas Hi How about the following method. In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context. http://www.unicode.org/charts/PDF/U2460.pdf Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hex
Re: Private Use areas
Peter Constable wrote, > That sounds like a non-conformant use of characters in the U+24xx block. Non-conformant? Well, it's probably overkill anyway. A simpler method of identifying which PUA convention is being used for a file would be to either have the first line of the file being something like [PUA1] or to have the file name be something like MYFILE.TXTPUA1. Where "PUA1" equals the CSUR. Other numbers (PUA2, PUA3, etc.) for other PUA conventions. If a user has thousands of files using PUA characters, and all the files are using the same PUA convention, why would each file need to contain metadata for each PUA character used within? (Rhetorical) The "prior agreement" part about PUA usage means the user would know in advance how to display the text properly.
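The simpler convention suggested above is easy to detect mechanically. The Python sketch below checks for a first line such as [PUA1] or a file name ending in .TXTPUA1; the exact syntax (square brackets, decimal number, case-insensitive suffix) is an assumption extrapolated from the two examples in the message, and pua_convention is a hypothetical helper name. What each number means would still depend on prior agreement.

```python
import re

# Assumed syntax, extrapolated from the examples "[PUA1]" and "MYFILE.TXTPUA1".
HEADER_RE = re.compile(r"^\[PUA(\d+)\]\s*$")
SUFFIX_RE = re.compile(r"\.TXTPUA(\d+)$", re.IGNORECASE)


def pua_convention(first_line, file_name=""):
    """Return the PUA convention number declared by a file, or None."""
    match = HEADER_RE.match(first_line.lstrip("\ufeff"))  # tolerate a BOM
    if match:
        return int(match.group(1))
    match = SUFFIX_RE.search(file_name)
    if match:
        return int(match.group(1))
    return None
```

A reader that gets None back would simply treat the file as ordinary text with uninterpreted PUA code points.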
RE: Private Use areas
This was meant to go to the list. From: Peter Constable Sent: Monday, August 27, 2018 12:33 PM To: wjgo_10...@btinternet.com; jameskass...@gmail.com; richard.wording...@ntlworld.com; m...@kli.org; beckie...@gmail.com; verd...@wanadoo.fr Subject: RE: Private Use areas That sounds like a non-conformant use of characters in the U+24xx block. Peter From: Unicode <unicode-boun...@unicode.org> On Behalf Of William_J_G Overington via Unicode Sent: Monday, August 27, 2018 2:00 AM To: jameskass...@gmail.com; richard.wording...@ntlworld.com; m...@kli.org; beckie...@gmail.com; verd...@wanadoo.fr Cc: unicode@unicode.org Subject: Re: Private Use areas Hi How about the following method. In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context. http://www.unicode.org/charts/PDF/U2460.pdf Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. 
Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 Original message From : unicode@unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : d...@ewellic.org Cc : unicode@unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode <unicode@unicode.org> wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind of thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I.
Re: Private Use areas
Hi How about the following method. In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context. http://www.unicode.org/charts/PDF/U2460.pdf Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 Original message >From : unicode@unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : d...@ewellic.org Cc : unicode@unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. 
You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I.
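For readers who want to see the PUAINFO proposal in concrete terms, the following Python sketch separates the leading circled sequence from the rest of a file. The code point ranges are those named in the message (U+2460..U+2473 for circled numbers, with U+2473 as the circled space; U+24B6..U+24EA for circled letters and zero). Two points are this sketch's own reading, not part of the proposal as written: carriage returns and line feeds are allowed inside the sequence, and the first other non-circled character ends it. The helper names are hypothetical.

```python
def circle(text):
    """Encode ASCII letters, digits and spaces as circled characters."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x24B6 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x24D0 + ord(ch) - ord("a")))
        elif ch == "0":
            out.append("\u24EA")
        elif "1" <= ch <= "9":
            out.append(chr(0x2460 + ord(ch) - ord("1")))
        elif ch == " ":
            out.append("\u2473")          # circled 20 used as a circled space
        else:
            raise ValueError("no circled form for %r" % ch)
    return "".join(out)


MARKER = circle("PUAINFO")                # seven circled capital letters


def is_circled(ch):
    cp = ord(ch)
    return 0x2460 <= cp <= 0x2473 or 0x24B6 <= cp <= 0x24EA


def split_puainfo(text):
    """Return (metadata, body); metadata is empty if the marker is absent."""
    if not text.startswith(MARKER):
        return "", text
    end = len(MARKER)
    while end < len(text) and (is_circled(text[end]) or text[end] in "\r\n"):
        end += 1
    return text[len(MARKER):end], text[end:]
```

The sketch also makes the objection raised elsewhere in the thread visible: a document whose real text begins with the circled marker cannot be distinguished from one carrying metadata unless some escaping rule is added.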
RE: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
Layout engines that support CJK vertical layout do not rely on the 'vert' feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° and switch to using vertical glyph metrics. The 'vert' feature is used to substitute vertical alternate glyphs as needed, such as for punctuation that isn't automatically rotated (and would probably need a differently-positioned alternate in any case). Cf. UAX 50. Peter -Original Message- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Tuesday, August 21, 2018 3:02 AM To: unicode@unicode.org Subject: Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) On Tue, 21 Aug 2018 08:53:18 +0800 via Unicode wrote: > On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote: > > Still, maybe it > > doesn't really matter much: your special-purpose font can treat any > > codepoint any way it likes, right? > Not all properties come from the font. For example a Zhuang character > PUA font, which supplements CJK ideographs, does not rotate characters > 90 degrees, when change from RTL to vertical display of text. Isn't that supposed to be treated by an OpenType feature such as 'vert'? Or does the rendering stack get in the way? However, one might need reflowing text to be about 40% WJ. Richard.
Re: Private Use areas
> On 21 August 2018 at 01:04 "Mark E. Shoulson via Unicode" > wrote: > > It is kind of a bummer, though, that you can't experiment (easily? or at > all?) in the PUA with scripts that have complex behavior, or even > not-so-complex behavior like accents & combining marks, or RTL direction > (here, also, am I speaking true? Is there a block of RTL PUA also? I guess > there's always RLO, but meh.) Still, maybe it doesn't really matter much: > your special-purpose font can treat any codepoint any way it likes, right? > > ~mark > > Back in 2006, I was typing the Tai Tham script (then being proposed as the Lanna script) using the PUA and exploring the issue of selecting between what are now and based on the preceding character and between what are now and based on the preceding base character and its subscripts. I was also looking at using variation selectors to override the rules. I was using SIL Graphite fonts when they were getting intermittent support in OpenOffice and Firefox - my main display engine was WorldPad. Nowadays, SIL Graphite seems to be securely supported in LibreOffice and Firefox. Now, back then, Graphite was at least attempting to support RTL; I would expect the RTL support to work well by now. On the other hand, experimenting with OpenType is much harder. The best I've found is transcoding to a Latin range and using an ssxx feature to convert the Latin glyphs back to those for the complex script. I do that to render Tai Tham in Internet Explorer 11 on Windows 7; this complex scheme is a fallback for when the rendering engine fails. Richard.
Re: Emacs Verbose Character Entry (was Private Use Areas)
On Thu, Aug 23 2018 at 22:15 +0100, unicode@unicode.org writes: > On Thu, 23 Aug 2018 21:47:03 +0200 > "Janusz S. Bień via Unicode" wrote: > >> My needs are very simple, for example C-x 8 Return LATIN CAPITAL >> LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with >> the code E010. I can provide the list of names and codes. > > While it should obviously yield, if anything, or > for 'LATIN CAPITAL LETTER A WITH MACRON AND > BREVE', In my opinion there is no question what 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE' should yield, because the name should be absent on the name list. My example concerns names like 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]' 'COMBINING ABBREVIATION MARK SUPERSCRIPT UR ROUND R FORM [MUFI]' etc. [...] > The Emacs command "C-x 8 RET" expects the name of a single codepoint. It's OK and in my opinion it should stay this way. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Re: Private Use areas
On Fri, Aug 24 2018 at 16:12 +0300, e...@gnu.org writes: >> From: jsb...@mimuw.edu.pl (Janusz S. Bień) >> Cc: unicode@unicode.org, richard.wording...@ntlworld.com >> Date: Thu, 23 Aug 2018 21:47:03 +0200 >> >> I'm very glad you join the discussion. > > I'm sorry for not joining sooner. In my defense, I missed the > reference to Emacs, and the rest of the discussion is not really > interesting for me, as using PUA for new characters is not something I > have interest in or experience with. I don't think you missed anything important. > >> My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER >> A WITH MACRON AND BREVE [MUFI] should yield the character with the code >> E010. I can provide the list of names and codes. > > So you'd like to extend "C-x 8 RET" to recognize names of additional > characters and associate them with codepoints in the PUA area? That > shouldn't be hard to add. I would prefer extensibility over efficiency, I don't mind loading PUA information from a source declared somehow in .emacs.d., so I can change/expand the list of characters from time to time. > But is that all? won't you also want to tell Emacs about the > properties of those characters? Personally I would like additionally to be able to change the case of a letter or string, and I am willing to prepare the necessary information for MUFI characters. Displaying other properties would be nice, but for me this is not crucial. Moreover, somebody has to prepare the data... > or be able to set up fonts for displaying them? It would be nice. I haven't asked for it because I typeset my texts with XeTeX or LuaTeX and the input is more important for me than rendering. > IOW, would it be okay to have these > characters be "second-class citizens" in Emacs? For me it would be acceptable. BTW, I just got perhaps a crazy idea: what about treating a PUA declaration (as you probably noticed, there may be conflicting ones) as a separate coding system? 
Of course some mechanism for escaping the standard PUA interpretation would be needed. > >> > It is true that the Unicode related data is produced at build time, >> > but only some of that is actually recorded in the Emacs binary, the >> > rest is loaded upon demand. But all the data is stored in data >> > structures that are mutable, given some Lisp programming. >> >> I never was fluent in Lisp programming and by now I forgot almost >> everything I knew, so it's not a task for me. I was thinking about >> submitting a feature request, but I forgot also the proper procedures to >> do it. > > The proper procedure is to type "M-x report-emacs-bug RET" and then > describe the feature(s) you'd like to see added/improved. I will definitely remember now :-) > >> Moreover I had the impression that I'm the only person who needs >> it... > > That shouldn't stop you. Many a feature in Emacs started as a request > from a single individual. > >> > (It is not clear to me which part of the Unicode data you would like >> > to change; are you talking about adding characters to the list of >> > those defined by Unicode? If you are using the PUA codepoints, it's >> > possible that you will need to update Emacs's notion of PUA as well.) >> >> Yes, I would like the PUA codepoints to be handled analogically as the >> proper ones. What do you mean by Emacs's notion of PUA? > > Emacs knows about the PUA regions of the Unicode code-space, and > treats those codepoints specially. The features you request will > probably need to affect the PUA region as well, because the codepoints > you use should no longer be treated as PUA. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
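The name-based entry requested above amounts to consulting a user-supplied table before the standard name list. The following Python sketch illustrates the idea outside Emacs; it is not Emacs Lisp and says nothing about how "C-x 8 RET" is implemented. The one table entry (U+E010 for the MUFI name) is taken from the message; PUA_NAMES and char_by_name are hypothetical names, and a real table would be loaded from a user-maintained file.

```python
import unicodedata

# User-supplied PUA names; the "[MUFI]" suffix keeps them distinct from
# standard Unicode character names, as proposed in the message.
PUA_NAMES = {
    "LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]": 0xE010,
}


def char_by_name(name, pua_names=PUA_NAMES):
    """Resolve a character name, checking the private table first."""
    key = name.upper()
    if key in pua_names:
        return chr(pua_names[key])
    # Falls back to the standard names; raises KeyError if unknown.
    return unicodedata.lookup(name)
```

Keeping the private names in a separate table means conflicting PUA conventions can be swapped in and out without touching the standard data.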
Re: Private Use areas
Hi An approach that you might like to consider in relation to fonts is that it is possible to have in a font a Description field that consists of plain text. It is stored twice in the font, in two different ways, one of which is just plain text, possibly just ASCII. So if you had text such as $$$PUAB and so on in that Description field than a software application could search for all occurrences of $$$ and gather information for each set of data in that way, without needing separate OpenType tables. As an example of how information can be stored in the Description field here is a link to a font that I made years ago. If you download the font and open it is WordPad, the text can be read. The direct link is as follows. www.users.globalnet.co.uk/~ngo/SPANGBLU.TTF The font is also linked from the following web page, about a quarter of the way down the page. http://www.users.globalnet.co.uk/~ngo/fonts.htm The web pages encoded in the font are for three of the songs linked from the following page. http://www.users.globalnet.co.uk/~ngo/song0001.htm Best regards, William Overington Friday 24 August 2018 Original message >From : unicode@unicode.org Date : 2018/08/21 - 19:23 (GMTDT) To : unicode@unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode wrote: I think PUA users should provide the properties of the characters used in a form analogical to the Unicode itself, and the software should be able to use this additional information. I already provide this myself for my uses of the PUA as well as the CSUR and any vendor-specific agreements I can find: http://www.kreativekorp.com/charset/PUADATA/ Of course there is no way to get software to use this information. 
I have entertained the idea of being able to embed this information into the font itself as OpenType tables, e.g.: PUAB -> Blocks.txt PUAC -> CaseFolding.txt PUAW -> EastAsianWidth.txt PUAL -> LineBreak.txt PUAD -> UnicodeData.txt I've actually invented table names for the majority of UCD files, but those are probably the most relevant. The table names for the more obscure files get rather... creative, e.g.: PUA[ -> BidiBrackets.txt PUA] -> BidiMirroring.txt That alone may get some people to think twice about this idea. :P
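The table-name idea, and the suggestion of `$$$` markers in a font's Description field, can be tried out on plain strings before touching any font. The following is only a minimal sketch: the tag-to-file mapping is the one quoted in the message, but the marker syntax is an assumption (the thread says only "$$$PUAB and so on"), and the function name `find_pua_markers` is hypothetical.

```python
import re

# Tag-to-file mapping exactly as listed in the message.
PUA_TABLES = {
    "PUAB": "Blocks.txt",
    "PUAC": "CaseFolding.txt",
    "PUAW": "EastAsianWidth.txt",
    "PUAL": "LineBreak.txt",
    "PUAD": "UnicodeData.txt",
}

# ASSUMPTION: a marker is "$$$" immediately followed by a four-character tag.
# The thread does not define the syntax, so this pattern is illustrative only.
MARKER = re.compile(r"\$\$\$(.{4})")


def find_pua_markers(description):
    """Return (tag, UCD file name or None) for every $$$ marker found."""
    return [(tag, PUA_TABLES.get(tag)) for tag in MARKER.findall(description)]


if __name__ == "__main__":
    # Hypothetical Description text; the record contents are made up.
    sample = "Example font. $$$PUAB E000..E0FF; Example Block $$$PUAD E010;EXAMPLE"
    for tag, ucd_file in find_pua_markers(sample):
        print(tag, "->", ucd_file)
```

Tags that are not in the mapping (for instance the "creative" ones such as `PUA[`) come back paired with `None`, so a caller can decide whether to ignore them or report them.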
Re: Emacs Verbose Character Entry (was Private Use Areas)
> Date: Thu, 23 Aug 2018 22:15:10 +0100 > From: Richard Wordingham via Unicode > > On Thu, 23 Aug 2018 21:47:03 +0200 > "Janusz S. Bień via Unicode" wrote: > > > My needs are very simple, for example C-x 8 Return LATIN CAPITAL > > LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with > > the code E010. I can provide the list of names and codes. > > While it should obviously yield, if anything, or > for 'LATIN CAPITAL LETTER A WITH MACRON AND > BREVE', it would probably be more important to recognise formal > aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo > ling (U+0EA5 LAO LETTER LO LOOT), not to be be confused with the Lao > letter lo lot (a.k.a. ro rot), U+0EA5 LETTER LO LING. > > For , I prefer to type "A\_M_X", but then I learnt > XSAMPA. The Emacs command "C-x 8 RET" expects the name of a single codepoint. It should be possible to extend it (or perhaps provide a separate command) that produced named sequence of codepoints, such as those in the above examples, but there's no such feature as of now. If this would be a useful addition, please suggest that on the Emacs issue tracker (using "M-x report-emacs-bug"), and please include with your request the sources where we could find such named sequences to support. Thanks.
Re: Private Use areas
> From: jsb...@mimuw.edu.pl (Janusz S. Bień) > Cc: unicode@unicode.org, richard.wording...@ntlworld.com > Date: Thu, 23 Aug 2018 21:47:03 +0200 > > I'm very glad you join the discussion. I'm sorry for not joining sooner. In my defense, I missed the reference to Emacs, and the rest of the discussion is not really interesting for me, as using PUA for new characters is not something I have interest in or experience with. > My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER > A WITH MACRON AND BREVE [MUFI] should yield the character with the code > E010. I can provide the list of names and codes. So you'd like to extend "C-x 8 RET" to recognize names of additional characters and associate them with codepoints in the PUA area? That shouldn't be hard to add. But is that all? won't you also want to tell Emacs about the properties of those characters? or be able to set up fonts for displaying them? IOW, would it be okay to have these characters be "second-class citizens" in Emacs? > > It is true that the Unicode related data is produced at build time, > > but only some of that is actually recorded in the Emacs binary, the > > rest is loaded upon demand. But all the data is stored in data > > structures that are mutable, given some Lisp programming. > > I never was fluent in Lisp programming and by now I forgot almost > everything I knew, so it's not a task for me. I was thinking about > submitting a feature request, but I forgot also the proper procedures to > do it. The proper procedure is to type "M-x report-emacs-bug RET" and then describe the feature(s) you'd like to see added/improved. > Moreover I had the impression that I'm the only person who needs > it... That shouldn't stop you. Many a feature in Emacs started as a request from a single individual. > > (It is not clear to me which part of the Unicode data you would like > > to change; are you talking about adding characters to the list of > > those defined by Unicode? 
If you are using the PUA codepoints, it's > > possible that you will need to update Emacs's notion of PUA as well.) > > Yes, I would like the PUA codepoints to be handled analogically as the > proper ones. What do you mean by Emacs's notion of PUA? Emacs knows about the PUA regions of the Unicode code-space, and treats those codepoints specially. The features you request will probably need to affect the PUA region as well, because the codepoints you use should no longer be treated as PUA.
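The name-to-codepoint lookup requested here is easy to prototype outside Emacs. The sketch below is not how Emacs implements "C-x 8 RET"; it is a Python illustration that assumes the private names are supplied as `CODE;NAME` records in the style of UnicodeData.txt. The function names are hypothetical, and the only record used is the E010 example from the message.

```python
import unicodedata


def load_pua_names(lines):
    """Parse 'CODE;NAME' records (UnicodeData.txt-like) into a name -> char dict."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, name = line.split(";")[:2]
        table[name.upper()] = chr(int(code, 16))
    return table


def lookup_char(name, pua_names):
    """Try the private table first, then fall back to standard Unicode names."""
    key = name.strip().upper()
    if key in pua_names:
        return pua_names[key]
    return unicodedata.lookup(key)  # raises KeyError for unknown names


if __name__ == "__main__":
    names = load_pua_names(["E010;LATIN CAPITAL LETTER A WITH MACRON AND BREVE"])
    ch = lookup_char("LATIN CAPITAL LETTER A WITH MACRON AND BREVE", names)
    print("U+%04X" % ord(ch))
```

Consulting the private table before the standard names is a design choice made for this sketch; since conflicting PUA agreements exist, a real implementation would probably need a way to say which agreement (MUFI, CSUR, and so on) is in force.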
Re: Private Use areas
On Thu, Aug 23 2018 at 11:49 -0700, beckie...@gmail.com writes: > On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień wrote: > > > I already provide this myself for my uses of the PUA as well as the > > CSUR and any vendor-specific agreements I can find: > > > > http://www.kreativekorp.com/charset/PUADATA/ > > I would prefer to see the data in a repository, so others can can > comment and contribute. > > That is actually my intent for the future. Though it's not quite ready yet: > > https://github.com/kreativekorp/charset/tree/master/puadata Great! > > That's the data in a "pre-compiled" form; it's turned into a "proper" > PUADATA directory using this script: > > https://github.com/kreativekorp/charset/blob/master/bin/build-public.py > > As for "any vendor-specific agreements", do MUFI and LINCUA qualify? > > I certainly do want to see MUFI and LINCUA provided in this form, but > I put them in a different category along with CSUR. I basically have > three categories of PUA agreements: > > Fonts - PUA assignments specific to a font family, e.g. Constructium, > Fairfax, Nishiki-teki, Quivira, Junicode, etc. You are probably aware that Junicode 1.000, released in September 2017, supports in full MUFI 4.0 (released in December 2015). I don't know whether Junicode contains now any PUA characters which are not in MUFI. > > Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR, > MUFI, LINCUA, etc. > > Vendors - PUA assignments meant to be used by a single vendor or > platform, e.g. Adobe, Apple, etc. but also Linux, MirOS, etc. > > Thank you for those links by the way. I had tried to find charts for > MUFI in the past but had somehow been unsuccessful. Similar files for different purpose has been created by Mikkel Eide Eriksen: https://github.com/mikkelee/mufi-latex An earlier version of MUFI was incorporated in the ENRICH Gaiji bank: http://v2.manuscriptorium.com/apps/gbank/ You can download the source but it doesn't seem useful. 
A version of MUFI is available also as a searchable character database created by the present single-person MUFI board, i.e. Tarrin Wills, as a part of the beta version of a new MUFI site: http://skaldic.abdn.ac.uk/m.php?p=mufi Some time ago I wrote on the mufi-fonts list: --8<---cut here---start->8--- On Sun, Dec 03 2017 at 6:55 +0100, jsb...@mimuw.edu.pl writes: [...] > I wanted the file quickly to get an overview of the recently released > corpus of 16th century Polish, and it's seemed to me that the simplest > and fastest way is to convert the PDF recommendation in a semi-automatic > way. It was more cumbersome than I expected, but thanks to this approach > I've discovered a typo in the recommendation: letter I instead of digit > 1 in EAFI, the code for LATIN ENLARGED LETTER SMALL LIGATURE AE (p. 93 > in the code chart order version). > > For the planned extension of the program I need more info on MUFI > characters, preferably in the format of the UnicodeData.txt. This time > however I intend to make haste slowly, so I have a question: > > Is it possible to make publicly available for download the database > underlying http://skaldic.abdn.ac.uk/db.php?if=mufi&table=mufi_char? --8<---cut here---end--->8--- Unfortunately I got no answer to the question. > > Of course there is no way to get software to use this information. > > What kind of software do you have in mind? > > Unicode-related utilities, text editors to start with. You pretty much > hit the nail on the head with uniname and emacs as examples. :) Thanks! As for uniname by Bill Poser, I exchanged mails with him in 2011: --8<---cut here---start->8--- On Sun, Aug 28 2011 at 12:01 +0200, jsb...@mimuw.edu.pl writes: [...] > A student of mine wrote an alternative program according to my > specification. The program is GPLed and available with > > git clone http://students.mimuw.edu.pl/~findepi/unihistext unihistext Now https://bitbucket.org/jsbien/unihistext > > The source is ready for Debian packaging. 
> > I think the program deserves better distribution, but its author is no > longer interested in it. Would you be so kind as to consider either > including the program itself in your uniutils or extending your unidesc > with its features? > > Best regards > > Janusz On Sun, Aug 28 2011 at 16:03 -0700, billpos...@gmail.com writes: > In principle, sure. I'll have a look at it. --8<---cut here---end--->8--- Unfortunately nothing happened, and I thought I should not press the point. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Emacs Verbose Character Entry (was Private Use Areas)
On Thu, 23 Aug 2018 21:47:03 +0200 "Janusz S. Bień via Unicode" wrote: > My needs are very simple, for example C-x 8 Return LATIN CAPITAL > LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with > the code E010. I can provide the list of names and codes. While it should obviously yield, if anything, or for 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE', it would probably be more important to recognise formal aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo ling (U+0EA5 LAO LETTER LO LOOT), not to be confused with the Lao letter lo lot (a.k.a. ro rot), U+0EA3 LAO LETTER LO LING. For , I prefer to type "A\_M_X", but then I learnt XSAMPA. Richard.
Re: Private Use areas
On Thu, 23 Aug 2018 20:34:20 +0200 "Janusz S. Bień via Unicode" wrote: > This is a typical but IMHO obsolete perspective. Fonts are for > *rendering*, new characters and variants are more and more often > needed for *input* of real life old texts with sufficient precision. If we're talking about glyphs which don't actually correspond to new characters, then that sounds like a good case for private use variation selectors. To quote Tully, "Abusus non tollit usum". Richard.
Re: Private Use areas
On Thu, Aug 23 2018 at 22:17 +0300, e...@gnu.org writes: >> Date: Thu, 23 Aug 2018 20:30:52 +0200 >> Cc: Richard Wordingham >> From: "Janusz S. Bień via Unicode" >> >> >> and in Emacs - to my disappointed it looks like the Unicode data are >> >> set at the compile time, but perhaps this can be negotiated with the >> >> developers. >> > >> > Can you be more specific? >> >> I often search characters by name with C-x 8 Return. I would like to use >> it also for MUFI characters, I have already the name list (the example >> directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked >> very closely into the problem and don't remember now the details, but my >> impression was that it's not simple. > > What is "it" in the last sentence? IOW, what is not simple about that > with Emacs? I'm very glad you join the discussion. My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with the code E010. I can provide the list of names and codes. > > It is true that the Unicode related data is produced at build time, > but only some of that is actually recorded in the Emacs binary, the > rest is loaded upon demand. But all the data is stored in data > structures that are mutable, given some Lisp programming. I never was fluent in Lisp programming and by now I forgot almost everything I knew, so it's not a task for me. I was thinking about submitting a feature request, but I forgot also the proper procedures to do it. Moreover I had the impression that I'm the only person who needs it... > > (It is not clear to me which part of the Unicode data you would like > to change; are you talking about adding characters to the list of > those defined by Unicode? If you are using the PUA codepoints, it's > possible that you will need to update Emacs's notion of PUA as well.) Yes, I would like the PUA codepoints to be handled analogically as the proper ones. What do you mean by Emacs's notion of PUA? 
Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Re: Private Use areas
> Date: Thu, 23 Aug 2018 20:30:52 +0200 > Cc: Richard Wordingham > From: "Janusz S. Bień via Unicode" > > >> and in Emacs - to my disappointed it looks like the Unicode data are > >> set at the compile time, but perhaps this can be negotiated with the > >> developers. > > > > Can you be more specific? > > I often search characters by name with C-x 8 Return. I would like to use > it also for MUFI characters, I have already the name list (the example > directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked > very closely into the problem and don't remember now the details, but my > impression was that it's not simple. What is "it" in the last sentence? IOW, what is not simple about that with Emacs? It is true that the Unicode related data is produced at build time, but only some of that is actually recorded in the Emacs binary, the rest is loaded upon demand. But all the data is stored in data structures that are mutable, given some Lisp programming. (It is not clear to me which part of the Unicode data you would like to change; are you talking about adding characters to the list of those defined by Unicode? If you are using the PUA codepoints, it's possible that you will need to update Emacs's notion of PUA as well.)
Re: Private Use areas
On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień wrote: > > I already provide this myself for my uses of the PUA as well as the > > CSUR and any vendor-specific agreements I can find: > > > > http://www.kreativekorp.com/charset/PUADATA/ > > I would prefer to see the data in a repository, so others can can > comment and contribute. > That is actually my intent for the future. Though it's not quite ready yet: https://github.com/kreativekorp/charset/tree/master/puadata That's the data in a "pre-compiled" form; it's turned into a "proper" PUADATA directory using this script: https://github.com/kreativekorp/charset/blob/master/bin/build-public.py As for "any vendor-specific agreements", do MUFI and LINCUA qualify? > I certainly do want to see MUFI and LINCUA provided in this form, but I put them in a different category along with CSUR. I basically have three categories of PUA agreements: Fonts - PUA assignments specific to a font family, e.g. Constructium, Fairfax, Nishiki-teki, Quivira, Junicode, etc. Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR, MUFI, LINCUA, etc. Vendors - PUA assignments meant to be used by a single vendor or platform, e.g. Adobe, Apple, etc. but also Linux, MirOS, etc. Thank you for those links by the way. I had tried to find charts for MUFI in the past but had somehow been unsuccessful. > Of course there is no way to get software to use this information. > > What kind of software do you have in mind? > Unicode-related utilities, text editors to start with. You pretty much hit the nail on the head with uniname and emacs as examples. :)
Re: Private Use areas
On Thu, Aug 23 2018 at 17:26 +0100, unicode@unicode.org writes: > On Thu, 23 Aug 2018 17:39:15 +0200 > Philippe Verdy via Unicode wrote: > >> You make a confusion: I do not propose "hacking" existing codes, but >> instead adding new codes for private variations. It's then up to PUV >> sequence authors to choose an appropropriate base character that can >> have the properties they want to be inherited by the private-use >> variation sequence, or to choose a base character that will provide >> some reasonnable reading if rendererd as is (by renderers or fonts >> not implementing the pricate viaration sequence, give nthat they will >> also append a symbol for the PUV itself after the standard character). > > Variation sequences cannot be used to add new characters. Most PUA > characters are used to represent new characters. A > standard-conformant private variation sequence would generally achieve > the same effect as could be achieved by a font feature (typically one > of the cvxx, though possibly one of the ssxx), This is a typical but IMHO obsolete perspective. Fonts are for *rendering*, new characters and variants are more and more often needed for *input* of real life old texts with sufficient precision. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Re: Private Use areas
On Thu, Aug 23 2018 at 17:11 +0100, unicode@unicode.org writes: > On Thu, 23 Aug 2018 14:10:35 +0200 > "Janusz S. Bień via Unicode" wrote: > >> What kind of software do you have in mind? >> >> I'm primarily interested in the locally developed programs >> >> https://bitbucket.org/jsbien/unihistext/ >> >> https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ > > It looks as though the security certificates are awry - has someone > forgotten to pay the protection money to the right people? (Firefox > objects with "The page you are trying to view cannot be shown because > the authenticity of the received data could not be verified.") I see no such problems with Firefox ESR 52.9.0 on Debian testing. Moreover the program reports that the certificate is valid till 04/21/2020. > >> and in Emacs - to my disappointed it looks like the Unicode data are >> set at the compile time, but perhaps this can be negotiated with the >> developers. > > Can you be more specific? I often search characters by name with C-x 8 Return. I would like to use it also for MUFI characters, I have already the name list (the example directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked very closely into the problem and don't remember now the details, but my impression was that it's not simple. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Re: Private Use areas
On Thu, 23 Aug 2018 at 18:31, Richard Wordingham via Unicode <unicode@unicode.org> wrote: > On Thu, 23 Aug 2018 17:39:15 +0200 > Philippe Verdy via Unicode wrote: > > > You are confusing things: I do not propose "hacking" existing codes, but > > instead adding new codes for private variations. It's then up to PUV > > sequence authors to choose an appropriate base character that can > > have the properties they want to be inherited by the private-use > > variation sequence, or to choose a base character that will provide > > some reasonable reading if rendered as is (by renderers or fonts > > not implementing the private variation sequence, given that they will > > also append a symbol for the PUV itself after the standard character). > > Variation sequences cannot be used to add new characters. Do you remember that I did not speak about existing variation sequences? I spoke only about the new encoding of private-use variation sequences, which do not have to obey the policy of existing VS, and whose purpose would be to inherit most properties (notably direction, breaking, spacing, general category) of another existing character. > Most PUA > characters are used to represent new characters. I did not speak about PUAs either.
Re: Private Use areas
On Thu, 23 Aug 2018 17:39:15 +0200 Philippe Verdy via Unicode wrote: > You are confusing things: I do not propose "hacking" existing codes, but > instead adding new codes for private variations. It's then up to PUV > sequence authors to choose an appropriate base character that can > have the properties they want to be inherited by the private-use > variation sequence, or to choose a base character that will provide > some reasonable reading if rendered as is (by renderers or fonts > not implementing the private variation sequence, given that they will > also append a symbol for the PUV itself after the standard character). Variation sequences cannot be used to add new characters. Most PUA characters are used to represent new characters. A standard-conformant private variation sequence would generally achieve the same effect as could be achieved by a font feature (typically one of the cvxx, though possibly one of the ssxx), though using font features would be fiddlier and have more limited support, and variation sequences would facilitate data processing. Richard.
Re: Private Use areas
On Thu, 23 Aug 2018 14:10:35 +0200 "Janusz S. Bień via Unicode" wrote: > What kind of software do you have in mind? > > I'm primarily interested in the locally developed programs > > https://bitbucket.org/jsbien/unihistext/ > > https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ It looks as though the security certificates are awry - has someone forgotten to pay the protection money to the right people? (Firefox objects with "The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.") > and in Emacs - to my disappointment it looks like the Unicode data are > set at compile time, but perhaps this can be negotiated with the > developers. Can you be more specific? For Indic rearrangement I had to define syllables myself with definitions which I then added to composition-function-table. Unfortunately, I then hit the problem that I had to define Indic rearrangement myself, and OpenType fonts fall into several incompatible families, which is why I haven't released a general solution. My Emacs kit for Tai Tham is given via http://www.wrdingham.co.uk/lanna/toolkit.html (a probable kinsman got the 'o'), but there are a lot of odds and ends that need sorting out. I would expect that you would be able to override any relevant 'compiler' settings via your Emacs startup file - I expect Eli Zaretskii will be along soon with more details. Of course, you could always revert to the old tradition and recompile Emacs yourself - though it may need something like MinGW to compile for Windows. Richard.
Re: Private Use areas
You are confusing things: I do not propose "hacking" existing codes, but instead adding new codes for private variations. It's then up to PUV sequence authors to choose an appropriate base character that can have the properties they want to be inherited by the private-use variation sequence, or to choose a base character that will provide some reasonable reading if rendered as is (by renderers or fonts not implementing the private variation sequence, given that they will also append a symbol for the PUV itself after the standard character). Also, I do not want to change anything in any existing variation sequences (using VS1 and so on) or in their encoding policies, which require prior registration and standardisation. On Thu, 23 Aug 2018 at 11:42, Richard Wordingham via Unicode <unicode@unicode.org> wrote: > On Wed, 22 Aug 2018 11:58:58 +0200 > Philippe Verdy via Unicode wrote: > > > For now there's still no way to have variant sequences unless they are > > registered and standardized by Unicode but registration should be not > > needed (forbidden) for sequences containing PUV. > > I believe this scheme is no worse than hack encodings that use Latin > character codes for other characters. These schemes often work. > (Indeed, the currently best method of getting Tai Tham displayed as rich > text that I can find is to use a transliteration-type encoding and a > special font, though I can now get pretty close using the proper > character codes in the order laid down in the proposals.) > > The major problems I can see with appropriating variation sequences > are: > (1) It might be restricted to base characters - I have no > experimental evidence on whether this would happen. Fonts can happily > convert base characters to combining characters, though this works > best if Latin line-breaking rules take effect. > > (2) The appropriated variation sequence might be assigned a meaning - > but this is no worse than the general ambiguity of PUA characters. 
> > (3) Some base characters get special treatment. For example, I had > to change my transliteration scheme because hyphen-minus is treated > specially by MS Edge - I was using it as a digraph disjunctor - and > so clusters were not being formed. In this case, I would have come > unstuck as soon as line-wrapping started, so it was a bad choice anyway. > > Or are there significant renderers that deliberately ignore variation > selectors in unregistered, unstandardised variation sequences? I don't > recall any problems from when we were discussing variation > sequences for chess pieces. > > For supplementing a script, it might be best to start at > VARIATION-SELECTOR-256, and work down if need be with specialist > characters. > > Richard. >
Re: Private Use areas
On Tue, Aug 21 2018 at 11:23 -0700, unicode@unicode.org writes: > On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode > wrote: > > I think PUA users should provide the > properties of the characters used in a form analogous to that of Unicode > itself, and the software should be able to use this additional > information. > > I already provide this myself for my uses of the PUA as well as the > CSUR and any vendor-specific agreements I can find: > > http://www.kreativekorp.com/charset/PUADATA/ I would prefer to see the data in a repository, so others can comment and contribute. As for "any vendor-specific agreements", do MUFI and LINCUA qualify? https://folk.uib.no/hnooh/mufi/ http://andron-typeforum.xobor.de/t10f13-Towards-a-linguistic-corporate-use-area-LINCUA.html > > Of course there is no way to get software to use this information. What kind of software do you have in mind? I'm primarily interested in the locally developed programs https://bitbucket.org/jsbien/unihistext/ https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ and in Emacs - to my disappointment it looks like the Unicode data are set at compile time, but perhaps this can be negotiated with the developers. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Re: Private Use areas
On Wed, 22 Aug 2018 11:58:58 +0200 Philippe Verdy via Unicode wrote: > For now there's still no way to have variant sequences unless they are > registered and standardized by Unicode but registration should be not > needed (forbidden) for sequences containing PUV. I believe this scheme is no worse than hack encodings that use Latin character codes for other characters. These schemes often work. (Indeed, the currently best method of getting Tai Tham displayed as rich text that I can find is to use a transliteration-type encoding and a special font, though I can now get pretty close using the proper character codes in the order laid down in the proposals.) The major problems I can see with appropriating variation sequences are: (1) It might be restricted to base characters - I have no experimental evidence on whether this would happen. Fonts can happily convert base characters to combining characters, though this works best if Latin line-breaking rules take effect. (2) The appropriated variation sequence might be assigned a meaning - but this is no worse than the general ambiguity of PUA characters. (3) Some base characters get special treatment. For example, I had to change my transliteration scheme because hyphen-minus is treated specially by MS Edge - I was using it as a digraph disjunctor - and so clusters were not being formed. In this case, I would have come unstuck as soon as line-wrapping started, so it was a bad choice anyway. Or are there significant renderers that deliberately ignore variation selectors in unregistered, unstandardised variation sequences? I don't recall any problems from when we were discussing variation sequences for chess pieces. For supplementing a script, it might be best to start at VARIATION-SELECTOR-256, and work down if need be with specialist characters. Richard.
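For anyone wanting to experiment with the "start at VARIATION-SELECTOR-256 and work down" suggestion, the code points are simple to compute: VS1 to VS16 occupy U+FE00..U+FE0F and VS17 to VS256 occupy U+E0100..U+E01EF. The sketch below only builds the sequences; the function names are hypothetical, and nothing here implies that any renderer or font will honour an unregistered sequence.

```python
def variation_selector(n):
    """Return the character VARIATION SELECTOR-n for 1 <= n <= 256."""
    if 1 <= n <= 16:
        return chr(0xFE00 + n - 1)      # VS1..VS16
    if 17 <= n <= 256:
        return chr(0xE0100 + n - 17)    # VS17..VS256
    raise ValueError("variation selectors are numbered 1..256")


def candidate_selectors(count):
    """Selectors to try, starting at VS-256 and working down."""
    return [variation_selector(256 - i) for i in range(count)]


if __name__ == "__main__":
    base = "a"  # arbitrary base character chosen for the demonstration
    for vs in candidate_selectors(3):
        sequence = base + vs
        print(" ".join("U+%04X" % ord(c) for c in sequence))
```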
Re: Private Use areas
Maybe this debate could come to an end if there were a way to encode "private use variants", so that we could override an existing character with correct properties by creating a custom variant, which would immediately inherit the properties of the base character on which it is encoded. But for now there are no private use variant codes (PUV). I think that a small block of 16 codes (maybe even fewer) would be quite enough (given that they would be used only in pairs after any standard character). They could be used after any base character, possibly even after a combining character (so the default combining class for these PUV should be 0). For now there's still no way to have variant sequences unless they are registered and standardized by Unicode, but registration should not be needed (indeed, it should be forbidden) for sequences containing PUV. I think there's a usage pattern for such schemes. Their default (spacing) glyph could be a dotted circle with a single hex digit inside; it would itself be non-joining and bidi-neutral, and would be used only after a base character from which it would inherit the directionality (so the glyph would appear automatically on the correct side). Actual fonts implementing these PUV sequences would treat the PUV sequences as distinct unbreakable entities mapped to their own abstract character, and subject to common ligation. On Wed, 22 Aug 2018 at 04:58, Andrew Cunningham via Unicode <unicode@unicode.org> wrote: > > > On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode < > unicode@unicode.org> wrote: > >> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: >> >>> >>> >> Best we can do is shout loudly at OpenType tables and hope to cram in >> behavior (or at least appearance, which is more likely all we can get) that >> vaguely resembles what we're after. And that's not SO awful, given what >> we're dealing with. >> >>> >>> > At the moment I am looking at implementing three unencoded Arabic > characters in the PUA. 
> > For the foreseeable future OpenType is a non-starter, so I will look at > implementing them in Graphite tables in a font. > > Andrew > > > > -- > Andrew Cunningham > lang.supp...@gmail.com > > > >
Re: Private Use areas
On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode < unicode@unicode.org> wrote: > On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: > >> >> > Best we can do is shout loudly at OpenType tables and hope to cram in > behavior (or at least appearance, which is more likely all we can get) that > vaguely resembles what we're after. And that's not SO awful, given what > we're dealing with. > >> >> At the moment I am looking at implementing three unencoded Arabic characters in the PUA. For the foreseeable future OpenType is a non-starter, so I will look at implementing them in Graphite tables in a font. Andrew -- Andrew Cunningham lang.supp...@gmail.com
Re: Private Use areas
On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:
> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:
>> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>>>> Is there a block of RTL PUA also?
>>> No.
>> Perhaps there should be?
> This is a periodic suggestion that never goes anywhere--for good reason. (You can search the email archives and see that it keeps coming up.)
> Presuming that this question was asked in good faith...

Yeah, I know there has been talk about such things, and I also knew that whether or not there was an RTL block (which I did not remember for certain), there weren't going to be any *changes* in the PUA, and we were going to have to make do with what there was. There's no way to anticipate all the possible properties people would want in the PUA, though I remember thinking it was probably wrong to make the PUA *strongly* LTR; I know there's a not-strongly flavor too. Best we can do is shout loudly at OpenType tables and hope to cram in behavior (or at least appearance, which is more likely all we can get) that vaguely resembles what we're after. And that's not SO awful, given what we're dealing with.

> As I see it, the only feasible way for people to get specialized behavior for PUA ranges involves first ceasing to assume that somehow they can jawbone the UTC into *standardizing* some ranges for some particular use or another. That simply isn't going to happen. People who assume this is somehow easy, and that the UTC are a bunch of boneheads who stand in the way of obvious solutions, do not -- I contend -- understand the complicated interplay of character properties, stability guarantees, and implementation behavior baked into system support libraries for the Unicode Standard.

The whole point of the PUA is that it *isn't* standardized (by the UTC). It might have been nice to make some more varied choices of things that couldn't be left unspecified, but you're still going to wind up with "but there aren't any PUA codepoints that are JUST what I need!" And, as said, it's too late now.

~mark
Re: Private Use areas
On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: > Ken Whistler wrote: > > > The way forward for folks who want to do this kind thing is: > > > > 1. Define a *protocol* for reliable interchange of custom character > > property information about PUA code points. > > I've often thought that would be a great idea. You can't get to steps 2 > and 3 without step 1. I'd gladly participate in such a project. > As would I.
Re: Private Use areas
Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. -- Doug Ewell | Thornton, CO, US | ewellic.org
Re: Private Use areas
On Tue, Aug 21, 2018 at 11:03:41AM -0700, Ken Whistler via Unicode wrote: > > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: > > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > > > Is there a block of RTL PUA also? > > > No. > > Perhaps there should be? > > This is a periodic suggestion that never goes anywhere--for good reason. > (You can search the email archives and see that it keeps coming up.) > > Presuming that this question was asked in good faith... Oif, looks like mere months of inattentive lurking are not enough (the thread I got pointed to was from 2011). Apologies. > > or perhaps by allocating a new range elsewhere. > See: > > https://www.unicode.org/policies/stability_policy.html > > The General_Category property value Private_Use (Co) is immutable: the set > of code points with that value will never change. > > That guarantee has been in place since 1996, and is a rule that binds the > UTC. So nope, sorry, no more PUA ranges. Right. > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character property > information about PUA code points. [...] > And if the goal for #3 is to get some *system* implementer to support the > protocol in widespread software, then before starting any of #1, #2, or #3, > you had better start instead with: > > 0. Create a consortium (or other ongoing organization) with a 10-year time > horizon and participation by at least one major software implementer, to > define, publicize, and advocate for support of the protocol. Heh, good point. I wonder, perhaps a long-lived consortium tasked with assigning properties to characters already exists? So your answer _does_ provide a way to go: any PUA use that's no longer private, or any problem someone has with character properties, should go through official channels here instead of inventing an own standard. 
With my existing hats on (Debian fonts team member, and someone who messes with terminals in general) I already have two such itches to scratch. Thus, it sounds like I should do the research, prepare a write-up, and then come back to harass you folks with inane questions. Inventing new solutions that work around instead of with you is a bad idea... Meow! -- ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ
Re: Private Use areas
On Tue, 21 Aug 2018 11:03:41 -0700 Ken Whistler via Unicode wrote: > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > Really? Suppose someone wants to implement a bicameral script in PUA. > They would need case mappings for that, and how would those be > "better represented in the font itself"? Or how about digits? Would > numeric values for digits be "better represented in the font itself"? > How about implementation of punctuation? Would segmentation > properties and behavior be "better represented in the font itself"? The least intrusive way of defining the meaning of a graphic (sensu lato) character is by a font, in a very wide sense that would interpret a Unicode code chart as a font. Without a font in this sense, normal characters in the PUA have no meaning. If one insists on a font to have an interpretation, then: (1) PUA characters in plain text are meaningless - I believe that's pretty much the position now. (2) Different schemes can co-exist, even within the same formatted document, by having different formats. This is the case now. It then makes sense to store the properties in the font, which needs to be saved with or in the document for the document to continue to make sense. Casing and digits are luxuries. Are we not told that searching should be done by collation? We then do not need case-folding! Interpreting the preferred representation of Roman numerals does not use Unicode properties beyond the approximate principle of one character, one codepoint. As to segmentation, my understanding was that there were no characters available to indicate word boundaries in scriptio continua; the closest one has is line-breaking suggestions. If my memory serves me right, SIL Graphite fonts can hold line-breaking information. Richard.
Re: Private Use areas
On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode < unicode@unicode.org> wrote: > I think PUA users should provide the > properties of the characters used in a form analogical to the Unicode > itself, and the software should be able to use this additional > information. > I already provide this myself for my uses of the PUA as well as the CSUR and any vendor-specific agreements I can find: http://www.kreativekorp.com/charset/PUADATA/ Of course there is no way to get software to use this information. I have entertained the idea of being able to embed this information into the font itself as OpenType tables, e.g.: PUAB -> Blocks.txt PUAC -> CaseFolding.txt PUAW -> EastAsianWidth.txt PUAL -> LineBreak.txt PUAD -> UnicodeData.txt I've actually invented table names for the majority of UCD files, but those are probably the most relevant. The table names for the more obscure files get rather... creative, e.g.: PUA[ -> BidiBrackets.txt PUA] -> BidiMirroring.txt That alone may get some people to think twice about this idea. :P
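As a rough illustration of what "properties in a form analogous to Unicode itself" could look like to software, here is a minimal Python sketch that reads one line in the semicolon-separated, 15-field layout of the standard UnicodeData.txt. It is only an assumption that a private-use data file would follow that layout exactly; the sample line, the character name in it, and the `PuaRecord` and `parse_unicodedata_line` names are hypothetical and not taken from the PUADATA files linked above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PuaRecord:
    """A few of the 15 UnicodeData.txt fields, enough for a demonstration."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str


def parse_unicodedata_line(line: str) -> PuaRecord:
    """Parse one UnicodeData.txt-style line (15 semicolon-separated fields)."""
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return PuaRecord(
        code_point=int(fields[0], 16),   # field 0: code point in hex
        name=fields[1],                  # field 1: character name
        general_category=fields[2],      # field 2: General_Category
        combining_class=int(fields[3]),  # field 3: Canonical_Combining_Class
        bidi_class=fields[4],            # field 4: Bidi_Class
    )


# Hypothetical entry: a private agreement declaring U+E000 a right-to-left letter.
SAMPLE = "E000;HYPOTHETICAL RTL LETTER ALEF;Lo;0;R;;;;;N;;;;;"

if __name__ == "__main__":
    print(parse_unicodedata_line(SAMPLE))
```

Parsing is the easy part; as noted above, nothing obliges general-purpose software to consult such a file.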
Re: Private Use areas
On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:
> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>>> Is there a block of RTL PUA also?
>> No.
> Perhaps there should be?

This is a periodic suggestion that never goes anywhere--for good reason. (You can search the email archives and see that it keeps coming up.)

Presuming that this question was asked in good faith...

> What about designating a part of the PUA to have a specific property?

The problem with that is that assigning *any* non-default property to any PUA code point would break existing implementations' assumptions about PUA character properties and potentially create havoc with existing use.

> Only certain properties matter enough:

That is an un-demonstrated assertion that I don't think you have thought through sufficiently.

> * wide
> * RTL

RTL is not some binary counterpart of LTR. There are 23 values of Bidi_Class, and anyone who wanted to implement a right-to-left script in PUA might well have to make use of multiple values of Bidi_Class. Also, there are two major types of strong right-to-leftness: Bidi_Class=R and Bidi_Class=AL. Should a "RTL PUA" zone favor Arabic type behavior or non-Arabic type behavior?

> * combining

Also not a binary switch. Canonical_Combining_Class is a numeric value, and any value but ccc=0 for a PUA character would break normalization. Then for the General_Category, there are three types of "marks" that count as combining: gc=Mn, gc=Mc, gc=Me. Which of those would be favored in any PUA assignment?

> as most others are better represented in the font itself.

Really? Suppose someone wants to implement a bicameral script in PUA. They would need case mappings for that, and how would those be "better represented in the font itself"? Or how about digits? Would numeric values for digits be "better represented in the font itself"? How about implementation of punctuation? Would segmentation properties and behavior be "better represented in the font itself"?

> This could be done either by parceling one of existing PUA ranges: planes 15 and 16 are virtually unused thus any damage would be negligible;

That is simply an assertion -- and not the kind of assertion that the UTC tends to accept on spec. I rather suspect that there are multiple participants on this email list, for example, who *do* have implementations making extensive use of Planes 15/16 PUA code points for one thing or another.

> or perhaps by allocating a new range elsewhere.

See:

https://www.unicode.org/policies/stability_policy.html

The General_Category property value Private_Use (Co) is immutable: the set of code points with that value will never change.

That guarantee has been in place since 1996, and is a rule that binds the UTC. So nope, sorry, no more PUA ranges.

> Meow!

Grrr! ;-)

As I see it, the only feasible way for people to get specialized behavior for PUA ranges involves first ceasing to assume that somehow they can jawbone the UTC into *standardizing* some ranges for some particular use or another. That simply isn't going to happen. People who assume this is somehow easy, and that the UTC are a bunch of boneheads who stand in the way of obvious solutions, do not -- I contend -- understand the complicated interplay of character properties, stability guarantees, and implementation behavior baked into system support libraries for the Unicode Standard.

The way forward for folks who want to do this kind of thing is:

1. Define a *protocol* for reliable interchange of custom character property information about PUA code points.
2. Convince more than one party to actually *use* that protocol to define sets of interchangeable character property definitions.
3. Convince at least one implementer to support that protocol to create some relevant interchangeable *behavior* for those PUA characters.

And if the goal for #3 is to get some *system* implementer to support the protocol in widespread software, then before starting any of #1, #2, or #3, you had better start instead with:

0. Create a consortium (or other ongoing organization) with a 10-year time horizon and participation by at least one major software implementer, to define, publicize, and advocate for support of the protocol.

(And if you expect a major software implementer to participate, you might need to make sure you have a business case defined that would warrant such a 10-year effort!)

--Ken
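To make step 1 a little more concrete, here is a minimal Python sketch of how an implementation might consume such interchanged property data. No such protocol exists; the JSON layout, the key names (`agreement`, `characters`, `gc`, `bc`, `ccc`), and both function names are assumptions made up for illustration. The only real API used is the standard library's `json` and `unicodedata` modules. Overrides are honored only for code points whose standard General_Category is Co, and the mark in the example keeps ccc=0, in line with the normalization concern raised above.

```python
import json
import unicodedata

# Hypothetical interchange document for a private-use agreement.
AGREEMENT_JSON = """
{
  "agreement": "example-private-use-agreement",
  "characters": {
    "E000": {"gc": "Lo", "bc": "R",   "ccc": 0},
    "E001": {"gc": "Mn", "bc": "NSM", "ccc": 0}
  }
}
"""


def load_overrides(text: str) -> dict:
    """Return a mapping from code point (int) to its overridden properties."""
    data = json.loads(text)
    return {int(cp, 16): props for cp, props in data["characters"].items()}


def bidi_class(ch: str, overrides: dict) -> str:
    """Bidi_Class of ch, consulting overrides only for Private_Use characters."""
    default = unicodedata.bidirectional(ch)
    if unicodedata.category(ch) == "Co":
        return overrides.get(ord(ch), {}).get("bc", default)
    return default


if __name__ == "__main__":
    table = load_overrides(AGREEMENT_JSON)
    for ch in ("\uE000", "\uE001", "\uE002", "A"):
        print(f"U+{ord(ch):04X}", bidi_class(ch, table))
```

Restricting overrides to gc=Co code points is a design choice in this sketch: it keeps a private agreement from redefining standardized characters, which is the part that steps 2 and 3 would need implementers to agree on.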
Re: Private Use areas
2011 Thread: https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0124.html Please read in particular these two: - https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0174.html - https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0212.html (tl;dr: 1. the PUA set is fixed, 2. being private, the properties may be overridable by conformant implementations.) On Mon, Aug 20, 2018 at 5:17 PM Ken Whistler via Unicode < unicode@unicode.org> wrote: > > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > Is there a block of RTL PUA also? > > No. > > --Ken >
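Since the set of private-use code points is fixed, a membership test can simply hard-code the three ranges. The Python sketch below does that and cross-checks a few boundary values against the standard library's `unicodedata.category`; the names `PUA_RANGES` and `is_private_use` are hypothetical helpers introduced for this example.

```python
import unicodedata

# The three Private Use ranges: the BMP area plus planes 15 and 16.
# The last two code points of each plane are noncharacters, not PUA.
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(cp: int) -> bool:
    """True if the code point lies in one of the fixed PUA ranges."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


if __name__ == "__main__":
    # Compare the hard-coded ranges with General_Category=Co at the edges.
    for cp in (0xDFFF, 0xE000, 0xF8FF, 0xF900, 0xFFFFD, 0xFFFFE, 0x10FFFD):
        by_range = is_private_use(cp)
        by_category = unicodedata.category(chr(cp)) == "Co"
        print(f"U+{cp:04X}", by_range, by_category)
```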
Re: Private Use areas
On Tue, Aug 21 2018 at 16:56 +0200, unicode@unicode.org writes: > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: >> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: >> > Is there a block of RTL PUA also? >> >> No. > > Perhaps there should be? > > What about designating a part of the PUA to have a specific property? Only > certain properties matter enough: > * wide > * RTL > * combining > as most others are better represented in the font itself. > > This could be done either by parceling one of existing PUA ranges: planes 15 > and 16 are virtually unused thus any damage would be negligible; or perhaps > by allocating a new range elsewhere. I don't think it's a good idea. I think PUA users should provide the properties of the characters used in a form analogical to the Unicode itself, and the software should be able to use this additional information. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien
Re: Private Use areas
On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > Is there a block of RTL PUA also? > > No. Perhaps there should be? What about designating a part of the PUA to have a specific property? Only certain properties matter enough: * wide * RTL * combining as most others are better represented in the font itself. This could be done either by parceling one of existing PUA ranges: planes 15 and 16 are virtually unused thus any damage would be negligible; or perhaps by allocating a new range elsewhere. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition: ⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17] ⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37] ⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]
Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
On Tue, 21 Aug 2018 08:53:18 +0800 via Unicode wrote: > On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote: > > Still, maybe it > > doesn't really matter much: your special-purpose font can treat any > > codepoint any way it likes, right? > Not all properties come from the font. For example a Zhuang character > PUA font, which supplements CJK ideographs, does not rotate > characters 90 degrees, when change from RTL to vertical display of > text. Isn't that supposed to be treated by an OpenType feature such as 'vert'? Or does the rendering stack get in the way? However, one might need reflowing text to be about 40% WJ. Richard.
Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
Doug Ewell wrote: > Yes, you run the risk of someone else's PUA implementation colliding with > yours. That's why you create a Private Use Agreement, and make sure it's > prominently available to people who want to use your solution. It's not like > there are hundreds of PUA schemes anyway. Yes, that is generally true. However, a situation where that does not matter is if one just wishes to include some specially designed glyphs of one's own design in a PDF (Portable Document Format) document and one uses a Private Use Area encoding simply so that the PDF document with a subset of the glyphs of the font embedded in the PDF can be produced using a desktop publishing program. That is, one makes the font, one installs the font, one uses the font within the desktop publishing package. I have used that technique and the technique worked very well as the Windows operating system treated my font the same way as it did other fonts. With the desktop publishing package that I am using (Serif PagePlus version X7) that is only using the plane zero Private Use Area. Thus the providing of information to anyone reading the PDF document is as displayed glyphs rather than as code points. The availability of the Private Use Area allowed me to make such code point assignments for the glyphs that I had designed and then use those code points in a manner entirely compatible with The Unicode Standard. William Overington Monday 20 August 2018
Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote:
> On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote:
>>> ... some people who would call a PUA solution either batty or crazy.
>> I don't think it is either batty or crazy. People can certainly use the PUA to interchange text (assuming that they have downloaded fonts and keyboards or some other input method beforehand), and it can definitely serve as a proof of concept. Plain symbols — with no interactions between them (like changing shape with complex scripts), no combining/non-spacing marks, no case mappings, and so on — are the best possible case for PUA.
> It is kind of a bummer, though, that you can't experiment (easily? or at all?) in the PUA with scripts that have complex behavior, or even not-so-complex behavior like accents & combining marks, or RTL direction (here, also, am I speaking true? Is there a block of RTL PUA also? I guess there's always RLO, but meh.) Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right?

Not all properties come from the font. For example, a Zhuang character PUA font, which supplements CJK ideographs, does not rotate characters 90 degrees when changing from RTL to vertical display of text.

John Knightley

> ~mark
Re: Private Use areas
On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: Is there a block of RTL PUA also? No. --Ken
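For anyone who wants to check this, the default properties of PUA code points can be queried from Python's standard `unicodedata` module. The sketch below looks up General_Category, Bidi_Class, and Canonical_Combining_Class for one code point from each of the three PUA ranges; `default_properties` is a hypothetical helper name, and the results reflect whatever Unicode version the local Python build ships.

```python
import unicodedata


def default_properties(ch: str) -> tuple:
    """Return (General_Category, Bidi_Class, Canonical_Combining_Class)."""
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )


if __name__ == "__main__":
    # One sample from the BMP PUA, plane 15, and plane 16.
    for ch in ("\uE000", "\U000F0000", "\U00100000"):
        print(f"U+{ord(ch):04X}", default_properties(ch))
```

Every sample should report a strong left-to-right Bidi_Class, which is the "strongly LTR" default discussed elsewhere in this thread.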
Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote:
>> ... some people who would call a PUA solution either batty or crazy.
> I don't think it is either batty or crazy. People can certainly use the PUA to interchange text (assuming that they have downloaded fonts and keyboards or some other input method beforehand), and it can definitely serve as a proof of concept. Plain symbols — with no interactions between them (like changing shape with complex scripts), no combining/non-spacing marks, no case mappings, and so on — are the best possible case for PUA.

It is kind of a bummer, though, that you can't experiment (easily? or at all?) in the PUA with scripts that have complex behavior, or even not-so-complex behavior like accents & combining marks, or RTL direction (here, also, am I speaking true? Is there a block of RTL PUA also? I guess there's always RLO, but meh.) Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right?

~mark
RE: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
Mark Davis wrote: > The only caution I would give is that people shouldn't expect general > purpose software to do anything with PUA text that depends on > character properties. Very true, and a good point. People with creative PUA ideas do sometimes expect this to magically work. I have anecdotes, if anyone is interested off-list. -- Doug Ewell | Thornton, CO, US | ewellic.org
Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
> ... some people who would call a PUA solution either batty or crazy.

I don't think it is either batty or crazy. People can certainly use the PUA to interchange text (assuming that they have downloaded fonts and keyboards or some other input method beforehand), and it can definitely serve as a proof of concept. Plain symbols — with no interactions between them (like changing shape with complex scripts), no combining/non-spacing marks, no case mappings, and so on — are the best possible case for PUA.

The only caution I would give is that people shouldn't expect general purpose software to do anything with PUA text that depends on character properties.

Mark

On Mon, Aug 20, 2018 at 8:52 PM Doug Ewell via Unicode wrote:
> James Kass wrote:
>> As a caveat, some Unicode cognoscenti express disdain for the PUA, so there would be some people who would call a PUA solution either batty or crazy.
> I'm concerned that the constant "health warnings" about avoiding the PUA may have scared everyone away from this primary use case.
> Yes, you run the risk of someone else's PUA implementation colliding with yours. That's why you create a Private Use Agreement, and make sure it's prominently available to people who want to use your solution. It's not like there are hundreds of PUA schemes anyway.
> Yes, you will have to convert any existing data if the solution ever gets encoded in Unicode. That happened for Deseret and Shavian, and maybe others, and the sky didn't fall.
> People forget that it was the PUA in Shift-JIS, by Japanese mobile providers, that provided the platform for emoji to take off to such an extent that... well, we know the rest. If private-use is good enough for a legacy encoding, it ought to be good enough for Unicode.
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))
James Kass wrote: > As a caveat, some Unicode cognoscenti express disdain for the PUA, so > there would be some people who would call a PUA solution either batty > or crazy. I'm concerned that the constant "health warnings" about avoiding the PUA may have scared everyone away from this primary use case. Yes, you run the risk of someone else's PUA implementation colliding with yours. That's why you create a Private Use Agreement, and make sure it's prominently available to people who want to use your solution. It's not like there are hundreds of PUA schemes anyway. Yes, you will have to convert any existing data if the solution ever gets encoded in Unicode. That happened for Deseret and Shavian, and maybe others, and the sky didn't fall. People forget that it was the PUA in Shift-JIS, by Japanese mobile providers, that provided the platform for emoji to take off to such an extent that... well, we know the rest. If private-use is good enough for a legacy encoding, it ought to be good enough for Unicode. -- Doug Ewell | Thornton, CO, US | ewellic.org