Re: Solidus variations
Murray's work comes from the desire to represent mathematical equations faithfully, based almost entirely on the semantics of the operators, with those operators represented as Unicode characters. One device he uses is redundant parens. Parens can be supplied to group operands, so that you get the correct precedence, but where they are not necessary for the human reader, they are dropped in the formatted equation. The input format (the linear format) therefore looks more like ordinary source code, in that one does type parens. When fractions are built up, you don't need the parens, so they are dropped in layout. If you take the same fraction and display it inline (with a slash), some or all of the parens are needed by the human reader as well, so those are displayed. How would you parse 5½ if input as a fraction? As 51/2? You do need some form of grouping to recognize that the 5 and the 1 are not part of the same numerator. A./
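The grouping problem can be made concrete with a toy reader (this is only an illustration, not Murray Sargent's actual linear-format grammar): digits glue together greedily, so without a space or parens, "51/2" can only mean fifty-one halves.

```python
from fractions import Fraction

def linear_value(s):
    # Toy linear-format reader: "51/2" is one fraction, fifty-one
    # halves; "5 1/2" uses the space to group the 5 apart from the
    # fraction, giving the mixed number five and one half.
    if " " not in s:
        num, den = s.split("/")
        return Fraction(int(num), int(den))
    whole, frac = s.split(" ")
    num, den = frac.split("/")
    return int(whole) + Fraction(int(num), int(den))

print(linear_value("51/2"))   # 51/2
print(linear_value("5 1/2"))  # 11/2
```

The same byte sequence of digits yields two different values, which is exactly why some grouping device must survive into the inline (slashed) display.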
Re: Continue: Glaring mistake in nomenclature , should it have been Assamese ?
On 9/14/2011 11:14 AM, Michael Everson wrote: At this point, I think I have to make a plea: Sarasvati, spare us. +1
Re: Need for Level Direction Mark
On 9/13/2011 6:01 AM, Philippe Verdy wrote: Unfortunately, adding controls would imply the creation of new Bidi classes for them (and forgetting the stability policy about them, which was published too soon, before evident problems were solved). The first part is correct, and giving up stability to that degree would be a serious issue. I disagree with the second part. True plaintext bidi will always be a compromise, because there's a lack of information about the intent of the writer. (In rich text, you can supply that with styles.) There's a limited workaround with bidi controls, but that's beginning to be a form of minimal rich text in itself. Stability is paramount for predictability. You need to be able to predict what your reader will see, and you can only do that when you can rely on all implementations agreeing on the details of how to lay out bidi. Introducing any new feature now will result in decades of implementations having different levels of support for it. This makes the use of such a new feature unpredictable - and that is a problem whether there is a formal stability guarantee or not. A./
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
On 9/9/2011 8:12 PM, Stephan Stiller wrote: Dear Martin, Thanks for alerting me to the issue of causal direction of aesthetic preference - it's been on my mind, but your reply helps me sort out some details. When I first encountered text (outside of the German language locale) with ample use of ligatures in modern printed text, I definitely found the ligatures a bit distracting, but partly just because I wasn't used to them. I also perceived them as a solution to what (in Germany) appeared to me to be a real non-issue. Put simply, there is a conflict between full flexibility for font designs and the burden imposed by sophisticated ligatures and kerning tables. From my background I never perceived a need, but I guess I (and most people??) wouldn't really mind the tradition coming back (in Germany) if things are designed well (which is the job of the font designer) and for the user everything is handled automatically in the background by the available technology ... Which cannot happen for German, as it is one of the languages where the same letter pair may or may not take a ligature depending on the *meaning* of the word - something you can't automate. We had famous discussions on this list on this subject. Take the st ligature. The German word Wachstube has two meanings, and only one allows the st ligature: Wach-stube ('guardroom'), where s and t belong to the same morpheme, allows it, while Wachs-tube ('tube of wax'), where the morpheme boundary falls between s and t, does not. A human would have to decide when the ligature is appropriate. (Incidentally, the same goes for hyphenation of this word: one meaning allows a hyphen after the s, the other does not.) Certain layout processes, in certain cases, in certain languages, simply can't be fully automated. A./ Stephan
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/31/2011 11:25 PM, Philippe Verdy wrote: 2011/9/1 Karl Williamson pub...@khwilliamson.com: But now that I'm a UTC member, I hope I will hear these cases earlier... Congratulations! Does it justify so many new aliases at the same time? No. I'm firmly with you. I support the requirement for 1 (ONE) alias for control codes, because they don't have names but are used in environments where they need a string identifier other than a code point. (Just like regular characters, but even more so.) I also support the requirement for 1 (ONE) short identifier for all those control AND format characters for which widespread usage of such an abbreviation is customary. (VS-257 does not qualify.) Further, I support, on a case-by-case basis, the addition of duplicate aliases for reasons of compatibility. I would expect these compatibility requirements to be documented for each character in some sort of proposal document, not just as a list of entries in a draft property file. Finally, I don't support using the name of any standard, ISO or otherwise, as a label in the new status field. It sets the wrong precedent. I've not checked the history of all past versions of UAX, UTR, and UTN (or even the text of chapters of the main UTS)... Are there other cases in those past versions that this PRI should investigate and track back? My preference would be to start this new scheme off with a minimum of absolutely 100% required aliases. Anything even remotely doubtful should be removed for further study. A./
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/28/2011 9:46 PM, Doug Ewell wrote: Philippe Verdy wrote: If there are other mappings to do with other standards, and those standards must be only informative, we already have the /MAPPINGS directory beside the /UNIDATA directory where the UCD belongs too. But in general, with the exception of MAPPINGS/VENDORS/MISC/SGML.TXT, the MAPPINGS directory isn't really a place for character *name* mappings. It's primarily a place for *code point* mappings, for identifying U+0430 CYRILLIC SMALL LETTER A with 0xD0 in ISO 8859-5, 0xE0 in Windows-1251, 0xE0 in MacCyrillic, and 0xC1 in KOI8-R. Character names in other standards, like 'acy' for U+0430, are comparatively less important. Right; however, NAME mapping has not been a major issue - except for control codes, since Unicode did not name these, even though they were routinely named by people dealing with them. It's really important not to jump off the deep end and appear to create a precedent for name MAPPING across standards, when what is desired is to have IDENTIFIERS for certain characters, as well as SHORT IDENTIFIERS for characters very commonly referred to by identifier in source code (regular expressions, etc.). A./
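The code-point mappings Doug lists can be checked directly against the legacy codecs bundled with Python (a quick sketch; codec names follow Python's standard aliases, and the byte values shown are what those codecs produce):

```python
# U+0430 CYRILLIC SMALL LETTER A lands on a different byte in each
# legacy encoding -- exactly the kind of *code point* mapping the
# /MAPPINGS directory records.
ch = "\u0430"
for codec in ("iso8859_5", "cp1251", "mac_cyrillic", "koi8_r"):
    print(codec, hex(ch.encode(codec)[0]))
```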
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/28/2011 6:43 PM, Philippe Verdy wrote: 2011/8/27 Asmus Freytag asm...@ix.netcom.com: I also think that the status field iso6429 is badly named. It should be control, and what is named control should be control-alternate, or perhaps, both of these groups should become simply control. I think the labels chosen by the data file just set up bad precedents. If 6429, why not a section for 9535 (or whatever the kbd standard is) etc. Thanks a lot for admitting what I was trying to demonstrate to you in a prior message (which was dismissed early on as a complete non-starter). You appeared to be making a non-starter proposal, rather than clearly making a hypothetical proposal designed only to showcase certain logical flaws in the PRI. If the latter was your intention, well, we misunderstood you, but everybody seems to be on the same page now, which is good. I also think that there are too many aliases for controls. If the only need is for Perl to have a name to uniquely designate those controls, choose one alias name; there's absolutely no urgency for adding four aliases at once for them when there's no demonstration that all those aliases are needed! This is just unnecessary pollution of the UCS namespace. I tend to agree - however, I do think giving the common abbreviations some formal status is useful. If I remember correctly, even in Perl some of the supported names are legacy names. If programs other than Perl have an active need to support legacy names, then I would favor adding these one by one as demonstrated needs arise, but NOT wholesale, just because they existed in 6429 in some version. Now, here's a subtle point: adding certain alias strings to the file is a cheap way for the editing tools that verify the uniqueness of the namespace to reserve a name (so it can't ever be given to a different character). Kind of like what happened to BELL. 
I bet a big motivation behind the long list (all for control codes) was to prevent any non-control code from ever getting a name that happens to match a known control code name. While I appreciate that sentiment, I think this part of the proposal should not be rushed - aliases are forever, and warehousing all known obsolete names for control codes is a bit bizarre. I think you and I are possibly in agreement on that. If there are other mappings ... I've replied on the issue of mappings in reply to Doug's message. A./
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/26/2011 10:09 PM, Philippe Verdy wrote: 2011/8/27 Asmus Freytag asm...@ix.netcom.com: I agree with Ken that Philippe's suggestion of conflating the annotations for mathematical use with formal Unicode name aliases is a non-starter. Yes, but why then add ISO 6429 alias names? What makes ISO 6429 a better choice than another ISO standard that you want to reject as a non-starter option in the normative UCS namespace? Because, had the control codes been treated like normal characters, they would have been named for their 6429 counterparts. For these characters, not having any formal identifiers in the standard created the very problem these aliases are now trying to fix. And why drop some naming rules for some of the proposed alias names, if this namespace also has normative rules? If you want consistency, those aliases could just as well be informative only, and not part of the UCS namespace, avoiding some of its restrictions - i.e., not defined in the UCD itself but in a separate database. I think the naming rules issue is a bug, and it needs to be dealt with in revising the draft. I've already answered that, but you haven't seen the answer come through on the list. And you did not reply to the question about the stability of the related standard using these aliases, compared to the stability requirement for the UCS namespace: if there's no such stability, the normative reference in the UCD will remain only informative for the other standard, creating possible future conflicts. This is no different than for character names derived from other standards. If those standards subsequently change designators, too bad. You misconstrue the issue slightly: Unicode would not make a normative reference; it would copy, once, a particular name and use it as an alias. A./
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/26/2011 7:52 PM, Benjamin M Scarborough wrote: Are name aliases exempted from the normal character naming conventions? I ask because four of the entries have words that begin with numbers. 008E;SINGLE-SHIFT 2;control 008F;SINGLE-SHIFT 3;control 0091;PRIVATE USE 1;control 0092;PRIVATE USE 2;control This is a good point. While the restriction may look silly, it exists because the character name matching rules disregard both hyphens and spaces. Under those rules, the following strings are all equivalent:

  SINGLE-SHIFT1
  SINGLE-SHIFT 1
  SINGLE-SHIFT-1
  SINGLE SHIFT 1
  SINGLE SHIFT-1

(as would be the dozens of permutations that introduce hyphens or spaces at other positions). Given those matching rules, if the formal alias were SINGLE-SHIFT-1, any programming environment could still recognize the name SINGLE-SHIFT 1, since underneath, all of these match to SINGLESHIFT1. Perhaps the formal aliases should be corrected in the draft file to simply follow the established naming conventions, without introducing yet another level of exception. A./
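The matching behavior can be sketched in a few lines (a simplification of the loose-matching idea in UAX #44; the real rule is more careful about medial hyphens, as in U+1180 HANGUL JUNGSEONG O-E, which this sketch ignores):

```python
def loose_key(name):
    # Simplified loose matching: ignore case, spaces, and hyphens.
    return name.upper().replace(" ", "").replace("-", "")

variants = [
    "SINGLE-SHIFT1",
    "SINGLE-SHIFT 1",
    "SINGLE-SHIFT-1",
    "SINGLE SHIFT 1",
    "SINGLE SHIFT-1",
]
# All five spellings collapse to one key.
print({loose_key(v) for v in variants})  # {'SINGLESHIFT1'}
```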
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/27/2011 1:31 AM, Andrew West wrote: On 27 August 2011 09:25, Andrew West andrewcw...@gmail.com wrote: On 27 August 2011 03:52, Benjamin M Scarborough benjamin.scarboro...@utdallas.edu wrote: Are name aliases exempted from the normal character naming conventions? I ask because four of the entries have words that begin with numbers. 008E;SINGLE-SHIFT 2;control 008F;SINGLE-SHIFT 3;control 0091;PRIVATE USE 1;control 0092;PRIVATE USE 2;control ISO 6429 (and consequently ISO/IEC 10646 Section 11) calls these characters: SINGLE-SHIFT TWO SINGLE-SHIFT THREE PRIVATE USE ONE PRIVATE USE TWO http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf Changing their names to SINGLE-SHIFT 2 or SINGLE-SHIFT-2 etc. is surely contrary to the whole point of the exercise. Sorry, ignore that. I hadn't noticed that the digit forms were in addition to the forms with numbers written as words. Actually, you brought something to my attention that I had missed on reading the file, so I won't ignore this. Having these ill-formatted names *in addition* to essentially the same names that do follow the naming conventions strikes me as silly. It would set a potential precedent for adding aliases for any character name containing either a digit or the name for that digit. The PRI gives no rationale for the inclusion of names valid in earlier versions. If there's a known deviation that is currently supported (as a named character ID, such as in regular expressions) in widely distributed software, I would support the addition on compatibility grounds (with tweaks that follow the naming rules). But simply because a name existed once (and was later deprecated) strikes me as going in the same encyclopedic direction that Ken himself has disavowed. I do think now that grouping the file is a bad idea, because several people in this discussion, myself included, missed these particular near duplicates. The natural thing is wanting to know all names/aliases for a character. 
If someone needs grouping for some purposes, a spreadsheet or other tool can easily be used to filter by status field. I also think that the status field iso6429 is badly named. It should be control, and what is named control should be control-alternate, or perhaps, both of these groups should become simply control. I think the labels chosen by the data file just set up bad precedents. If 6429, why not a section for 9535 (or whatever the kbd standard is) etc. A./
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
I agree with Ken that Philippe's suggestion of conflating the annotations for mathematical use with formal Unicode name aliases is a non-starter. The former exist to help mathematicians identify symbols in Unicode, when they know their names from entity lists. The latter are designed to allow programmers to support identifiers that match existing usage -- mainly for characters for which there currently is not any well-defined ID, or for characters for which their abbreviated name is their de facto name. In a limited number of cases, that would lead to multiple aliases for the same character. The ideal is, as always, to have a single identifier per character, where possible. In a few exceptional cases, allowing alternate IDs via the NameAlias technique is of such overwhelming practical use as to justify an exception. Aliases come from the same namespace as character names, and must be unique, so that they can be used to unambiguously identify a character. They are intended to be used in programmatic interfaces, for example regular expressions. Adding redundant identifiers comes at a cost: all implementations have to rev their name tables, and using recently added aliases might not be portable until all implementations have caught up. That's why proposals to add additional aliases to any *existing* character should have to pass a really high bar. (I find the rationale for this initial expansion well thought out and defensible - leaving the control codes unnamed in 10646 has proven problematic to implementers.) There's no strict limit to *informative* aliases for characters, nor is there a uniqueness requirement. If there are important real-world designations under which certain characters are known, they could be documented with informative aliases. These informative aliases are then available to user interface designers who wish to support a search-for-character-by-name feature. 
Unlike the case for program source code, such interfaces can handle multiple hits for the same name - by presenting a list, for example. Ultimately, even in this case, some annotations are better presented in special-purpose files than as informative records in the nameslist. That was done for mathematics. If there are other fields with established conventions for naming symbols, perhaps someone could provide an analogous list - but it should have no bearing on the PRI under consideration. A./
Re: Code pages and Unicode
On 8/24/2011 7:45 PM, Richard Wordingham wrote: Which earlier coding system supported Welsh? (I'm thinking of 'W WITH CIRCUMFLEX', U+0174 and U+0175.) How was the use of the canonical decompositions incompatible with the character encodings of legacy systems? Latin-1 has the same codes as ISO-8859-1, but that's as far as having the same codes goes. Was the use of combining jamo incompatible with legacy Hangul encodings? See how time flies. Early adopters were interested in 1:1 transcoding, using a single 256-entry table for an 8-bit character set, with guaranteed predictable length. Early designs of Unicode (and 10646) attempted to address these concerns, because ignoring them would have created severe impediments to migration. Some characters were included as part of the merger, without the same rigorous process as is in force for characters today. At that time, scuttling the deal over a few characters here or there would not have been a reasonable action. So you will always find some exceptions to many of the principles - which doesn't make them less valid. Obviously D800 D800 000E DC00 is non-conformant with current UTF-16. Remembering that there is a guarantee that there will be no more surrogate points, an extension form has to be non-conformant with current UTF-16! And that's the reason why there's no interest in this part of the discussion. Nobody will need an extension next Tuesday, or in a decade, or even in several decades - or ever. Haven't seen an upgrade to Morse code recently to handle Unicode, for example. Technology has a way of moving on. So, the best thing is to drop this silly discussion, and let those future people that might be facing a real *requirement* use their good judgment to come to a technical solution appropriate to their time - instead of wasting collective cycles discussing how to make 1990s technology work for an unknown future requirement. It's just bad engineering. Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit range. 
I disagree (as would anyone with a bit of long-term perspective). Nobody needs to look into this for decades, so let it rest. A./
Re: Designing a format for research use of the PUA in a RTL mode (from Re: RTL PUA?)
On 8/23/2011 7:22 AM, Doug Ewell wrote: Of all applications, a word processor or DTP application would want to know more about the properties of characters than just whether they are RTL. Line breaking, word breaking, and case mapping come to mind. I would think the format used by standard UCD files, or the XML equivalent, would be preferable to making one up: The right answer would follow the XML format of the UCD. That's the only format that allows all necessary information to be contained in one file, and it would leverage any effort that users of the main UCD have made in parsing the XML format. An XML format should also be flexible, in that you can add/remove not just characters but properties as needed. The worst thing to do, other than designing something from scratch, would be to replicate the UnicodeData.txt layout with its arbitrary but fixed collection of properties and insanely many semicolons. None of the existing UCD txt files carries all the needed data in a single file. A./
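A minimal sketch of what consuming such a file looks like, assuming the UAX #42 flat-XML style where each char element carries properties as attributes (cp = code point, bc = Bidi_Class). The tiny inline document here is a hypothetical stand-in for a real private-agreement property file, and a real UCD file would also carry an XML namespace, which this sketch omits for brevity:

```python
import xml.etree.ElementTree as ET

# Hypothetical PUA property file in the UAX #42 flat style.
SAMPLE = """
<ucd>
  <char cp="F0000" bc="AL"/>
  <char cp="F0001" bc="ON"/>
</ucd>
"""

def bidi_classes(xml_text):
    # One pass over the char elements, collecting code point -> bc.
    root = ET.fromstring(xml_text)
    return {c.get("cp"): c.get("bc") for c in root.iter("char")}

print(bidi_classes(SAMPLE))  # {'F0000': 'AL', 'F0001': 'ON'}
```

Adding a property means adding an attribute, and adding a character means adding an element, which is the flexibility argued for above.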
Re: Code pages and Unicode
On 8/23/2011 12:00 PM, Richard Wordingham wrote: On Mon, 22 Aug 2011 16:18:56 -0700 Ken Whistler k...@sybase.com wrote: How about Clause 12.5 of ISO/IEC 10646: 001B, 0025, 0040 You escape out of UTF-16 to ISO 2022, and then you can do whatever the heck you want, including exchange and processing of complete 4-byte forms, with all the billions of characters folks seem to think they need. Of course you would have to convince implementers to honor the ISO 2022 escape sequence... Which they only need to if the text is in an ISO 2022 or similar context. Your idea does suggest that a pattern of high-high-SO-low would be reasonable. I don't see where Ken's reply (as quoted) suggests anything like that. What he wrote is that, formally, 10646 supports a mechanism to switch to ISO 2022. Therefore, formally, there's an escape hatch built in. If and when one should be needed, in a few hundred years, it'll be there. Until then, I find further speculation rather pointless and would love to see it move off this list (until such time). A./
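For concreteness, the clause 12.5 sequence Ken cites is just three bytes, ESC (0x1B) '%' (0x25) '@' (0x40), the ISO/IEC 2022 announcement for returning to ISO 2022. A trivial sketch of spotting it in a byte stream (the function name and stream contents are illustrative, not from any real implementation):

```python
# ESC % @ -- the "return to ISO 2022" escape sequence from
# ISO/IEC 10646 clause 12.5.
ESC_RETURN_TO_2022 = b"\x1b\x25\x40"

def find_escape(data: bytes) -> int:
    """Return the offset of the escape sequence, or -1 if absent."""
    return data.find(ESC_RETURN_TO_2022)

print(find_escape(b"UTF-16 payload...\x1b%@<2022 data>"))  # 17
```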
Re: RTL PUA?
On 8/21/2011 7:34 PM, Doug Ewell wrote: So what you are asking about is a directional control character that would assign subsequent characters a BC of 'AL', right? You don't want to call this a LANGUAGE MARK or anything else that implies language identification, because of the existence of real language identification mechanisms and the history of Unicode and language tagging. An ARM (Arabic RTL Mark) would be a sensible addition to the standard. It would close a small gap in the design that currently prevents a fully faithful plain text export of bidi text from rich text (higher-level protocol) formats. In an HLP you can assign any run to behave as if it were following a character with bidi property AL. When you export this text as plain text, unless there is an actual AL character, you cannot get the same behavior (other than by the heavy-handed method of completely overriding the directionality, making your plain text less editable). So, yes, there's a bit of a use case for such a mark. (Its effect is limited to the treatment of numeric expressions, so it's not an Arabic language mark, but one that triggers the same bidi context as the presence of an Arabic script (AL) character.) A./ -- Doug Ewell • d...@ewellic.org Sent via BlackBerry by AT&T -Original Message- From: Richard Wordingham richard.wording...@ntlworld.com Sender: unicode-bou...@unicode.org Date: Mon, 22 Aug 2011 03:19:39 To: Unicode Mailing List unicode@unicode.org Subject: Re: RTL PUA? On Sun, 21 Aug 2011 23:55:46 + Doug Ewell d...@ewellic.org wrote: What's a LANGUAGE MARK? There are *three* strong directionalities - 'L' left-to-right, 'AL' right-to-left as in Arabic, 'R' right-to-left (as in Hebrew, I suspect). 'AL' and 'R' have different effects on certain characters next to digits - it's the mind-numbing part of the BiDi algorithm. With one, a $ sign after a string of European (or is it Arabic?) digits appears on the left, and with the other it appears on the right. 
I can't remember whether 'higher-level protocols' have an effect on this logic. LRM has a BC of L, and RLM has a BC of R, but no invisible character has a BC of AL. That's why I tentatively raised the notion of an ARABIC LANGUAGE MARK. Incidentally, an RLO gives characters a temporary BC of R, not AL. Richard.
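The classes Richard lists can be verified directly with Python's unicodedata (values as of the Unicode version bundled with the interpreter):

```python
import unicodedata as ud

# The three strong Bidi classes, plus the marks: LRM is L and RLM is
# R, but no invisible character carries AL -- the gap discussed above.
print(ud.bidirectional("A"))       # 'L'
print(ud.bidirectional("\u05D0"))  # 'R'  (HEBREW LETTER ALEF)
print(ud.bidirectional("\u0627"))  # 'AL' (ARABIC LETTER ALEF)
print(ud.bidirectional("\u200E"))  # 'L'  (LEFT-TO-RIGHT MARK)
print(ud.bidirectional("\u200F"))  # 'R'  (RIGHT-TO-LEFT MARK)
```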
Re: Implement BIDI algorithm by line
Huh? What context is this in? On 8/22/2011 11:18 AM, CE Whitehead wrote: Hi. I think many line breaks within paragraphs are soft line breaks, but embedding levels have to be taken into account when deciding the width of the glyphs; that's as near as I can tell. Here is the description of the algorithm -- is this what you have read? http://unicode.org/reports/tr9/ Some rules are in fact applied after the line wrapping (after the soft breaks): "The following rules describe the logical process of finding the correct display order. As opposed to resolution phases, these rules act on a per-line basis and are applied *after* any line wrapping is applied to the paragraph." Logically there are the following steps:

* The levels of the text are determined according to the previous rules.
* The characters are shaped into glyphs according to their context (taking the embedding levels into account for mirroring).
* The accumulated widths of those glyphs (in logical order) are used to determine line breaks.
* For each line, rules L1-L4 (http://unicode.org/reports/tr9/#L1 through http://unicode.org/reports/tr9/#L4) are used to reorder the characters on that line.

(I'd have to reread the whole document on line breaking and then on bidi to answer this truly; sorry; hope this helps anyway) --C. E. Whitehead cewcat...@hotmail.com
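The per-line reordering step can be sketched as a toy version of rule L2 only (this is not a full UBA implementation; levels are assumed already resolved, and rules L1, L3, and L4 are skipped):

```python
def reorder_line(text, levels):
    # Rule L2 (sketch): from the highest level down to the lowest odd
    # level, reverse every contiguous run of characters at that level
    # or higher.
    chars = list(text)
    odd_levels = [l for l in levels if l % 2]
    if not odd_levels:
        return text  # all left-to-right; nothing to reorder
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                # Reverse the run's characters; the levels array stays
                # fixed, which works because higher-level runs nest
                # inside lower-level ones.
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)

# An RTL word (level 1) embedded in an LTR line (level 0):
print(reorder_line("abc CBA def", [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]))
# -> abc ABC def
```

Note how the reversal happens per line, after line breaking has fixed which characters the line contains, exactly as the quoted TR9 text says.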
Re: RTL PUA?
On 8/21/2011 3:31 PM, Richard Wordingham wrote: On Sun, 21 Aug 2011 11:00:26 -0600 Doug Ewell d...@ewellic.org wrote: I think as soon as we start talking about this many scenarios, we are no longer talking about what the *default* bidi class of the PUA (or some part of it) should be. Instead, we are talking about being able to specify private customizations, so that one can have 'AL' runs and 'ON' runs and so forth. I was exploring the consequences to see if there was a one-size-fits-all solution. Someone (you?) suggested ON as a default, and I like it. I think it would also work fairly well for practical CJK applications as well - the only problems are that LRM and RLM would occasionally be needed, and the subtle differences between AL and R would be lost. I expect ARABIC LANGUAGE MARK would not go down well - has it already been proposed and rejected? If your implementation supported the directional overrides, it would be possible to use them to lay out any RTL text in a portable manner. Just enclose any RTL run with RLO and PDF (POP DIRECTIONAL FORMATTING). No impact on any existing implementation, no impact on the standard. Those who produce rendering engines that do not support these overrides today could be leaned on to upgrade their implementations - that change would benefit users of non-PUA RTL languages as well (because sometimes the bidi algorithm can fail, such as for part numbers, and being able to use RLO is a simple way to stabilize such problematic text). Treating PUA characters as ON is very problematic - their display would become context sensitive in unintended ways. Users of CJK characters would not think of using LRM characters, but if text is inserted or viewed in an RTL context, it could behave unpredictably. In contrast, always supplying an RLO override for RTL text (containing PUA characters) would be a simple thing to remember and to get right. A./
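The workaround Asmus describes is a one-liner in practice (the PUA code points below are arbitrary placeholders for some hypothetical unencoded RTL script):

```python
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def force_rtl(run: str) -> str:
    # Wrap a run (e.g. PUA characters standing in for an unencoded
    # RTL script) so any conformant bidi renderer lays it out
    # right-to-left, regardless of the characters' default classes.
    return RLO + run + PDF

pua_run = "\U000F0000\U000F0001\U000F0002"  # hypothetical PUA text
wrapped = force_rtl(pua_run)
```

The trade-off discussed above applies: this overrides directionality wholesale, which is heavy-handed for mixed-direction content, but it is portable today without any change to the standard.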
Re: RTL PUA?
On 8/20/2011 6:44 PM, Doug Ewell wrote: Would that really be a better default? I thought the main RTL needs for the PUA would be for unencoded scripts, not for even more Arabic letters. (How many more are there anyway?) In any case, either 'R' or 'AL' as the Plane 16 default would be an improvement over having 'L' for the entire PUA. The best default would be an explicit PU - undefined behavior in the absence of a private agreement. However, it helps to remember why the PUAs exist to begin with. The demand came from East Asian character sets, which had long had such private use areas. In their case, the issue of properties did not seriously arise, because the vast bulk of private characters were ideographs. I bet this remains true, and so the original motivation for the suggestion of L as the default would still apply - no matter how unsatisfactory this is from a formal point of view. If maintaining the L default were to fail on the cliff of political correctness (or the fairness argument that has been made), the only proper solution is to use a value of unknown (i.e. the hypothetical PU value) for all private use code points. There are some properties where stability guarantees prevent adding a new value. In that case, the documentation should point out that the intended effect was to have a PU value, but for historical / stability reasons, the tables contain a different entry. Suggesting a structure on the private use area, by suggesting different default properties, ipso facto makes the PUA less private. That should be a non-starter. A./
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 8/19/2011 2:35 PM, Jukka K. Korpela wrote: 20.8.2011 0:07, Doug Ewell wrote: Of course, 2.1 billion characters is also overkill, but the advent of UTF-16 was how we ended up with 17 planes. And now we think that a little over a million is enough for everyone, just as they thought in the late 1980s that 16 bits is enough for everyone. The difference is that those early plans were based on rigorously *not* encoding certain characters, or on using combining methodology or variation selection much more aggressively. That might have been more feasible, except for the needs of migrating software and having Unicode-based systems play nicely in a world where character sets had different ideas of what constitutes a character. Allowing thousands of characters for compatibility reasons, more than ten thousand precomposed characters, and many types of other characters and symbols not originally on the radar still has not inflated the numbers all that much. The count stands at roughly double that original goal, after over twenty years of steady accumulation. Was the original concept of being able to shoehorn the world into sixteen bits overly aggressive? Probably, because the estimates had always been that there are about a quarter million written elements. If you took the current repertoire and used code-space-saving techniques in hindsight, you might be able to create something that fits into 16 bits. But it would end up using strings for many things that are now single characters. The numbers, so far, show that this original estimate of a quarter million, rough as it was, appears to be rather accurate: over twenty years of encoding characters have not been enough to exceed it. The million code points are therefore a much more comfortable limit and, from the beginning, assumed a ceiling with ample head-room (as opposed to the can-we-fit-the-world-in-this-shoebox approach of earlier designs). So, no, the two cases are not that comparable. A./
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 8/19/2011 3:24 PM, Ken Whistler wrote: On 8/19/2011 2:07 PM, Doug Ewell wrote: Technically, I think 10646 was always limited to 32,768 planes so that one could always address a code point with a 32-bit signed integer (a nod to the Java fans). Well, yes, but it didn't really have anything to do with Java. Remember that Java wasn't released until 1995, but the 10646 architecture dates back to circa 1986. Yep. So more likely it was a nod to C implementations which would, it was supposed, have implemented the 2-, 3-, or 4-octet forms of 10646 with a wchar_t, and which would have wanted a signed 32-bit type to work. I suspect, by the way, that that limitation was probably originally brought to WG2 by the U.S. national body, as they would have been the ones most worried about the C implementations of 10646 multi-octet forms. No, it was the Japanese NB, as represented by the individual from Toppan Printing. This limitation was insisted upon in 1991, after the accord on the merger between Unicode and 10646, when 10646 was changed to use a flat codespace, not the ISO 2022-like scheme. And the original architecture was also not really a full 32K planes in the sense that we now understand planes for Unicode and 10646. The original design for 10646 was for a 1- to 4-octet encoding, with all octets conforming to the ISO 2022 specification. It used the option that the working sets for the encoding octets would be the 94-unit ranges: for G0, 0x21..0x7E, and for G1, 0xA1..0xFE. The other bytes (C0, 0x20, 0x7F, C1, 0xA0, 0xFF) were not used except in the single-octet form, as in 2022-conformant schemes still used today for some East Asian character encodings. The octets were then designated G (group), P (plane), R (row), and C (cell).

The 1-octet form thus allowed 95 + 96 = 191 code positions.
The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions.
The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions.

The Group octet was constrained to the low set of 94. 
(This is the origin of the constraint to half the planes, which would keep wchar_t implementations out of negative signed range.) The 4-octet form thus allowed 94 * (94 +94)^3 = 624,599,168 code positions The grand total for all possible forms was the sum of those values or: *631,279,375* code positions (before various *other* set-asides for plane swapping and private use start getting taken into account) This was so mind-bogglingly complicated that it was a deal breaker for many companies. Unicode's more restrictive concept of a character or its combining technology or many other innovations weren't initially seen as its primary benefits by people being faced with evaluating the differences between the formal ISO-backed project and the de-facto industry collaboration forming around Apple and Xerox. But the flat code space, now you were talking. Of course, 2.1 billion characters is also overkill, but the advent of UTF-16 was how we ended up with 17 planes. So a lot less than 2.1 billion characters. But I think Doug's point is still valid: 631 million plus code points was still overkill for the problem to be addressed. And I think that we can thank our lucky stars that it isn't *that* architecture for a universal character encoding that we would now be implementing and debating on the alternative universe version of this email list. ;-) Even remembering it makes my head hurt. A./
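The arithmetic above is easy to check mechanically. A minimal sketch (the constants come straight from the figures quoted in the message; nothing here is part of any real API):

```python
# Code-space sizes of the original (pre-1991) ISO 10646 architecture as
# described above: P/R/C octets restricted to the two 94-unit working sets
# (G0: 0x21..0x7E, G1: 0xA1..0xFE), Group octet limited to the low set of 94.
usable = 94 + 94                      # usable values per P/R/C octet: 188

one_octet = 95 + 96                   # single-octet form used the full ranges
two_octet = usable ** 2               # 35,344
three_octet = usable ** 3             # 6,644,672
four_octet = 94 * usable ** 3         # Group octet: low set only -> 624,599,168

total = one_octet + two_octet + three_octet + four_octet
print(f"{total:,}")                   # 631,279,375 code positions
```

Running this reproduces the grand total of 631,279,375 cited in the message.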
Re: What are the present criteria...
On 8/18/2011 7:29 AM, Doug Ewell wrote: Karl Pentzlin <karl dash pentzlin at acssoft dot de> wrote: The quoted indicators for benefit were part of a concern of the German NB regarding the Wingding/Webding proposals. The concern expressed in WG2 N4085 is that some characters proposed there conform neither to the policy statements by UTC or WG2, nor to the indicators of benefit which the German NB would accept as an additional reason to encode Wingding/Webding characters beyond the formal policies of UTC and WG2. Nevertheless, N4085 is a German NB document, the criteria in question are those suggested by the German NB and not WG2 (and the document makes note of this distinction), and it is an error to portray this passage as representing either a change or a lack of clarity in UTC or WG2 policy. Karl makes no such claim. The document states that 2093-2096 appear to be in violation of the character-glyph model. I believe that's the section (or one of the sections) in the document that Karl summarizes here as policy statements by UTC or WG2 - at least it would fit. Anyway, it's more useful to focus on the actual concerns than on whether Karl summarized them correctly in his email. The German NB introduces the concept of an indicator of benefit [to] the user, and then defines that as:
- evidence of actual use
- evidence that it's likely a wrong character might be used for lack of an encoded character
- conformance to other standards
(I've slightly rephrased for clarity.) I have several problems with this approach. First, these indicators are rather haphazardly compiled. Overwhelming evidence of plain-text use and conformance requirements are already recognized as valid reasons to encode characters (not just symbols). They do not, however, help in evaluating those proposals where more nuanced judgement is required.
The third element, that the wrong character might be mistakenly used, is of overriding concern only in particular cases where questions of unification or disambiguation need to be decided. Second, it's really unsatisfactory if each NB has its own criteria for when to add characters to the standard, and it's especially unsettling when such criteria seem to be applied ad hoc to a given repertoire. WG2 and Unicode have had lengthy discussions and broad consensus about the kinds of criteria to take into account when encoding characters in general or symbols in particular. The result has been captured in a number of documents; for example, here's the original one from the UTC: http://unicode.org/pending/symbol-guidelines.html (with links to more recent versions). Unlike the list in N4085, the criteria adopted by UTC and WG2 are not formulated as PASS / FAIL. Instead, they were carefully designed to be used in assigning weight in favor or in disfavor of encoding a particular symbol as a character. This recognizes an important principle, which has been notably absent in much recent discussion: it is generally not possible to create any set of criteria that can be applied mechanistically (or algorithmically). The decision to encode a character is and remains a judgement call. Some calls are easy, because the evidence is overwhelming and direct; some calls are more difficult, because the evidence may be uncertain or indirect, or the nature of the proposed character may not be as well understood as one would ideally prefer. Recognizing these inherent difficulties in the encoding work and the need for a set of weighing factors instead of simplistic PASS / FAIL criteria was one of the early breakthroughs in the work of WG2 and UTC. Accordingly, the documents speak not of criteria for whether to encode characters, but of criteria that strengthen (resp. weaken) the case for encoding. That's a crucial difference.
While the details of these criteria (or factors) can and should be evaluated from time to time for continued appropriateness, the soundness of the general methodology is not in question, and UTC and WG2 should resist any attempts (directly or indirectly) to abandon them in favor of an unworkable, simplistic, and ad-hoc PASS / FAIL approach. What are relevant criteria? The document I cited lists the original set of criteria as follows:

What criteria strengthen the case for encoding? The symbol:
* is typically used as part of computer applications (e.g. CAD symbols)
* has well defined user community / usage
* always occurs together with text or numbers (unit, currency, estimated)
* must be searchable or indexable
* is customarily used in tabular lists as shorthand for characteristics (e.g. check mark, maru etc.)
* is part of a notational system
* has well-defined semantics
* has semantics that lend themselves to computer processing
* completes a class of symbols already in the standard
* is letterlike (i.e. should vary with the surrounding font style)
Re: Sanskrit nasalized L
On 8/16/2011 1:57 AM, Andrew West wrote: On 16 August 2011 02:59, Richard Wordingham <richard.wording...@ntlworld.com> wrote: All I've got to go on is the penultimate sentence in TUS 6.0 Section 10.2 - 'Rarely, stacks are seen that contain more than one such consonant-vowel combination in a vertical arrangement'. http://www.unicode.org/versions/Unicode6.0.0/ch10.pdf#G30110 Which is followed immediately by the caveat: These stacks are highly unusual and are considered beyond the scope of plain text rendering. They may be handled by higher-level mechanisms. That's all well and good. The question is: have any such mechanisms been defined and deployed by anyone? A./ The Tibetan script doesn't have a combining virama. I would expect the natural coding to be something like letter-vowel-subjoined letter-vowel, e.g. U+0F40 TIBETAN LETTER KA, U+0F74 TIBETAN VOWEL SIGN U, U+0FB2 TIBETAN SUBJOINED LETTER RA, U+0F74 TIBETAN VOWEL SIGN U. As the Unicode Standard explicitly states, non-standard stacks such as this (which really are highly unusual, and only occur in a few specific contexts) are outside the scope of plain text rendering, and are not defined by the standard. It therefore makes no sense for you to try to specify character sequences for such non-standard stacks. Andrew
Re: Non-standard Tibetan stacks (was Re: Sanskrit nasalized L)
On 8/16/2011 3:32 PM, Andrew West wrote: On 16 August 2011 18:19, Asmus Freytag <asm...@ix.netcom.com> wrote: These stacks are highly unusual and are considered beyond the scope of plain text rendering. They may be handled by higher-level mechanisms. The question is: have any such mechanisms been defined and deployed by anyone? In my opinion, until someone produces a scan of a Tibetan text with multiple consonant-vowel sequences, and asks how they can represent it in plain Unicode text, there is no question to be answered. Thank you Andrew - that clarifies the issue for the non-specialist. A./ Chris Fynn asked about certain non-standard stacks he was trying to implement in the Tibetan Machine Uni font in an email to the Tibex list on 2006-12-09, but these didn't involve multiple consonant-vowel sequences (one stack sequence was 0F43 0FB1 0FB1 0FB2 0FB2 0F74 0F74 0F71, which would be reordered to 0F42 0FB7 0FB1 0FB1 0FB2 0FB2 0F71 0F74 0F74 by normalization, and which would display differently). Other non-standard stacks that I have seen involve horizontal progression within the vertical stack (e.g. yang written horizontally in a vertical stack). More recently, the user community needed help digitizing Tibetan texts that used the superfixed letters U+0F88 and U+0F89 within non-standard stacks, resulting in a proposal to encode additional letters (http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3568.pdf). None of these non-standard stack use cases involved multiple consonant-vowel sequences, and I'm not sure whether I have ever seen an example of such a sequence. I have learnt that there is little point discussing a solution for a hypothetical problem, because when the real problems arise they are likely to be something different. Andrew
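The normalization behavior Andrew describes can be reproduced directly with any Unicode normalization library; here is a quick sketch using Python's unicodedata. U+0F43 carries a canonical decomposition to U+0F42 + U+0FB7 and is a composition exclusion (so even NFC leaves it decomposed), and the trailing vowel signs reorder by canonical combining class:

```python
import unicodedata

# The non-standard stack cited above (Chris Fynn's example).
src = "\u0F43\u0FB1\u0FB1\u0FB2\u0FB2\u0F74\u0F74\u0F71"

# Expected result: U+0F43 decomposes to U+0F42 + U+0FB7, and U+0F71
# (combining class 129) reorders before the two U+0F74 (class 132);
# the subjoined letters (class 0) stay put.
exp = "\u0F42\u0FB7\u0FB1\u0FB1\u0FB2\u0FB2\u0F71\u0F74\u0F74"

print(unicodedata.normalize("NFC", src) == exp)   # True
```

Since the reordered sequence displays differently, this is exactly the kind of sequence that normalization-conformant processing would silently alter, which is part of the problem with such stacks.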
Re: Greek Characters Duplicated as Latin
On 8/14/2011 1:39 PM, Richard Wordingham wrote: U+00B5 MICRO SIGN is an ISO-8859-1 character, and was therefore included as U+00B5. It normally precedes a Latin-script letter, and therefore it actually makes sense to treat it as a Latin-script character, and possibly give it a different shape in these contexts to the shape of the Greek letter in Greek text. I don't think that there's a strong and overriding reason to give this character a separate shape. As you note, the true reason that this character was encoded separately has to do with the requirement that the first 256 code points of Unicode should match 8859-1, so that simply widening a byte to 16 or 32 bits would transform 8859-1 data to UTF-16 or UTF-32. With the predominance of UTF-8 as the format for interchanging Unicode, something that wasn't foreseen from the beginning, this design criterion has lost some of its importance. However, it helped the migration to Unicode, by making conversion of the vast majority of data (at the time, ASCII and 8859-1 accounted for the bulk of existing data on the net) dead simple. With anything as radically different from its predecessors as Unicode, keeping as much familiarity as possible was a major concern. Now, once you list the small mu among the first 256 characters, you then have to ask the question what to do with the Greek alphabet. The basic alphabets are used in so many ways in software (for automatic numbering of headings, etc.) that disrupting this sequence (and leaving out the mu from the Greek alphabet) wasn't a realistic choice. Hence, the duplication. It does not alter the fact that the micro sign really is just a usage of the Greek small mu, and not actually a new entity. Because the micro sign was widely implemented in systems and fonts that do not support the full set of Greek characters, I wouldn't be surprised to find that there are instances where the design was adjusted to make it fit better in a Latin environment.
If so, these developments likely predate Unicode substantially, because this use of mu was supported in older technology as well. I recall seeing it on a (mechanical) typewriter keyboard. I'm not sure I agree with the need to have a Latinized mu, but it exists and there you have it. Having two separate code points will allow these characters to have a separate development in the future. U+2126 OHM SIGN is similar to U+00B5 MICRO SIGN, except that it is used on its own. Whether it should be merged with U+03A9 GREEK CAPITAL LETTER OMEGA is debatable, but that is what has been done. The Ohm sign should have been encoded as another example of squared letters and abbreviations. It comes from Asian character sets, where, inexplicably, it exists separately from and alongside the capital Greek Omega - which they also encode. In order to allow lossless conversion to/from these sets, there was a need to have a code point for the Ohm. The Omega for Ohm was never as widely used as the mu, and it's questionable whether there really was much of a development of a different form for it. The Asian fonts that I knew in the 80's did not have different forms. In modern usage, for new documents, this character should not be used. A./
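The different status of the two duplicates is visible in their normalization data (the Ohm sign is encoded at U+2126), which can be checked directly with Python's unicodedata:

```python
import unicodedata

# MICRO SIGN has only a *compatibility* mapping to GREEK SMALL LETTER MU:
# it survives NFC (the duplicate is kept for 8859-1 round-tripping) but
# folds away under NFKC.
assert unicodedata.normalize("NFC", "\u00B5") == "\u00B5"
assert unicodedata.normalize("NFKC", "\u00B5") == "\u03BC"

# OHM SIGN, by contrast, has a *canonical* singleton mapping to GREEK
# CAPITAL LETTER OMEGA: even NFC replaces it, which is one reason it
# should not be used in new documents.
assert unicodedata.normalize("NFC", "\u2126") == "\u03A9"
```

So any NFC-normalizing process already rewrites U+2126 to the Greek letter, while the micro sign remains stable under canonical normalization.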
Re: Anything from the Symbol font to add along with W*dings?
On 8/14/2011 12:51 PM, Jukka K. Korpela wrote: 14.8.2011 17:51, Doug Ewell wrote: This sounds like Jukka expects browsers to analyze the glyph assigned in the font to the code position for 'a' and decline to display it if it doesn't look enough like an 'a' (rejecting, for example, Greek 'α'). I'm not sure that is a reasonable expectation. That wouldn’t be reasonable, but what I expect is that fonts have information about the characters that the glyphs are for and browsers use that information. Something like that is required for implementing the CSS font matching algorithm: http://www.w3.org/TR/CSS2/fonts.html#algorithm Not all documents are HTML or CSS. Font overloading of this kind is common in many rich text documents and not limited to the Symbol font. Yes, it makes text non-portable in certain ways. Private use characters would have been a cleaner way to achieve the same non-portability. Windows will let you use private use characters to access symbol fonts (not just the Symbol font), but this feature is not widely used (despite the fact that it dates back to the earliest days of Unicode support on that platform). Why users voted with their feet (or keystrokes) is not a useful topic of speculation. The fact is, they did. The question here is whether it's useful to add additional code points to allow plain-text coverage of certain widespread fonts (of which the Symbol font is one) so that it's possible to use, for example, automated processes to re-encode font runs in older documents to make them more fully portable. If there are indeed some characters missing to complete that goal, the numbers are small, and similar fragments of mathematical symbols have been encoded before. I would see no principled objection - only the question whether these are truly still unmapped (however, I haven't researched these particular characters, so I'm not giving any comments related to them in particular). A./
Re: ZWNBSP vs. WJ (was: How is NBH (U0083) Implemented?)
The ambiguity of an initial FEFF was not desirable, but this discussion shows that certain things can't be so easily fixed by adding characters at a later stage. The more time elapses between the encoding of the ambiguous character and the later fix, the more software, the more data, and the more protocols exist that support the original character, creating backwards-compatibility issues. Incidentally, this is totally what I expected when the WJ was proposed, but sentiment in favor of its addition ran high at the time... The ZWNBSP was present in Unicode 1.0 (1991) while the WJ was added in 3.2 (2002), that is, about 10 years later. We are now an additional 10 years down the road, and instead of clarifying the issue, the interim result is that WJ has muddied the waters instead. Somewhere here are lessons to be learned. A./ -Original Message- From: Doug Ewell <d...@ewellic.org> Sent: Aug 5, 2011 8:49 AM To: unicode@unicode.org Subject: Re: ZWNBSP vs. WJ (was: How is NBH (U0083) Implemented?) Jukka K. Korpela <jkorpela at cs dot tut dot fi> wrote: So? It was, and it still often is, better to use ISO 8859-1 rather than Unicode, in situations where there is no tangible benefit, or just a small benefit, from using Unicode. For example, many people are still conservative about encodings in e-mail, for good reasons, so they use ISO 8859-1 or, as you did in your message, windows-1252. A word about my encoding choices. My first message on Thursday was sent from my home PC, using Windows Live Mail, and it used UTF-8 because I configured Windows Live Mail to do so. My second message was sent from my mobile device, and used Windows-1252. I don't know if there is a way to tell the device to use UTF-8 for outgoing messages, but I can say it was not my conscious intent to prefer Windows-1252 over Unicode. This message is being sent via a Web interface; I guess we'll find out what encoding it chooses for me. On the other hand, this isn’t comparable to ZWNBSP vs. WJ.
These control characters do the same job in text, as per the standard, so the practical question is simply which one is better supported. ZWNBSP, like WJ, is intended to inhibit breaking between words. Despite the other (and original) intended use of U+FEFF at the start of a text as a byte-order mark, there is a pervasive belief that an initial U+FEFF means the text should be treated as beginning with some kind of space character. This is silly, since there is no concept of between words at the start of a text, but it is nevertheless the way people perceive things. WJ was introduced to encourage users to separate these two functions. If users don't adopt it, the problem will never be solved. There are enough issues in Unicode that cannot be fixed due to stability concerns; it would be nice to be able to fix this one at least. I still question how many real-world texts use either U+FEFF or U+2060 to achieve this non-breaking behavior. ISO 8859-1 and Unicode perform very different jobs, so that using ISO 8859-1, you limit your character repertoire (at least as regards directly representable characters, as opposed to various “escape notations”). If you don’t need anything outside ISO 8859-1, the choice used to be very simple, though nowadays it has become a little more complicated (as e.g. Google Groups seems to munge ISO 8859-1 data in quotations but processes UTF-8 properly). UTF-8 has the property of being easily detected and verified as such, which solves part of the Google Groups problem (inability to detect which SBCS is being used). The other part of the problem is the practice of using heuristics to override an explicit charset declaration, but that is a topic for another day. I won’t make any statements about full compliance, but in Microsoft Office Word 2007, U+FEFF alias ZWNBSP does its basic job (inside text) in most situations, whereas U+2060 alias WJ seems to be not recognized at all and appears as some sort of a visible box.
So to have the job done, there is not much of a choice. (Word 2007 fails to honor ZWNBSP semantics after EN DASH, which is bad, but it does not make it useless in other situations.) It does always come down to a complaint against Microsoft, doesn't it? Unfortunately, Yucca is right here: opening Word 2007 and pasting a snippet of text with embedded ZWNBSP does display correctly, while the same experiment with embedded WJ shows a .notdef box. This seems to be a font-coverage problem, amplified by Word's silent overriding of user font choices: changing the font from the default Calibri to DejaVu Sans (and optionally back to Calibri) makes the display problem go away, but of course no user could reasonably be expected to go through that. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
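The point about UTF-8 being easily detected and verified holds because almost no legacy 8-bit text happens to also be valid UTF-8. A minimal sketch of such a detector (the function name and the Latin-1 fallback label are illustrative assumptions; real software would consult any declared charset first):

```python
def sniff(data: bytes) -> str:
    """Toy encoding heuristic for a byte string."""
    # An initial U+FEFF serialized in UTF-8 is an unambiguous signature.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence decodes under Latin-1, so this is only a
        # guess of last resort -- exactly the SBCS ambiguity noted above.
        return "iso-8859-1"

print(sniff("Grüße".encode("utf-8")))      # utf-8
print(sniff("Grüße".encode("latin-1")))    # iso-8859-1
```

The strict multi-byte structure of UTF-8 is what makes the `try`/`except` test reliable; no comparable check exists for distinguishing one single-byte charset from another.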
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/17/2011 2:47 AM, Petr Tomasek wrote: On Sun, Jul 17, 2011 at 10:14:55AM +0100, Julian Bradfield wrote: Wouldn't it be more economical to encode a single UNICODE ESCAPE CHARACTER which forces the following character to be interpreted as a printable glyph rather than any control function? I already thought about this but this would probably mean that algorithms (like the Unicode BiDi Algorithm) would have to be changed. Change that to: it would mean that ALL algorithms that interpret any of the invisible characters would have to change. The reason is, of course, that these codes would *reinterpret* existing characters. You could argue that Variation Selectors do the same, but they are carefully constructed so that they can be safely ignored. These suggested characters couldn't be safely ignored, because ignoring them would leave control/formatting codes in the middle of text where none were intended. Michael has it right: On 7/17/2011 2:35 AM, Michael Everson wrote: ... invisible and stateful control characters are more expensive than ordinary graphic symbols. In this case, the expense is so much higher as to rule out such an idea from the start. A./ PS: this doesn't mean that adding graphic symbols is the foregone thing to do, only that, if evidence points to the need to address this issue in character encoding, then using graphic symbols is the better way to go about it.
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webdingproposal)
On 7/17/2011 12:19 PM, Doug Ewell wrote: Asmus wrote: The reason is, of course, because these codes would *reinterpret* existing characters. You could argue that Variation Selectors do the same, but they are carefully constructed so that they can be safely ignored. Variation selectors don't change the interpretation of characters, only their visual appearance. The process of display is part of the more general concept of interpretation as this term is used in the Unicode Standard. A./ PS: and variation selectors don't necessarily even change the visual appearance of a character. If the glyph shape for the given character in the selected font already matches or falls into the glyphic subspace indicated by the variation sequence, then you would not observe any change. (Ditto for display processes that don't support variation selectors, but that's a whole different kettle of fish).
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/17/2011 12:19 PM, Philippe Verdy wrote: 2011/7/17 Asmus Freytag <asm...@ix.netcom.com>: On 7/17/2011 2:35 AM, Michael Everson wrote: ... invisible and stateful control characters are more expensive than ordinary graphic symbols. In this case, the expense is so much higher as to rule out such an idea from the start. A./ PS: this doesn't mean that adding graphic symbols is the foregone thing to do, only that, if evidence points to the need to address this issue in character encoding, then using graphic symbols is the better way to go about it. Another alternative: instead of encoding separate symbols for each control, we could as well encode symbols for each character visible in those symbols. E.g. to represent the glyph for the RLO control, we could encode three characters, one for each of R, L, and O, as DOTTED SYMBOL FOR LATIN CAPITAL LETTER R, DOTTED SYMBOL FOR LATIN CAPITAL LETTER L, DOTTED SYMBOL FOR LATIN CAPITAL LETTER O. These three symbols would have a representative glyph as the base letter from which they are derived, within a dotted rectangle. Then each of them would contextually adopt one of four glyph forms: the full rectangle, or the rectangle with the left or right side removed, or both sides removed. The selection would be performed selectively. I'm baffled: what problem is this elaborate scheme trying to solve? The problem was never in *how* to encode such symbols, but in *whether* they should be considered *characters* (and therefore need to be supported on the character level of the architecture). That point, whether there's a reasonable use case for them as characters, has not been settled, so the case for thinking about encoding solutions has not been established. When people write about a line feed character, they use LF or linefeed or 000A (or U+000A or 0x0A etc.). They commonly don't use the LF symbol character, nor any other unencoded symbol. I claim the same is true for ZWJ, RLO, PDF and all the other good characters.
Just because Unicode uses dashed box placeholders in the code charts hasn't made them the generally accepted, universally understood *symbols* for these characters. This is different from the pictures for control codes because at the time, these were widely supported in devices, and users of these devices (terminals) were familiar with the convention (staggered small letters) and many would recognize common control characters. So, let's keep a lid on devising ever more arcane and fragile encoding and pseudo-encoding options until there's consensus that this issue must be addressed on the character level. A./
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webdingproposal)
On 7/15/2011 10:48 PM, Doug Ewell wrote: I apologize for the unintended content-free post. It's my phone's fault. -- My dog ate the homework - 2011? :) A./
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/16/2011 1:53 AM, Michael Everson wrote: On 16 Jul 2011, at 04:37, Asmus Freytag wrote: It's not a matter of competing views. There's a well-defined process for adding characters to the standard. It starts by documenting usage. Yes, Asmus, and when one wants to do that, one writes a proposal. We aren't writing a proposal here. We're *talking* about things. I fully understand the difference between making a formal proposal (that can be acted upon) and informally chatting about the possible needs for some characters - and the chances that a successful proposal might be written. However, if the only hard information is assertions of personal preference such as Sometimes I might want to show a dotted box for NBSP and sometimes a real NBSP, it is a bit much to then conclude What I see is a certain unreasonability reflecting a certain conservatism because there isn't an immediate, public enthusiasm for the idea. A./ PS: My counter-assertion, that much of the technical literature uses the abbreviations in preference to dashed boxes, has been pointedly ignored by you. UAX #9, bidi, and UAX #14, line break, extensively discuss invisible characters - neither of these documents needs symbol characters; in fact, they would probably reduce clarity. This practice goes back over 15 years, so it can be seen as settled. (I further assert that I expect examples could be found outside the standard as well.) PPS: If anybody provides evidence (suitably documented for the level of discussion) of widespread use of symbolic depictions for certain invisible characters, I'd be quite open to review it and to base my future position on this new basis.
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
Karl, I've published similar surveys in the past, where the object was to get feedback on the desirability of further action. I stick by my recommendation in favor of keeping raw data out of the document registry and of doing the committee a favor by adding value in the form of a sifting or analysis of such data. Previewing the data is not the same as making a character encoding proposal, and there aren't any procedural rules for non-proposals, so there's nothing that prevents doing that. I have always provided some level of analysis, and I have not always chosen to register all such documents - for the reasons I gave you earlier. The original rationale for encoding certain symbols had been their widespread use. The word widespread is key here. At the time that Unicode was first created, symbol sets associated with printers defined widespread use. After these sets were baked into the 2600 and 2700 blocks, the phenomenal rise of Windows made the W/W-Dings sets even more widespread. As you and WG2 evaluate additional such widely disseminated fonts, you will need to come up with your own criteria of what constitutes widespread. Those criteria should be applied both to the fonts considered as potential sources of symbols, as well as to each category of symbols within these fonts. I'll be interested in looking at a list of Apple symbols, once it's categorized a bit better by symbol function and/or gives a better idea of which (and how many) symbols extend existing sets (e.g. by adding directional variants) and which (and how many) might possibly be only variants of existing symbols - and similar information like that. (Unlike a full character encoding proposal, I would not expect definite answers to these, but some tentative / approximate information would be nice.) A./
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/15/2011 1:08 AM, Karl Pentzlin wrote: In WG2 N4085 Further proposed additions to ISO/IEC 10646 and comments to other proposals (2011-05-25), the German NB had requested, re WG2 N4022 Proposal to add Wingdings and Webdings Symbols, among other points: Also, in doing this work, other fonts widespread on the computers of leading manufacturers (e.g. Apple) shall be included, thus avoiding the impression that Unicode or SC2/WG2 favor a single manufacturer. In supporting this, there is now a quick survey of symbol fonts regularly delivered with computers manufactured by Apple: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4127.pdf - Karl Karl, I believe that publishing this document in its current form is more of a disservice than a service to the committees or the larger community (a few individuals excepted). There appear to be a large number of symbols for which a Unicode equivalent can be identified with great certainty - and beyond that there seem to be characters for which such an assignment is perhaps more tentative, because of minor glyph differences, but still plausible. I believe that only when these two passes have been carried out will the document be of any reasonable use to wider audiences - as it is, everybody has to sift through all the characters, even the ones that are uninteresting (because their mappings are not in question, despite the lack of glyph names). Using Unibook, you can use the syntactic conventions of canonical and compatibility decomposition listings to show mappings of which you are certain or which look OK but need verification. Entirely questionable mappings could use the comment convention. In the input file used by Unibook, a TAB=SPACE at the start of a line, followed by a code point, can be used to show an identically-equal sign with the mapping in the output. A TAB%SPACE would show the approximately-equal sign, and a TAB*SPACE would yield a bullet (as for a comment).
Finally, you could use yellow and/or blue highlighting to mark characters needing particular levels of review. Once you have carried the analysis to that stage, the document would indeed be of interest to wider reviewers. It would still not be a proposal, but you would have done the necessary legwork in *analyzing* (or tentatively analyzing) the repertoire. A./
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/15/2011 9:03 AM, Doug Ewell wrote: Andrew West <andrewcwest at gmail dot com> replied to Michael Everson: I think that having encoded symbols for control characters (which we already have for some of them) is no bad thing, and the argument about too many characters is not compelling, as there are only some dozens of these characters encoded, not thousands and thousands or anything. I oppose encoding graphic clones of non-graphic characters on principle, not because of how many there are. I agree with Michael about a lot of things, and this isn't going to be one of them. The main arguments I am seeing in favor of encoding are: 1. Graphic symbols for control characters are needed so writers can write about the control characters themselves using plain text. When users outside the character encoding community start reporting such a need in great numbers, it would indicate that there might (might!) be a real requirement. The character coding community has had decades to figure out ways to manage without this - and the current occasion (review of Apple's symbol fonts) is not a suitable context to suddenly drag in something that could have been addressed anytime in the last 20 years, if it had been really urgent. I don't think there's any end to where this can go. As Martin said, eventually you'd need a meta-meta-character to talk about the meta-character, and then it's not just a size problem, but an infinite-looping problem. What real users need is to show hidden characters. That need can be served with different mechanisms. There seems to be no consensus, though, on what the preferred approach should be, and implementations disagree. That kind of issue needs to be addressed differently, involving the cooperation of major implementers. 2. The precedent was established by the U+2400 block. I thought those were compatibility characters, in the original sense: encoded because they were part of some pre-existing standard.
That's not necessarily a precedent in itself to encode more characters that are similar in nature. Doug is entirely correct. These are a precedent only if an extended set of other such symbols were found in use in some de-facto character set. In that special case, an argument for compatibility with *that* character set could be made. And for that to be successful, it would have to be shown that the character set is widely used and compatibility with it is of critical importance. In addition, I claim, experience has shown that the control code image characters are not widely used. That means any hope that the early encoders (and these go back to 1.0) may have had that those symbols are useful characters in their own right has simply not been borne out. 3. There aren't that many of them. We regularly dismiss arguments of the form But there's lots of room for these in Unicode when someone proposes to encode something that shouldn't be there. I don't see this as any different. Correct. The only time this argument is useful is in deciding between encoding the same character directly or as a character sequence. Using character sequences solely because of encoding space reasons, as opposed to the reason that the elements are characters in their own right, has become irrelevant due to the introduction of 16 more planes. The same is true for excessive unification of certain symbols or punctuation characters: saving code space is not a valid argument here - so any decision needs to be based on other facts. Michael is responsible for adding many thousands of characters to Unicode, so it's awkward for me to be debating character-encoding principles with him, but there we are. Well, in this business, no-one's infallible. A./
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/15/2011 2:23 AM, Karl Pentzlin wrote: On Friday, 15 July 2011 at 10:58, Asmus Freytag wrote: ... There appear to be a large number of symbols for which a Unicode equivalent can be identified with great certainty - and beyond that there seem to be characters for which such an assignment is perhaps more tentative, because of minor glyph differences, but still plausible. ... Once you have carried the analysis to that stage ... My intent was to present the data to people who want to continue the work in this way, and to encourage the discussion of the Apple symbols within the Wingding/Webding discussion, in line with the German NB request cited in my original mail. You would serve this goal much better if, instead of rushing to simply add raw data to the document pile, you had narrowed the issue down by limiting this further to characters that need real scrutiny. Such analysis as Asmus requested, done with the appropriate scrutiny and thus requiring a considerable amount of time, is in fact the next logical step of this work. This, however, does not necessarily have to be done by myself. So, essentially, you are dumping it on everyone. At this early stage (raw list) a better approach would have been to look for collaborators first and then collectively publish a document that provides useful analysis. The document registry should be limited to documents that can and should be reviewed in committee. Raw data collection with little or no value added does not belong, in my view. A./ PS: I feel strongly enough about this that I will not review the document in its current stage.
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/15/2011 10:26 AM, Michael Everson wrote: What I see is a certain unreasonability reflecting a certain conservatism. Text about the Standard is important, and should be representable in an interchangeable way. Here { } is a Right-to-Left Override character. I want to talk about it in a way that is visible. Oops. I can't do it interchangeably. Michael, let me give you an example: The Unicode Bidi Algorithm has extensive need to discuss this character, because it provides the specification for its use and support by implementations. If you look at that document (UAX#9), you find this character discussed widely (and you can save that document to plain text without losing the sense of that discussion). This example illustrates that we need to distinguish between the requirement to *discuss* characters and their use, and the perceived need to use *symbolic images* (glyphs) to do so. As the example of UAX#9 shows, one does not follow from the other. If there had been a universal requirement to use glyphs for this purpose, this requirement would have surfaced and could have been addressed anytime during the last 20 years. Another indication that this is not a universal requirement can be deduced from the fact that these glyphs do not show up in more font collections. Several symbols for space or blank were added, however, because widespread use in documentation was attested. The same avenue should in principle be open for other such symbols (and here I disagree with Andrew and Martin): If widespread use of glyphic symbols (as opposed to abbreviations and names) can be documented for some characters, then those characters, and those characters only, should have whatever symbol is used to represent them added to the standard. 
Also, like the example for SPACE, if there are different symbols, any of them that is widespread should be added - to unify symbols of different design based on the underlying concept that they represent would constitute improper unification, in my view. So, there, I'm not at all unreasonable - I just reasonably ask that the normal procedures for adding characters are to be followed. In this particular case, the Apple glyphs include glyphs for format characters that Unicode considers deprecated. Providing characters to encode glyphs for them would just be a waste. Further, while the glyphs shown match those from the Unicode code charts, they are not necessarily the shapes that are displayed when systems want to show these invisible characters - so users and documentation writers may need an entirely different set of glyphs. Finally, other vendors seem to not have endorsed these glyphs by including them in their font collections - much unlike the emoji, where multiple vendors had a large overlap of symbols, and with large overlap in glyphic representation as well. Therefore, I strongly urge the committees to separate out these meta characters from the ongoing *symbol collection* review. They can be taken up based on evidence of actual use (and showing the actual glyphs in such use) at a later occasion. A./
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/15/2011 11:05 AM, Doug Ewell wrote: What I see is a certain unreasonability reflecting a certain conservatism. Text about the Standard is important, and should be representable in an interchangeable way. Here { } is a Right-to-Left Override character. I want to talk about it in a way that is visible. Oops. I can't do it interchangeably. [RTL] or {RTL} or Right-to-Left Override or U+202E might all work. The conventional abbreviations are: RLO (Right-to-left override) RLE (Right-to-left embedding) RLM (Right-to-left mark) -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 7/15/2011 11:36 AM, Michael Everson wrote: However, I agree with Asmus that in the context of the Wingdings-type symbols these characters should not be considered. They should be considered as a whole on their own. Thank you Michael. To reiterate and restate (so it can be read out of context): If widespread use of particular glyphic symbols for certain invisible characters (as opposed to abbreviations and names) can be documented, then those symbols, and those symbols only, are eligible to be added to the standard. As in the example for SPACE, if there are different such symbols denoting the same invisible character, any of them that is widespread could be added. Care should be taken not to unify symbols of different design merely based on the fact that they represent the same invisible character. I simply ask that when and if these symbol characters are considered, the normal procedures for adding characters are to be followed. This includes adducing evidence of their use in documentation (other than the Unicode Standard itself) and similar publications. In particular, such documentation would need to be brought for each individual character (except perhaps for paired characters), as it is quite likely that some invisible characters are not documented extensively (for example the deprecated ones). Finally, it would be valuable if research into the use of such glyphic symbols was thorough enough to encompass a more or less complete range of glyphs used for each invisible character, not simply the Unicode chart glyph. A./
Definition of character
Jukka, reminding everyone of the definition of technical term as opposed to a word in everyday language isn't helping address the underlying issue. Everyone is familiar with this distinction. You note that there's a bit of a truism that underlies the definition of character and character encoding, but I would claim this is not limited to Unicode, and has nothing to do with promoting that standard. The truism goes like this: A character is what character encodings encode. As such, character also becomes the smallest unit on which algorithms for processing textual data operate. Historically, character encodings have also encoded, on otherwise equal footing, units that are intended for device control. Over time, some of the device control characters have been redefined as indicators of logical division of text. (TAB and LF are the most prominent examples of this evolution). These historical developments have left us with this and other examples of deep ambiguities in the definition of the members of those sets we call character encodings. These ambiguities are reflected in the technical (as opposed to everyday) usage of the term character. I fully agree with Ken that you can't fix this situation by definitional fiat. Let's look at the putative benefit of a better definition. I think such a benefit has implicitly been claimed to exist, but I would ask for a demonstration in this case. One possible benefit of a solid definition of the members of a set is in helping decide which additional entities should be made members of the set. Can there be a definition of character that provides a solid guidepost for evaluating future proposed character additions to the standard? Over twenty years of work on the Unicode Standard (and decades of work on earlier standards) have clearly demonstrated that it is impossible to devise an algorithm for deciding the question of which candidates are worthy of being encoded in Unicode (or any other character encoding). 
The problem goes back to the incredible diversity of writing systems and notations and their use. It is further complicated by the fact that breaking down a writing system into elements (identifying the characters) can quite often be done in more than one way. In many instances it's not even obvious which method is the best in a given circumstance. Attempts to base this process on mechanistic rules (driven by definitions) are bound to fail. Hence, characters are the outcome of a creative (human) process of analyzing writing systems. Once you have made a particular analysis, usually ending in an encoding, the elements thus defined are de facto the characters. If you were to accept that it is impossible to rigorously define characters for purposes of making this analysis, the problem becomes simpler. Abstract characters are then entities encoded in one (or more) character encodings, and character is what character encodings encode. Operationally, characters are the smallest units operated on by algorithms that process textual data. "Operated on" would sidestep the distinctions between characters that represent elements of a writing system like A and what Unicode calls format controls like RLM (or the segmentation characters like PS, LF, TAB). A bit is not the smallest unit, because the algorithms (as logically described) don't operate on bits; they are defined in terms of characters (or sequences of characters). For a fuller definition you might need to make clear that display is covered by "process", and you might find you need to find a way to cover the traditional use of control characters. They could be described as the smallest units operated on by algorithms that control devices displaying text, based on data embedded in a text stream. While there might be some improvement in rewording the glossary entries in this way, doing so neither removes the inherent tautology nor does it eliminate the fact that characters are very diverse in what they represent. 
But it might make clear that no definition of character will ever be sufficient to serve as input to the process of deciding the question of whether a proposed new entity is or isn't a character. A./
Re: Unicode 7.0 goals and ++
On 7/11/2011 11:57 AM, Ken Whistler wrote: On 7/10/2011 4:58 PM, Ernest van den Boogaard wrote: For the long term, I suggest Unicode should aim for this: That kind of terminological purity isn't going to occur. ... The Unicode Consortium has a glossary of terms: ... But the Unicode Standard is neither a software system nor a protocol stack, so trying to apply models appropriate to other realms probably isn't going to get too far. ... This much is *already* available. ... Unicode 9.0 should claim: Processes will be defined and published in *UML* 2.0 (for lack of an open standard) (Background: think UAX #9 Bidi written in a universal -graphic- language). This, on the other hand, is not going to happen. I don't see the UTC going for that at all. --Ken I might have the numbering wrong, or even the sequence. But not the main line, is it? Essentially, as Ken points out, this is not the trajectory that one would look forward to. So I would think you're off about what you call the main line. Not so coincidentally, I fully agree with his conclusions, as well as with the reasoning behind them. A./
Re: Proposed Update UAXes for Unicode 6.1
On 7/7/2011 8:42 PM, Karl Williamson wrote: On 07/07/2011 02:33 PM, announceme...@unicode.org wrote: Proposed updates for most Unicode Standard Annexes for Version 6.1 of the Unicode Standard have been posted for public review. Many of the documents appear to have no current modifications to review other than placeholders for future changes. That means they are proposed to remain unaltered, and / or that actual proposed changes might still be in the works. In either case, if you have an issue with the spec as written, now would be a good time to provide input that can lead to an improvement, correction or extension of the document and / or the corresponding data files. From watching this process over the years, requests for incompatible changes that aren't corrections of out-and-out errors will have a tough time in committee. Editorial clarifications usually have the best success at acceptance, as well as any issues related to improving the handling of newly encoded characters. Anything else will be in-between, subject to some cost-benefit analysis by the UTC, weighing the putative benefits of a change against the cost of not only making it in the documents, but also to existing implementations. A./
Re: unicode Digest V12 #108
On 7/3/2011 6:31 AM, Philippe Verdy wrote: Regarding the previous comment about the Danish aa, Sorry, most of that discussion missed the mark. Modern Danish can have AA for two reasons. Accidental occurrence, as in dataanalyse, which is composed of two words which just happen to put two A's together. The other is frozen spellings for names and the like. In the former case, you can never use å, in the latter case, you may not want to. In the former case, you do not want to sort AA as if it was å, in the latter case, you do. None of that has anything to do with ASCII - it's a question of orthographic practices, not of legacy encoding. Because accidental digraphs (in Danish) happen at word boundaries in a compound, the SHY is an elegant way to mark them. A./
Re: unicode Digest V12 #108
On 7/6/2011 12:16 AM, Jukka K. Korpela wrote: Allowing word division just to say that some characters do not constitute a digraph (or trigraph…) is not practical e.g. when the text has otherwise no word divisions, for one reason or another, or when the particular word division point is typographically suboptimal or even bad. I quite agree. But that's been my position from the start. In my very first post in this thread I had written: ...*if* such split [=word division] *is possible*, I would call it [=SHY] the preferred solution to indicating an accidental digraph. The corollary is that it's not a good thing to use SHY when there's no coinciding word division. True digraphs are usually not word division points, but in any language forming compounds, accidental combinations occur at word-division boundaries with some frequency. The Danes, over a decade ago, when they made the official recommendation to use SHY, appear to have come to the conclusion that AA can never occur accidentally, except at word division in compounds. A./
Re: unicode Digest V12 #108
On 7/2/2011 8:59 AM, Philippe Verdy wrote: 2011/7/2 Andrew Miller <a.j.mil...@bcs.org.uk>: The ng in Llangollen is not the digram ng but two separate letters (unlike the ll in the name which is the digram). Why not simply use a soft hyphen between n and g in this case? Soft hyphens are normally recognized as such by smart correctors and as well by search engines or collators. It seems enough to me to indicate that this is not the Welsh digram ng; CGJ anyway is certainly not the correct disjoiner in your case. This solution works well if the word can split between the n and the g. In fact, if such a split is possible, I would call it the preferred solution to indicating an accidental digraph. An example: The Danish digraph aa, normally spelled å in modern orthography, but retained in names etc., can occur accidentally in compound nouns, such as dataanalyse. Adding a SHY is the preferred method to indicate that the aa is accidental. Other characters may have the same effect of breaking the digraph, but their use might require an *additional* SHY to be inserted, if and when a linebreak opportunity needs to be manually marked (say for an unusual compound not recognized by the automatic hyphenator). It would be bad to have to have *two* invisible characters at that location.
Re: Typo in bidi reference implementation
On 7/1/2011 12:06 AM, Peter Krefting wrote: Hi! On line 65 of http://www.unicode.org/Public/PROGRAMS/BidiReferenceCpp/bidi.cpp (version 26) the word utility is spelled as uitlity (line 80 has the correct spelling). Not that it matters much, just something we noticed. If it's in a comment, and easily corrected by the reader, I'd lean towards not touching the file. Definitely not something for which one would want to release (and test) a new version. But we could fix the sources so that *if* there's ever a new version, it won't repeat the same issue. A./
Re: Latin IPA letter a
On 6/28/2011 1:51 AM, Michael Everson wrote: On 28 Jun 2011, at 09:28, Jean-François Colson wrote: In Times New Roman, which is the default font for MS Word (probably the best known word processor), the letters “a” and “ɑ” are indistinguishable in italics. That is a fault of the font. No, the font does what it's supposed to, which is to give the correct rendering of the letter 'a' for use in ordinary text. The problem is in Unicode's unification of the generic letter a with the IPA letter 'a', which has a restricted glyph range, and, as we now find out, must be treated differently when styles are applied. Encoding a new character is not the answer. However, encoding a standardized variation sequence would be the proper answer. Insisting that people have control over the font with which text is viewed is to a degree illusory. Not recognizing that fact is a weakness in Unicode's 1980s-based design in this instance. A standardized variation sequence makes the IPA nature of the IPA 'a' more portable, while at the same time cleanly allowing text processing software to treat it like the ordinary 'a', when needed, by simply ignoring the variation selector. Why this can be addressed for Han ideographs to the n-th degree, but the few egregious instances of required glyphic subset restrictions can't be made portable for Latin, escapes me totally. Time for Unicode to be brought into the 21st century in that respect. A./
Re: Unifon
On 6/28/2011 1:40 AM, Andreas Stötzner wrote: On 28 June 2011 at 09:43, Jean-François Colson wrote: I’m interested in Unifon (http://www.unifon.org). That’s a phonemic alphabet for English which is used to teach reading. Although it has been encoded in the ConScript Unicode Registry as a new script in a three-column block, it has in fact been designed as an extension of the Latin alphabet. Therefore, considering that three fifths of its letters are already available, I wonder whether a proposal shouldn’t be limited to the 16 missing letters. What’s your opinion? Is there a real need for regular encoding? If proposed as a kind of extension to Latin, there will be at least one issue to be considered carefully: Unifon does not fit the Latin writing system, since it is unicameral, not bicameral (as far as I can see). The same restriction applies to IPA and phonetic notations, all of which have been unified with Latin as far as common letters are concerned. By which I certainly do not intend to encourage any of the enthusiasts to think they ought now go to their desks and try to invent new lowercase glyphs. More relevant would be who uses this system, where and how widely. The answer to those questions decides, among other things, whether any standardization effort is warranted. A./
Re: UNICODE version of _T(x) macro
On 11/23/2010 1:58 AM, sowmya satyanarayana wrote: This is what I am actually looking for. My ODBC application supports UTF-16, which is 2-byte-width characters. This application is completely oriented around using the _T(x) macro, as Asmus Freytag figured out. Yeah, it's nice when you can do without, but if your code is filled with _T() macros for function arguments or static initializers, you've got to find a way to make it work. In 2003 there was an attempt to introduce u"x" to mean treat x as UTF-16 (and U"x" to mean treat x as UTF-32). With these extensions beyond L you can write a _T() macro that is precisely UTF-16. I don't know whether any recent compilers, especially in the Unix world, have taken up that convention, but it's worth a try to check out whether that solves your problem. At the same time there was an attempt to introduce char16_t and char32_t with guaranteed size and support as UTF-16 and UTF-32. If your compiler supports these, then it may support u and U for initializers. Otherwise, I'm afraid you may be stuck with your solution - but the problem is that you introduce temporary allocations and have memory-lifetime issues. I think your sample code would leak memory. In C++ you can define a simple object that's useful for wrapping static strings that are used as function arguments - the object will live just as long as needed (i.e. until the function returns). For other strings, your objects would have to be of global scope. But it's a pain nevertheless. A./
Re: Are Latin and Cyrillic essentially the same script?
On 11/22/2010 4:15 AM, Michael Everson wrote: It boils down to this: just as there aren’t technical or usability reasons that make it problematic to represent IPA text using two Greek characters in an otherwise-Latin system, Yes there are. Sorting multilingual text including Greek and IPA transcriptions, for one. The glyph shape for IPA beta is practically unknown in Greek. Latin capital Chi is not the same as Greek capital chi. so also there are no technical or usability reasons I’m aware of why it is problematic to represent this historic Janalif orthography using two Cyrillic characters. They are the same technical and usability reasons which led to the disunification of Cyrillic Ԛ and Ԝ from Latin Q and W. The sorting problem I think I understand. Because scripts are kept together in sorting, when you have a mixed-script list, you normally override just the sorting for the script to which the (sort-)language belongs. A mixed French-Russian list would use French ordering for the Latin characters, but the Russian words would all appear together (and be sorted according to some generic sort order for Cyrillic characters - except that for a bilingual list, sorting the Cyrillic according to Russian rules might also make sense). Same for a French-Greek list. The Greek characters will be together and sorted either by a generic Greek (script) sort, or a specific Greek (language) sort. When you sort a mixed list of IPA and Greek, the beta and chi will now sort with the Latin characters, in whatever sort order applies for IPA. That means the order of all Greek words in the list will get messed up. It will neither be a generic Greek (script) sort, nor a specific Greek (language) sort, because you can't tailor the same characters two different ways in the same sort. That's the problem I understand is behind the issue with the Kurdish Q and W, and with the character pair proposed for disunification for Janalif. 
Perhaps, it seems, there are some technical problems that would make the support for such mixed-script orthographies not as seamless as for regular orthographies after all. In that case, a decision would boil down to whether these technical issues are significant enough (given the usage). In other words, it becomes a cost-benefit analysis. Duplication of characters (except where their glyphs have acquired a different appearance in the other context) always has a cost in added confusability. Users can select the wrong character accidentally, spoofers can do so intentionally to try to cause harm. But Unicode was never just a list of distinct glyphs, so duplication between Latin and Greek, or Latin and Cyrillic, is already widespread, especially among the capitals. Unlike what Michael claims for IPA, the Janalif characters don't seem to have a very different appearance, so there would not be any technical or usability issue there. Minor glyph variations can be handled by standard technologies, like OpenType, as long as the overall appearance remains legible should language binding of a text have gotten lost. That seems to be true for IPA as well - because already, if you use the font binding for IPA, your a's and g's will not come out right, which means you don't even have to worry about betas and chis. IPA being a notation, I would not be surprised to learn that mixed lists with both IPA and other terms are a rare thing. But for Janalif it would seem that mixed Janalif/Cyrillic lists would be rather common, relative to the size of the corpus, even if it's a dead (or currently out of use) orthography. I'd like to see this addressed a bit more in detail by those who support the decision to keep the borrowed characters unified. A./
Re: UNICODE version of _T(x) macro
On 11/22/2010 10:18 AM, Phillips, Addison wrote: sowmya satyanarayana <sowmya underscore satyanarayana at yahoo dot com> wrote: Taking this, what is the best way to define a _T(x) macro of UNICODE version, so that my strings will always be 2-byte-wide characters? Unicode characters aren't always 2 bytes wide. Characters with values of U+10000 and greater take two UTF-16 code units, and are thus 4 bytes wide in UTF-16. Not exactly. The code units for UTF-16 are always 16 bits wide. Supplementary characters (those with code points >= U+10000) use a surrogate pair, which is two 16-bit code units. Most processing and string traversal is in terms of the 16-bit code units, with a special case for the surrogate pairs. It is very useful when discussing Unicode character encoding forms to distinguish between characters (code points) and their in-memory representation (code units), rather than using non-specific terminology such as character. If you want to use UTF-32, which uses 32-bit code units, one per code point, you can use a 32-bit data type instead. Those are always 4 bytes wide. The question is relevant to the C and C++ languages. What is asked: which native data type do I use to make sure I end up with a 16-bit code unit? The usual way a _T macro is used is TCHAR c = _T('x'); TCHAR *s = _T("x"); that is, to wrap a string or character literal so that it can be used either as a Unicode literal or as a non-Unicode literal, depending on whether some global compile time flat (usually UNICODE or _UNICODE) is set or not. The usual way a _T macro is defined is something like: #ifdef UNICODE #define _T(x) L##x #else #define _T(x) x #endif That definition relies on the compiler to support L'x' or L"string" by using UTF-16. A few years ago, there was a proposal to amend the C standard to have a way to ensure that this is the case in a cross-platform way. I can't recall offhand what became of it. A./
Re: UNICODE version of _T(x) macro
On 11/22/2010 11:08 AM, Asmus Freytag wrote: depending on whether some global compile time flat (usually UNICODE or _UNICODE) is set or not. recte: flag.
Re: Are Latin and Cyrillic essentially the same script?
On 11/18/2010 11:15 PM, Peter Constable wrote: If you'd like a precedent, here's one: Yes, I think discussion of precedents is important - it leads to the formulation of encoding principles that can then (hopefully) result in more consistency in future encoding efforts. Let me add the caveat that I fully understand that character encoding doesn't work by applying cook-book style recipes, and that principles are better phrased as criteria for weighing a decision rather than as formulaic rules. With these caveats, then: IPA is a widely-used system of transcription based primarily on the Latin script. In comparison to the Janalif orthography in question, there is far more existing data. Also, whereas that Janalif orthography is no longer in active use--hence there are not new texts to be represented (there are at best only new citations of existing texts), IPA is a writing system in active use with new texts being created daily; thus, the body of digitized data for IPA is growing much more than is data in the Janalif orthography. And while IPA is primarily based on Latin script, not all of its characters are Latin characters: bilabial and interdental fricative phonemes are represented using Greek letters beta and theta. IPA has other characteristics in both its usage and its encoding that you need to consider to make the comparison valid. First, IPA requires specialized fonts because it relies on glyphic distinctions that fonts not designed for IPA use will not guarantee. (Latin a with and without hook, and single-story vs. two-story g, are just two examples). It's also a notational system that requires specific training in its use, and it is caseless - in distinction to ordinary Latin script. While several orthographies have been based on IPA, my understanding is that some of them saw the encoding of additional characters to make them work as orthographies. 
Finally, IPA, like other phonetic notations, uses distinctions between letter forms on the character level that would almost always be relegated to styling in ordinary text. Because of these special aspects of IPA, I would class it in its own category of writing systems, which makes it less useful as a precedent against which to evaluate general Latin-based orthographies. Given a precedent of a widely-used Latin writing system for which it is considered adequate to have characters of central importance represented using letters from a different script, Greek, it would seem reasonable if someone made the case that it's adequate to represent an historic Latin orthography using Cyrillic soft sign. I think the question can and should be asked, what is adequate for a historic orthography. (I don't know anything about the particulars of Janalif, beyond what I read here, so for now, I accept your categorization of it as if it were fact). The precedent for historic orthographies is a bit uneven in Unicode. Some scripts have extensive collections of characters (even duplicates or near duplicates) to cover historic usage. Other historic orthographies cannot be fully represented without markup. And some are now better supported than at the beginning because the encoding has plugged certain gaps. A helpful precedent in this case would be that of another minority or historic orthography, or historic minority orthography, for which the use of Greek or Cyrillic characters with Latin was deemed acceptable. I don't think Janalif is totally unique (although the others may not be dead). I'm thinking of the Latin OU that was encoded based on a Greek ligature, and the perennial question of the Kurdish Q and W (Latin borrowings into Cyrillic - I believe these are now 051A and 051C). Again, these may be for living orthographies. 
/Against this backdrop, it would help if WG2 (and UTC) could point to agreed upon criteria that spell out what circumstances should favor, and what circumstances should disfavor, formal encoding of borrowed characters, in the LGC script family or in the general case./ That's the main point I'm trying to make here. I think it is not enough to somehow arrive at a decision for one orthography, but it is necessary for the encoding committees to grab hold of the reasoning behind that decision and work out how to apply consistent reasoning like that in future cases. This may still feel a little bit unsatisfactory for those whose proposal is thus becoming the test-case to settle a body of encoding principles, but to that I say, there's been ample precedent for doing it that way in Unicode and 10646. So let me ask these questions: A. What are the encoding principles that follow from the disposition of the Janalif proposal? B. What precedents are these based on, and what precedents are consciously established by this decision? A./
Re: Are Latin and Cyrillic essentially the same script?
On 11/18/2010 8:04 AM, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. When one language borrows a word from another, there are several stages of foreignness, ranging from treating the foreign word as a short quotation in the original language to treating it as essentially fully native. Now words are very complex in behavior and usage compared to characters. You can check pronunciation, spelling and adaptation to the host grammar to see which stage of adaptation a word has reached. When a script borrows a letter from another, you are essentially limited in what evidence you can use to document objectively whether the borrowing has crossed over the script boundary and the character has become native. With typographically closely related scripts, getting tell-tale typographical evidence is very difficult. After all, these scripts started out from the same root. So, you need some other criteria. You could individually compare orthographies and decide which ones are important enough (or established enough) to warrant support. Or you could try to distinguish between orthographies for general use within the given language, vs. other systems of writing (transcriptions, say). But whatever you do, you should be consistent and take account of existing precedent. 
There are a number of characters encoded as nominally Latin in Unicode that are borrowings from other scripts, usually Greek. A discussion of the current issue should include explicit explanation of why these precedents apply or do not apply, and, in the latter case, why some precedents may be regarded as examples of past mistakes. By explicitly analyzing existing precedents, it should be possible to avoid the impression that the current discussion is focused on the relative merits of a particular orthography based on personal and possibly arbitrary opinions by the work group experts. If it can be shown that all other cases where such borrowings were accepted into Unicode are based on orthographies that are more permanent, more widespread or both, or where other technical or typographical reasons prevailed that are absent here, then it would make any decision on the current request seem a lot less arbitrary. I don't know where the right answer lies in the case of Janalif, or which point of view, in Peter's phrasing, would make the most sense, but having this discussion without clear understanding of the precedents will lead to inconsistent encoding. A./
Re: Application that displays CJK text in Normalization Form D
On 11/15/2010 2:24 PM, Kenneth Whistler wrote: FA47 is a compatibility character, and would have a compatibility mapping. Faulty syllogism. Formally correct answer, but only because of something of a design flaw in Unicode. When the type of mapping was decided on, people didn't fully expect that NFC might become widely used/enforced, making these distinctions disappear wherever text is normalized in a distributed architecture. FA47 is a CJK Compatibility character, which means it was encoded for compatibility purposes -- in this case to cover the round-trip mapping needed for JIS X 0213. However, it has a *canonical* decomposition mapping to U+6F22. And that, of course, destroys the desired round-trip behavior if it is inadvertently applied while the data are encoded in Unicode. Hence the need to recreate a solution to the issue of variant forms with a different mechanism, the ideographic variation sequence (and corresponding database). The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47. Easily verified, for example, by checking the FA47 entry in NormalizationTest.txt in the UCD. While correct, it's something that remains a bit of a gotcha. Especially now that Unicode has charts that go to great lengths to show the different glyphs for these characters, I would suggest adding a note to the charts that makes clear that these distinctions are *removed* anytime the text is normalized, which, in a distributed architecture, may happen at any time. A./ --Ken When I type ... (U+FA47) into BabelPad, highlight it, and then click the button labeled Normalize to NFC, the character becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard in this case? ...
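The gotcha described above is easy to verify programmatically. A quick check with Python's unicodedata module shows that U+FA47's singleton canonical decomposition means *every* normalization form, including NFC, replaces it with U+6F22:

```python
import unicodedata

# U+FA47 is a CJK compatibility ideograph with a *canonical* (singleton)
# decomposition to U+6F22, so normalization - any form - erases the
# distinction between the two characters.
nfc = unicodedata.normalize("NFC", "\uFA47")
nfd = unicodedata.normalize("NFD", "\uFA47")
```

Both results are U+6F22; singleton decompositions are excluded from recomposition, so NFC never restores the compatibility ideograph.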
Re: CJK Compatibility Gotchas (was: Re: Application that displays CJK text in Normalization Form D
On 11/15/2010 5:43 PM, Kenneth Whistler wrote: Perhaps someone would like to make a detailed proposal to the UTC for how to fix the text and charts?;-) Ken, having shown yourself the master of detail in your reply, I think you've appointed yourself. A round of applause for Ken! See how easy that was? :) Cheers, A./ PS: I had something pithy in mind that would work for the charts - I'll send that off to the guy who maintains the nameslist.
Re: Application that displays CJK text in Normalization Form D
On 11/14/2010 12:57 PM, Doug Ewell wrote: Jim Monty jim dot monty at yahoo dot com wrote: Japanese kana (the J in CJK) and Korean syllables (the K in CJK) both have different normalization forms. What do ideographs have to do with anything? I didn't mention ideographs; you did. The term CJK is often used to refer to those characters which are common to Chinese and Japanese and Korean, viz. the ideographic characters. Doug, you might want to talk to the author of UTN#14 then, because he seems to be using the term CJK text in a sense that I find indistinguishable from the way Jim did. Any relation of yours? :) A./ PS: I too think that replacing the CJK text with Katakana and Hangul as a more specific choice would have been an improvement - as written it makes the problem sound more open-ended than it is. But you guys are arguing about an e-mail subject line, of all things.
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
If you want to get that point across to a general audience, you could use a more colloquial term, albeit one that itself derives from mathematics. Text that can be completely expressed in ASCII fits into something (ASCII) that works as a lowest common denominator of a large number of character sets. You could call it lowest common denominator text. Since ASCII is the only set that exhibits such a lowest common denominator relationship with enough other sets to make it interesting, and since that relation is so well known, it's usually enough to just refer to it by name (ASCII) without needing a general term - except perhaps for general audiences that aren't very familiar with it. In these kinds of discussions I find it invariably useful to mention that the copyright sign is not part of ASCII. (I suspect that it's the most common character that makes a text lose its lowest common denominator status). A./ On 11/10/2010 11:41 AM, Jim Monty wrote: Here's a peculiar question. Is there a standard term to describe text that is in some subset CCS of another CCS but, strictly speaking, is only really in the subset CCS because it doesn't have any characters in it other than those represented in the smaller CCS? (The fact that I struggled to phrase this question in a way that made my meaning clear -- and failed -- is precisely my dilemma.) Text that has in it only characters that are in the ASCII character encoding is also in the ISO 8859-1 character encoding and the UTF-8 character encoding form of the Unicode coded character set, right? I often need to talk and write about text that has such multiple personalities, but I invariably struggle to make my point clearly and succinctly. I wind up describing the notion of it in awkwardly verbose detail. So I'm left wondering if the character encoding cognoscenti have a special utilitarian word for this, maybe one borrowed from mathematics (set theory). Jim Monty
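The lowest-common-denominator property amounts to checking that every code point is below 128. A minimal sketch (the function name is illustrative, not an established term):

```python
def is_lowest_common_denominator(text: str) -> bool:
    """True if the text is pure ASCII and therefore valid, byte for
    byte, in ASCII, ISO 8859-1, and UTF-8 alike."""
    return all(ord(ch) < 128 for ch in text)

ok = is_lowest_common_denominator("plain text")
not_ok = is_lowest_common_denominator("\u00a9 2010")  # copyright sign is not ASCII
```

In Python 3.7+ the built-in str.isascii() does the same check.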
Re: Utility to report and repair broken surrogate pairs in UTF-16 text
On 11/4/2010 5:46 PM, Doug Ewell wrote: Markus Scherer wrote: While processing 16-bit Unicode text which is not assumed to be well-formed UTF-16, you can treat (decode) an unpaired surrogate as a mostly-inert surrogate code point. However, you cannot unambiguously encode a surrogate code point in 16-bit text (because you could not distinguish a sequence of lead+trail surrogate code points from one supplementary code point), and therefore it is not allowed to encode surrogate code points in any well-formed UTF-8/16/32. [All of this is discussed in The Unicode Standard, Chapter 3.] I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Unpaired surrogates are ill-formed, and have to be caught and dealt with. The question is whether you want every library that handles strings to perform the equivalent of a citizen's arrest, or whether you architect things so that the gatekeepers (border control) police the data stream. During development, early and widespread error detection is helpful in debugging. After that, it's probably better to concentrate handling these errors, because that would tend to improve your options for implementing successful error recovery. Malformed data shouldn't get in and shouldn't get perpetuated, but in the general case, there should be a facility for repairing faulty data, wherever that is reasonably possible. In the context of uppercasing a string, for example, repair is not a reasonable option, neither is rejecting the string at that point - it should have been rejected / repaired much earlier. A./
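A gatekeeper routine of the kind discussed here might look like the following sketch, which scans a sequence of 16-bit code units and substitutes U+FFFD for any unpaired surrogate. The function name and the replace-with-FFFD policy are illustrative choices, not something from the thread:

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER as a code unit

def repair_utf16_units(units):
    """Return a well-formed copy of a list of UTF-16 code units,
    replacing each unpaired surrogate with U+FFFD."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # lead surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out.extend((u, units[i + 1]))  # well-formed pair: keep both
                i += 2
                continue
            out.append(REPLACEMENT)  # unpaired lead surrogate
        elif 0xDC00 <= u <= 0xDFFF:
            out.append(REPLACEMENT)  # unpaired trail surrogate
        else:
            out.append(u)
        i += 1
    return out
```

Running every string through such a routine at the data-import boundary lets downstream string handling assume well-formed UTF-16.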
Re: Utility to report and repair broken surrogate pairs in UTF-16 text
On 11/5/2010 7:02 AM, Doug Ewell wrote: Asmus Freytag asmusf at ix dot netcom dot com wrote: I'm probably missing something here, but I don't agree that it's OK for a consumer of UTF-16 to accept an unpaired surrogate without throwing an error, or converting it to U+FFFD, or otherwise raising a fuss. Unpaired surrogates are ill-formed, and have to be caught and dealt with. The question is whether you want every library that handles strings to perform the equivalent of a citizen's arrest, or whether you architect things so that the gatekeepers (border control) police the data stream. If you can have upstream libraries check for unpaired surrogates at the time they convert UTF-16 to Unicode code points, then your point is well taken, because then the downstream libraries are no longer dealing with UTF-16, but with code points. Doing conversion and validation at different stages isn't a great idea; that's how character encodings get involved with security problems. Note that I am careful not to suggest that (and I'm sure Markus isn't either). Handling includes much more than code conversion. It includes uppercasing, spell checking, sorting, searching, the whole lot. Burdening every single one of those tasks with policing the integrity of the encoding seems wasteful, and, as I tried to explain, puts the error detection in a place where you'll most likely be prevented from doing something useful in recovery. Data import or code conversion routines are in a much better place, architecturally, to allow the user meaningful options to deal with corrupted data, from rejecting to attempts at repair. However, some tasks, such as network identifier matching, are security-sensitive and must re-validate their input, even if the data has already passed a gatekeeper routine such as a validating code conversion routine. Corrigendum #1 closed the door on interpretation of invalid UTF-8 sequences. I'm not sure why the approach to handling UTF-16 should be any different.
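Markus's point that a surrogate code point cannot be unambiguously encoded is visible in any strict encoder. Python's codecs, for example, refuse surrogates outright (this is Python behavior offered as an illustration, not something cited in the thread):

```python
# A well-formed supplementary character encodes without trouble:
# U+1F600 becomes the surrogate pair D83D DE00 in UTF-16LE.
encoded = "\U0001F600".encode("utf-16-le")

# A lone surrogate code point, however, is rejected by a strict
# encoder, since the output could not be distinguished from a
# genuine pair.
try:
    "\ud800".encode("utf-16-le")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```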
Re: A simpler definition of the Bidi Algorithm
On 10/17/2010 7:01 AM, Michael D. Adams wrote: This is something that not even the C++ and Java reference implementations do (though it appears that the C++ implementation of the W rules was originally derived from a regular expression as it uses state tables, but if so it is undocumented). (Which by the way they have not been proven to be equivalent, they have merely been tested. Proof is a much more complicated formalism.) Having written the C++ reference implementation, I know a thing or two about it :) The two implementations were not formally proven to be equivalent - in a way that wasn't the purpose. The purpose was to see whether several (in this case two) implementers could read the rules and come up with the same results. When someone creates a real-world implementation to a specification like the Bidi Algorithm, it's not usually verified by formal proof, but by testing. Therefore, the exercise had to do with finding out what level of testing was sufficient to capture inadvertent misapplication of some of the less-well-worded rules. (Some of them have since been rewritten to make their intent less ambiguous). Most of the issues were found with the test pass that simply compared all possible sequences up to length 6. That is better than the BidiTest.txt file, which I understand only goes to length 4. Stochastic sampling of sequences up to length 20 resulted in fewer reported discrepancies - again, all of this is from memory. For the test, the maximal depth of embeddings was set to 15 instead of 63, and the input were strings of bidi classes, not raw characters - therefore cutting down on the number of possible sequences. The Java implementation was deliberately designed to be transparent - matching the way the rules are formulated in an obvious way. For the C++ implementation I wanted to do something different, and possibly faster, so I hand-coded a few state tables. 
The biggest challenge was not in creating those tables, but in understanding the nuances of the rules, by the way. The situation today is not fully comparable, since there was some feedback from the reference implementation project to the rules and especially their wording. A./
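The exhaustive comparison described above - feeding every sequence of bidi classes up to a fixed length to both implementations - can be sketched as a generator. The class alphabet here is a reduced, illustrative subset, and each generated sequence would be passed to the two resolvers under comparison:

```python
from itertools import product

# A reduced alphabet of bidi classes (illustrative subset; the real
# test used the full set of classes rather than raw characters).
BIDI_CLASSES = ["L", "R", "AL", "EN", "AN", "ES", "CS", "ON", "WS", "NSM"]

def exhaustive_inputs(max_len):
    """Yield every sequence of bidi classes of length 1..max_len.

    A comparison harness would resolve each sequence with both
    implementations and flag any disagreement.
    """
    for n in range(1, max_len + 1):
        yield from product(BIDI_CLASSES, repeat=n)
```

With 10 classes, length 6 already means over a million test cases, which is why stochastic sampling was used for the longer sequences.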
Re: A simpler definition of the Bidi Algorithm
On 10/17/2010 10:59 AM, Michael D. Adams wrote: The biggest challenge was not in creating those tables, but in understanding the nuances of the rules, by the way. Two questions so I can understand better. First, by nuances do you mean the nuances of how the rules interact (which I think would be simplified by using a definition as I have proposed) or some other nuance? Neither - as they evolved over time, the rules were revised to more clearly state how to handle certain edge cases and to remove language that could be (and had been) misinterpreted. In other words, the statement of the rules has improved. Now that we have a field-tested set of rules, it's of course easy to re-write them, because you can be certain to know what they mean. Perhaps by going your route, we would have arrived at the same result. Who knows. That's the difference between theory and history. History takes one, and only one, of the possible paths to get to a result, and it doesn't give a whit about whether that path was optimal. If you'd been a contributor then, history might well have proceeded differently. Cheers, A./
Re: [unicode] Telugu Unicode Encoding Review
On 10/16/2010 10:38 AM, suzuki toshiya wrote: Hi, I've never heard any comments about reserving codepoints to make the code chart structure similar among multiple scripts - none positive, none negative. So your comment is interesting. Could you tell me more about what kind of disadvantages you're thinking of? The source for this arrangement is an Indian National Standard. As chapter 9 of TUS states in the introduction: They are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII). The important thing to remember is that when Unicode was first created, it was seen as very important to mimic the layout of 8-bit character sets for a given script - at least for those scripts that had fairly well established standards in the 80s. While this seems quaint now, it did make it easier for people to become comfortable with Unicode - and to be able to tell quickly and reliably whether important character sets were fully covered. Without that, Unicode might never have established itself - as unbelievable as that may sound to those who did not experience that transition period first hand. A./ If Telugu users are using 7-bit or 8-bit encoding and they want to use more codepoints for unencoded characters, the disadvantage (the reduction of the available codepoints) is clear. But... you're talking about Unicode. Regards, mpsuzuki Kiran Kumar Chava wrote (2010/10/17 2:06): Hi, At the link, http://geek.chavakiran.com/archives/55 , I tried to understand Telugu Unicode encoding and then I tried to do an out of box review of this encoding. Kindly let me know if I am missing something, and whether the items mentioned as missing in the above article are really missing or not. Any other views... Thanks in advance, Kiran Kumar Chava http://chavakiran.com
Re: statistics
On 10/11/2010 9:49 PM, Janusz S. Bień wrote: On Mon, 11 Oct 2010 announceme...@unicode.org wrote: The newly finalized Unicode Version 6.0 adds 2,088 characters, What is the current total? Is other statistical information available somewhere? The announcement gives a link to click through. There you will find more statistics. A./ Best regards JSB
Re: Irrational numeric values in TUS
Ken, some comments, and a few suggestions near the end. On 10/12/2010 4:56 PM, Kenneth Whistler wrote: Karl Williamson asked: The Unicode standard only gives numeric values to rational numbers. Is the reason for this merely because of the difficulty of representing irrational ones? No. Primarily it is because the Unicode Standard is a *character* encoding standard, and not a standard for numeric values for various mathematical constants that some characters might be used to represent. Correct. I consider EULER CONSTANT an unfortunate misnomer from the very, very early days of the Unicode Standard. If we had it to do over, particularly given the later addition of all the styled mathematical alphanumerics, I would have favored: 2107 [insert stylename here] CAPITAL E = Euler constant Or something similar -- just to make the point clearer. Actually, what you advocate here is what I consider the mistake that was made with the WEIERSTRASS ELLIPTIC FUNCTION. The problem is that the Letterlike Symbols were conflated with styled letters used as symbols. They are not at all the same category. The Planck constant is a styled letter used as a symbol, and is correctly unified with the italic h, but the Planck constant divided by 2 * pi, or h-bar, is not a styled letter but a symbol derived from a styled letter - a true letterlike symbol. 2107 and 2118 are one-off designs, not part of complete sets, same as 210F. Because these characters came from not-well-understood legacy collections, and because the styled letters used as symbols were initially deemed inadmissible to Unicode as complete sets, these distinctions weren't clear at the time. NamesList.txt says that U+03C0, GREEK SMALL LETTER PI is used for the ratio of a circle's circumference to its diameter, but it has other uses as well, and does not have the Math property. Having the Math property basically has nothing to do with whether a character is assigned a Numeric_Value or not. Correct. 
The various Math PI's don't seem that they necessarily mean this value either. Things like the two characters that have Planck's constant in their names, even if the code points always meant that, have different values in different measurement systems, so couldn't be said to refer to particular numbers. I'm curious if any thought was given to this, and what code points I'm missing in my analysis. U+1D452 MATHEMATICAL ITALIC SMALL E (or merely U+0065 LATIN SMALL LETTER E), also used for Euler's number. See also U+2147. Now you are confusing Euler's constant - also depicted with U+03B3 GREEK SMALL LETTER GAMMA - with the natural exponent. That kind of confusion is really not helpful and is what drives people like Karl to ask for numeric property values in the first place - to unambiguously define what these symbols were encoded for. The proper place to document that, without introducing a formal property, is with additional nameslist annotation for a few characters. I suggest that you add the correct value for Euler's constant as a comment and cross-reference that character to 03B3 0.57721 56649 01532 86060 65120 90082 40243 10421 59335 93992 should be approximate enough...? At the same time you could add a comment e ≈ 2.718 for 212F - again, not to document the value, but to make clear, beyond the character name, what constant the alias for 212F denotes. For that matter, why stop with irrationals? There is also U+1D456 MATHEMATICAL ITALIC SMALL I (or merely U+0069 LATIN SMALL LETTER I), used for the imaginary number, square root of -1. See also U+2148 and U+2149. Basically, there is no end to how mathematicians may end up assigning odder and more exotic kinds of numbers to various symbols available in the standard. And I think how they do so and exactly what those values mean is basically out of scope of the Unicode Standard. 
Correct - it's not Unicode's role to make the assignment, but common usage can and should be documented informally - that's no different to documenting modifier letters with detailed linguistic usage. A./
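For readers keeping the two constants apart: Euler's constant γ is the limit of the harmonic sum minus the natural logarithm, while Euler's number e is the base of the natural logarithm. A quick numerical check (offered here as an illustration, not part of the original exchange) shows how different they are:

```python
import math

# Euler's constant gamma = lim (H_n - ln n); a partial sum to n = 10**6
# gets within about 5e-7 of the true value 0.5772156649...
n = 1_000_000
gamma_approx = sum(1.0 / k for k in range(1, n + 1)) - math.log(n)

# Euler's number e = 2.7182818284... is an entirely different constant.
e = math.e
```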
Re: 00B7 vs. 2027
On 9/18/2010 8:36 AM, abysta wrote: Hello. I need a dot to separate words into syllables. What should I use, 00B7 or 2027, and why? 2027 is explicitly intended to be used to show syllables as is done in dictionaries. You don't make it explicit in your query, but it sounds like that is the purpose you are looking for. So, don't hesitate, use 2027. The nice thing about 2027 is that you can always filter it back out, because, by intent, it is not part of the word. A./
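The filtering-back-out property mentioned above is exactly what makes U+2027 convenient: because the hyphenation point is not part of the word, stripping it recovers the plain text. For example:

```python
# U+2027 HYPHENATION POINT marks syllable breaks, dictionary-style;
# removing it yields the undecorated word.
marked = "dic\u2027tion\u2027ar\u2027y"
plain = marked.replace("\u2027", "")
```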
Re: 00B7 vs. 2027
On 9/18/2010 10:56 AM, Lorna Priest wrote: U+00B7 MIDDLE DOT is semantically ambiguous and has (partly therefore) varying renderings, and it might be used as a replacement for U+2027 if the latter cannot be used reliably. What about using U+02D1 - half triangular colon? Why not use the character that was added to Unicode precisely for the purpose? A./
Re: A simpler definition of the Bidi Algorithm
The first discussions that led to the current formulation of the bidi algorithm easily go back 20 years by now. There's some value in not re-stating a specification - even if a new formulation could be found to be 100% equivalent. That value lies in the fact that any reader can tell, by simple inspection, that the specification hasn't changed, and that implementations that claim conformance to earlier versions of the specification are indeed still conformant to later versions. This point is particularly important for the bidi algorithm, because of its mandatory nature and the fact that it gets re-issued with a new version number every time the underlying Unicode standard gets a new version (because of new characters added, etc.). That does not preclude other, equivalent formulations of the algorithm, whether in text books or, perhaps, as a Technical Note. But the burden is on the creators of these other formulations to show that their supposedly easier or more didactic presentation is indeed equivalent. Having said that, there are already two other formulations of the algorithm that are proven to be equivalent to each other (and have not been shown to deviate from the written algorithm). I'm referring of course to the C++ (http://www.unicode.org/Public/PROGRAMS/BidiReferenceCpp/) and Java reference implementations. A./ PS: Personally, I don't find the presentation in terms of the regular expressions any more intuitive than the original.
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On 8/6/2010 2:03 AM, William_J_G Overington wrote: On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote: I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. Why can't a poet find a poetic means of doing that, instead of depending on a standards organization to provide a standard means of doing so in plain text? Seems kind of anti-poetic to me. ;-) --Ken Well, I was just suggesting an example. I am not an expert on poetry. What you mean are artistic or stylistic variants. These have certain problems, see here for an explanation: http://www.unicode.org/forum/viewtopic.php?p=221#p221 A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 8/5/2010 3:47 AM, William_J_G Overington wrote: On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote: However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Are you here talking about such things as alternate glyph styles? No, I am referring to the element of the proposal that proposes to have a variation sequence that selects the unspecified form for lower case a. It depends what one means by need. I've written a longer answer here: http://www.unicode.org/forum/viewtopic.php?f=9t=83start=0 A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 8/2/2010 5:04 PM, Karl Pentzlin wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin This is an interesting proposal to deal with the glyph selection problem caused by the unification process inherent in character encoding. When Unicode was first contemplated, the web did not exist, and the expectation was that it would nearly always be possible to specify the font to be used for a given text and that selecting a font would give the correct glyph. As the proposal notes, universal fonts and viewing documents on other platforms and systems across the web have made this solution unattractive for general texts. We are left then with these five scenarios:
1) Free variation
2) Orthographic variation of isolated characters (by language, e.g. different capitals)
3) Orthographic variation of entire texts (e.g. italic Cyrillic forms, by language)
4) Orthographic variation by type style (e.g. Fraktur conventions)
5) Notational conventions (e.g. IPA)
For free variation of a glyph, the only possible solutions are either font selection or use of a variation sequence. I concur with Karl that in this case, where notable variations have been unified, adding variation selectors is a much more viable means of controlling authorial intent than font selection. If text is language tagged, then OpenType mechanisms exist in principle to handle scenarios 2 and 3. For full texts in a certain language, using variation selectors throughout is unappealing as a solution. 
However, it may be a viable solution for being able to embed correctly rendered citations in other text, given that language tagging can be separated from the document and that automatic language tagging may detect large chunks of text, but not short runs. The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense. Notational conventions are addressed in Unicode by duplicate encoding (IPA) or by variation sequences. The scheme has holes, in that it is not possible in a few cases to select one of the variants explicitly, instead, the ambiguous form has to be used, in the hope that a font is used that will have the proper variant in place for the ambiguous form. Adding a few variation sequences (like the one to allow the a at 0061 to be the two story one needed for IPA) would fill the gap for times when controlling the precise display font is not available. However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Overall a valuable starting point for a necessary discussion. A./
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On 8/4/2010 1:30 PM, verdy_p wrote: Asmus Freytag wrote: The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense. I don't think so. If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this case, the conversion to long s will be inappropriate. So use the Fraktur round s directly. This statement makes clear that you don't understand the rules of typesetting text in Fraktur. If a text in Fraktur absolutely requires the long s, it's only when the original text was already using this long s. This statement is also incorrect. The rules for when to use long s in Fraktur and when to use round s depend on the position of the character within the word in complicated ways. The same word, typeset using Antiqua style, will not usually have the long s. For German, there exist a large number of texts that were typeset in both formats, so you can compare for yourself. Even in France, I suspect that research libraries would have editions of 19th century German classics in both formats. In that case, encode the long s: The text will render with a long s in both modern Latin font styles like Bodoni (with a possible fallback to modern round s if that font does not have a long s), and in classic Fraktur font styles (with here also a possible fallback to Fraktur round s if the Fraktur font omits the long s from its repertoire of supported glyphs). 
I'm skipping the rest of your message, because you've started from a wrong premise, and sorting out which bits still apply after accounting for that wrong premise is not something I have the time, energy, or inclination for. Sorry, A./
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Philippe, Text typeset in Fraktur contains more information than text typeset in Antiqua. That means there are some places where there are some (mild) ambiguities in representation in the Antiqua version. Not enough to bother a human reader, who can use deep context to read the text correctly, but enough so that a mere typesetting system cannot correctly render such a text in Fraktur. I'm not currently aware of anything that would prevent an automated system from converting a text encoded for Fraktur to one encoded for Antiqua, because you are merely throwing away information. So far we agree. The question is whether it would be possible to make this process work by default in common, unmodified rendering engines, and whether that is desirable. (I don't treat either of these questions as settled one way or the other - so please don't attribute a position to me on that subject). What I do know is that there are historic documents using Antiqua fonts that do use the long s. Therefore, in principle, you don't necessarily want to create fonts that map long to round s. And, as an author, you can't rely on such a font being present on the reader's end - it might just as likely be one that does implement the long s. So, whatever automatic rendering of Fraktur-ready text with non-Fraktur general purpose fonts you have in mind should not rely on this kind of non-standard glyph substitution. That would be a terrible hack, imperiling the ability of people to use the long s outside the context of the Fraktur tradition. All I had argued for was that Karl should take out the consideration of rendering text encoded for Fraktur from his proposal and make it part of a separate document that addresses ALL issues of this type of rendering, making it a complete specification - that would be something that allows review on its own merits. A./
Re: Plain text
On 7/28/2010 9:32 PM, Doug Ewell wrote: Murray Sargent murrays at exchange dot microsoft dot com wrote: It's worth remembering that plain text is a format that was introduced due to the limitations of early computers. Books have always been rendered with at least some degree of rich text. And due to the complexity of Unicode, even Unicode plain text often needs to be rendered with more than one font. I disagree with this assessment of plain text. When you consider the basic equivalence of the same text written in longhand by different people, typed on a typewriter, finger-painted by a child, spray-painted through a stencil, etc., it's clear that the sameness is an attribute of the underlying plain text. None of these examples has anything to do with computers, old or new. That may be, but the way Unicode plain text is designed is based on the concept of plain text in computers, and what that means was hashed out long before Unicode arrived on the scene. To a large measure, what Unicode did was extend that concept to additional writing systems (and to historic or rarely used nooks and crannies of some of the existing writing systems). In the process, your definition of plain text was pulled out, dusted off, and used as a philosophical underpinning of the enterprise - but the technologists in the effort did not first discard any notions of computer-based plain text before proceeding. In other words, claiming a clean break between the existing ASCII plain text and Unicode would be a falsification. I do agree that rich text has existed for a long time, possibly as long as plain text (though I doubt that, when you consider really early writing technologies like palm leaves), but I don't think that refutes the independent existence of plain text. And I don't think the need to use more than one font to render some Unicode text implies it isn't plain text. I think that has more to do with aesthetics (a rich-text concept) and technical limits on font size.
No, it's not headings and the like. If you pull together a selection of ordinary books in the English language and remove rich text attributes, you will find that a considerable fraction of the works will exhibit subtle changes in meaning - these works require italics to mark emphasis in places where the same sequence of words can be read in different ways. Scholarly works require italics for citations - absent italics, some other method would need to be introduced to mark titles; without any designation, there can and will be ambiguities. Hence, not all texts can be expressed as plain text. If you take a German text, rendered (by a human typesetter) in Fraktur and rendered (by a later typesetter) in Antiqua, you will find that the second version has less information in it, when you encode both texts on a computer. And many texts that can be represented as plain text if they are to be rendered in Antiqua cannot be plain text if they are to be rendered according to the rules of typesetting a work in the Fraktur style - again, we are talking ordinary running text, no headings, bibliographies or anything. The additional information is not of an aesthetic or stylistic nature, but tied to the meaning of certain words - that which Unicode calls semantic. In other words, the text, as rendered in Antiqua, allows for potential ambiguities - not necessarily fatal ones, because context may easily resolve them, but they are there, nevertheless. This is just one example of how the concept of an abstract content of a piece of text is not nearly as clear-cut as you might think. On the contrary, the definition of Unicode plain text is straightforward: a sequence of Unicode characters without any style information. A./
Re: High dot/dot above punctuation?
On 7/28/2010 2:02 AM, Kent Karlsson wrote: On 2010-07-28 09:50, Jukka K. Korpela jkorp...@cs.tut.fi wrote: André Szabolcs Szelp wrote: Generally, for the decimal point . (U+002E FULL STOP) and , (U+002C COMMA) are used in the SI world. However, earlier conventions could use different notation, such as the common British raised dot, which centers with the lining digits (i.e. that would be U+00B7 MIDDLE DOT). The different dot-like characters are quite a mess, but the case of the British raised dot is simple: it is regarded as a typographic variant of FULL STOP. Ref.: http://unicode.org/uni2book/ch06.pdf (second page, paragraph with run-in heading Typographic variation). And the names list says: 002E FULL STOP = period, dot, decimal point * may be rendered as a raised decimal point in old style numbers. However, I think that is a bad idea: firstly, the digits here aren't necessarily old style (indeed, André wrote lining, i.e. NOT old style). And even if they are old style, it seems to me a bad idea to make this a contextual rendering change for FULL STOP (and it also says may, not shall, so there is no way of knowing which rendering you should get even with old style digits). Better to stay with the MIDDLE DOT for the raised decimal dot. The real problem I have with this annotation is that it recommends a practice that I strongly suspect has never been implemented in the entire 20 years since it's been on the books. (If anyone knows of an implementation that has contextual rendering of FULL STOP, I'd like to learn about it here.) If a particular text uses both raised periods and raised decimal points, then I see use in being able to use 002E for this and make it change by using a font with a different glyph. But if it applies only to the decimal point, overloading 002E would require a degree of context analysis that I believe is unimplemented (see above).
If my suspicion is true, then, at the minimum, the annotation should be reworded so that it doesn't seem to imply a practice that doesn't exist. Further, I don't see any major problem with using U+02D9 DOT ABOVE for the high dot in this case. Me neither - if it's positioned right, then it should be used. Duplicating dots by function is definitely a no-no. However, unifying punctuation characters with definite differences in appearance only works well if these differences are systematically applied with a type-style (font) selection and then apply to the entire text in each font. Such as the use of a double oblique glyph for HYPHEN (and HYPHEN-MINUS) in Fraktur fonts. A./
Re: High dot/dot above punctuation?
On 7/28/2010 10:09 AM, Murray Sargent wrote: Contextual rendering is getting to be more common thanks to adoption of OpenType features. For example, both MS Publisher 2010 and MS Word 2010 support various contextually dependent OpenType features at the user's discretion. The choice of glyph for U+002E could be chosen according to an OpenType style. I know that the technology exists that (in principle) can overcome an early limitation of 1:1 relation between characters and glyphs in a single font. I also know that this technology has been implemented for certain (but not all) types of mappings that are not 1:1. However, the question I raised here is whether such mechanisms have been implemented to date for FULL STOP. Which implementation makes the required context analysis to determine whether 002E is part of a number during layout? If it does make this determination, which OpenType feature does it invoke? Which font supports this particular OpenType feature? A./
Re: Reasonable to propose stability policy on numeric type = decimal
On 7/28/2010 10:13 PM, Martin J. Dürst wrote: Sequences of numeric Kanji are also used in names and word-plays, and as sequences of individual small numbers. But the same applies to our digits. A very simple example is to use them as a ruler in plain text:

         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890

Didn't see this before I sent mine. Martin says it better. A./
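The ruler trick Martin mentions is easy to generate mechanically; here is a minimal Python sketch (the function name is my own):

```python
def ruler(width: int) -> str:
    """Build the two-line plain-text column ruler shown above:
    a tens line with a digit at every tenth column, over a
    repeating 1234567890 ones line."""
    tens = "".join(str((i // 10) % 10) if i % 10 == 0 else " "
                   for i in range(1, width + 1))
    ones = "".join(str(i % 10) for i in range(1, width + 1))
    return tens + "\n" + ones

print(ruler(70))
```
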
Re: Why does EULER CONSTANT not have math property and PLANCK CONSTANT does?
On 7/27/2010 3:02 PM, Kenneth Whistler wrote: Karl Williamson asked: Subject: Why does EULER CONSTANT not have math property and PLANCK CONSTANT does? They are U+2107 and U+210E respectively. Because U+210E PLANCK CONSTANT is, to quote the standard, simply a mathematical italic h. It serves as the filler for the gap in the run of mathematical italic letters at U+1D455. Correct - they form a set and need to be treated consistently. Other letterlike symbols in that block are not given the Other_Math property, even if they may be used in mathematical expressions. (Note that regular Greek letters are also not given the Other_Math property, even though they obviously also occur in mathematical expressions.) For Euler Constant and Weierstrass elliptic function, this doesn't make a lot of sense, as these are explicitly mathematical characters, not characters that are also used in mathematical expressions. I have put in a formal proposal to add these two (2107 and 2118) to the list of characters with the math property. The Math property can be thought of as a hint that a particular symbol is specialized for mathematical usage; it isn't a property that any character that ever occurs in a mathematical expression needs to have. Nor is every character with the Math property only used in mathematical contexts. One way to look at this property is as a way to help detection of mathematical expressions in running text. Characters that are primarily used for mathematical purposes, or prominently used there, should be included. Characters that are heavily used in ordinary text, with non-mathematical uses should be excluded. A./
Re: ? Reasonable to propose stability policy on numeric type = decimal
On 7/26/2010 12:13 PM, Mark Davis ☕ wrote: I agree that having it stated at point of use is useful - and we do that in other cases covered by stability clauses; but we can only state it IF we have the corresponding stability policy. Mark, The statement in your 'but' clause really isn't correct. Writing 'A character is given/is assigned the X property if ...' is a type of statement that is made everywhere in the definitions of properties. For an example look no further than chapter 4 (Pairs of opening and closing punctuation are given their General_Category values...). Therefore, the principal difference between my proposed formulation and the current text (other than details of phrasing) is the 'only if' part. The 'only if' refers to the fact that Decimal_digit is currently not assigned for characters used as decimal digits that are out of order. Therefore, there's nothing in the proposed language that couldn't be stated right now for 6.0. If you want a stability guarantee on top of that, it's really easy to state *after* you've clarified the definition of Decimal_digit: The definition of Decimal_Digit will not change. *That* would be a proper stability guarantee. A./ PS: I'm, like John, rather skeptical about adding a formal item to the stability policies, but if a majority feels otherwise, I would strongly recommend first making a tight definition and, second, freezing that definition, rather than repeating the definition in the stability policies where it's hard to follow and out of context. Proposed text: // A character is given the decimal digit property if and only if it is used in a decimal place-value notation and all 10 digits are encoded in a single unbroken run, starting with the digit of value 0, in ascending order of magnitude.
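The proposed criterion can be checked mechanically against the character database; a sketch using Python's standard unicodedata module (the helper name is mine):

```python
import unicodedata

def contiguous_decimal_run(zero: str) -> bool:
    """True if the ten code points starting at `zero` carry decimal
    digit values 0..9 in ascending order -- the criterion in the
    proposed text above."""
    base = ord(zero)
    return all(unicodedata.digit(chr(base + v), None) == v
               for v in range(10))

print(contiguous_decimal_run("0"))        # ASCII DIGIT ZERO..NINE
print(contiguous_decimal_run("\u0966"))   # DEVANAGARI DIGIT ZERO..NINE
```
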
Re: Reasonable to propose stability policy on numeric type = decimal
The short answer to Karl's question is that there will not be an absolute guarantee. The long answer is that, partly for the reasons he's mentioned, this won't be a practical problem. A. Most of the living scripts that are in wide use have been encoded, including whatever digits are in use. B. People reviewing encoding proposals include programmers who would object to scattering digits. Thus, the only time this would be an issue is if there were some exceptional circumstances. And, as the name says, those circumstances could force an exception. If that happens there are two possible consequences: 1. The script in question is important enough that everybody will build exceptions into their conversion algorithms. 2. The script is so unimportant that its number system won't be supported (i.e. it's treated just like other text). So, for extending your computer language, there's no reason to hold up support for many important scripts just because of a hypothetical future exception. A./ PS: because I suspect more than one existing implementation is offset-based, there would already be tremendous pressure to prevent exceptions :) PPS: a very hypothetical tough case would be a script where letters serve both as letters and as decimal place-value digits, and with modern living practice. Having a policy like you suggest would officially make that unsupportable, but there are other cases, like the language that wanted to use the @ sign as a letter, that are de facto unsupportable with the modern infrastructure. My suspicion is that users of such a script would realize that their method is de facto unsupported (or unsupportable) and find some way to change their ways. Changing practices in the face of changing technology is something that happens all the time, not only to small communities - but that's an entirely new subject :)
Re: Reasonable to propose stability policy on numeric type = decimal
On 7/25/2010 6:05 PM, Martin J. Dürst wrote: On 2010/07/26 4:37, Asmus Freytag wrote: PPS: a very hypothetical tough case would be a script where letters serve both as letters and as decimal place-value digits, and with modern living practice. Well, there actually is such a script, namely Han. The digits (一、 二、三、四、五、六、七、八、九、〇) are used both as letters and as decimal place-value digits, and they are scattered widely, and of course there is a lot of modern living practice. Martin, you found the hidden clue and solved it, first prize :) They do not show up as gc=Nd, nor as numeric types Digit or Decimal. The situation is worse than you indicate, because the same characters are also used as elements in a system that doesn't use place-value, but uses special characters to show powers of 10. However, as I indicated in my original post, in situations like that, there are usually some changes in practice that took place. Much of the living modern practice in these countries involves ASCII digits. While the ideographic numbers are definitely still used in certain contexts, I've not seen them in input fields and would frankly doubt that they exist there. I would fully expect that they are supported as a number format for output, at least in some implementations, and, of course, that input methods convert ASCII digits into them. In other words, I wonder whether automatic conversion goes only one way for these numbers. I would suspect so, for the general case, but I don't actually know for sure. For someone in Karl's situation, it would be interesting to learn whether and to what extent he should bother supporting these numbers in his language extension. A./
Re: ? Reasonable to propose stability policy on numeric type = decimal
On 7/24/2010 3:00 PM, Bill Poser wrote: On Sat, Jul 24, 2010 at 1:00 PM, Michael Everson ever...@evertype.com wrote: Digits can be scattered randomly about the code space and it wouldn't make any difference. Having written a library for performing conversions between Unicode strings and numbers, I disagree. While it is not all that hard to deal with the case in which the characters for the digits are scattered about the code space, if they occupy a contiguous set of code points in order of their value, as they do, e.g., in ASCII, it simplifies both the conversion itself and such tasks as identifying the numeral system of a numeric string and checking the validity of a string as a number in a particular numeral system. It may well be that adopting such a policy is not realistic, but there would be advantages to it if it were. Bill, Michael is no programmer, hence he doesn't have first-hand understanding of why programmers distinguish between character set mapping (normally requiring look-up tables) and digit conversion (normally done by offset calculations). That said, there are enough programmers on the committees that scattered encoding of digits, while not prevented, is at least not the method of choice. The problem with making this a policy is that some scripts may not have a decimal place-value type number system (or such use is not documented) at the time of their encoding. That means a digit zero may not be known or documented. However, a prudent encoding policy would be to leave a gap in that case, because there have been scripts for which use of a decimal place-value system was later discovered. A./
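The offset calculation that programmers prefer, and the reason contiguity matters, can be sketched like this (function name and error handling are illustrative):

```python
def to_int(digits: str, zero: str) -> int:
    """Convert a digit string to an integer by subtracting the code
    point of the script's digit zero -- valid only when the ten digits
    form a contiguous ascending run (as in ASCII or Devanagari).
    Scattered digits would instead require a look-up table."""
    base = ord(zero)
    value = 0
    for ch in digits:
        d = ord(ch) - base
        if not 0 <= d <= 9:
            raise ValueError(f"{ch!r} is outside the digit run")
        value = value * 10 + d
    return value

print(to_int("1729", "0"))
print(to_int("\u0967\u0968\u0969", "\u0966"))  # Devanagari 1, 2, 3
```
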
Re: charset parameter in Google Groups
Andreas, I think we all understand your frustration with well-meaning software. Because tags can be wrong through no fault of the human originating the document, I fully understand that Google might want to attempt to improve the user experience in such situations. The problem is that doing so should not come at the expense of authors who correctly tag their documents and whose servers preserve their tags and don't mess with them. That your message was broken exposed a bug in Google's implementation. And that was acknowledged as well. I have not seen any design details of the algorithm that Google uses (when correctly implemented), so I can't comment on whether it strikes the correct balance between honoring tags in the normal case, where they should be presumed to be correctly applied, vs. detecting the case of clearly erroneous tags and doing something about them so users aren't stuck when documents are mis-tagged. However, in principle, I support the development of tools that can handle imperfect input - after all, you as a human reader also don't reject language that isn't perfectly spelled or that violates some of the grammatical rules. There's a benefit to these kinds of tools, but, as you keep reminding us, there's a cost (which needs to be minimized). This cost is similar to that of a spell-checker rejecting a correctly spelled word. Still, we are better off with them than without them. For that reason, I think you will find few takers for your somewhat absolutist position, whereas you would get more sympathy if you were simply reminding everyone of the dangers of poorly implemented solutions that can break correctly tagged data. A./
Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)
On 6/28/2010 11:38 AM, Mark Davis ☕ wrote: The problem with slavishly following the charset parameter is that it is often incorrect. However, the charset parameter is a signal into the character detection module, so if the charset is correctly supplied in the message, then the results of the detection will be weighted in that direction. The weighting factor / mechanism may be something that you might look at for possible improvement. Doug raised an interesting argument, i.e. that some values of a charset parameter might have a higher probability of being correct than other values. If something is tagged Latin-1 or Windows-1252, the chances are that this is merely an unexamined default setting. Most of the other 8859 values should be much less likely to be such blind defaults. I wonder whether the probability of successful charset assignment increases if you were to give these more specific charset values a higher weight. When I played with simple recognition algorithms about 15 years ago, I found that some simple methods for crude language detection gave signatures that would allow charset detection. Even though these methods weren't sophisticated enough to resolve actual languages (esp. among closely related languages), they were good enough to narrow things down to the point where one could pick or confirm charsets. For example, significant stretches of German can be written without diacritics, and can fool charset detection unless it picks up on the statistical patterns for German. With that in hand, the first non-ASCII character encountered is then likely to nail the charset. Or, absent such a character, the statistics can be used to confirm that an existing charset assignment is plausible. (8859-15, having been deliberately designed to be undetectable, is the exception, unless there's a Euro sign in the scanned part of the document...) A./
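Doug's idea of tag-dependent trust could be modelled as a prior that is blended with the detector's score. Everything below (names, weights, and the blending formula) is invented purely for illustration; it is not how Google's detector actually works:

```python
# Illustrative only: a prior on how trustworthy each declared charset
# tag is, blended with the score of a statistical content detector.
TAG_PRIOR = {
    "iso-8859-1": 0.3,    # frequently an unexamined blind default
    "windows-1252": 0.3,  # likewise
    "iso-8859-2": 0.8,    # a specific choice, more likely deliberate
}

def blended_score(declared_tag: str, detector_score: float) -> float:
    """Weight the hypothesis 'the declared tag is right' by its prior,
    and the detector's independent evidence by the remainder."""
    prior = TAG_PRIOR.get(declared_tag.lower(), 0.5)
    return prior + (1.0 - prior) * detector_score

print(round(blended_score("ISO-8859-2", 0.4), 2))
print(round(blended_score("ISO-8859-1", 0.4), 2))
```

With the same detector confidence, the more specific tag ends up with the higher combined score, which is exactly the asymmetry Doug described.
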
Re: Latin Script
I'd like to second Mark. There is a lot of information in the Standard, including the UAXs and the Unicode Character Database, that would help answer your questions. The volunteers associated with the Unicode effort have worked hard putting all that information together - so use it, instead of taking up their time in repeating it all in personal answers to you. A./ On 6/28/2010 9:37 PM, Mark Davis ☕ wrote: See the following for the (/many/) differences between characters with the Latin script, and those with LATIN in their names. http://unicode.org/cldr/utility/unicodeset.jsp?a=\p{script:latin}b=\p{name:/LATIN/} http://unicode.org/cldr/utility/unicodeset.jsp?a=%5Cp%7Bscript:latin%7Db=%5Cp%7Bname:/LATIN/%7D I'd suggest taking a more focused approach to learning about the standard, rather than trying relatively scattershot questions to this list. You might read through at least the first 3 chapters of the Unicode Standard, plus the Scripts UAX. These are all online for free at unicode.org http://unicode.org. Mark — Il meglio è l’inimico del bene — On Mon, Jun 28, 2010 at 20:55, Tulasi tulas...@gmail.com mailto:tulas...@gmail.com wrote: Looks like Unicode did not create any name for any Latin letter/symbol with LATIN in its name :-') Am I correct? Is there a mailing list for ISO/IEC ? I don't think it's necessary to post these glyphs to the public list. Better to do like Edward Cherlin, i.e., type the symbol after the name. e.g., LATIN SMALL LETTER PHI (ɸ) That way an illiterate like me can quickly see the letter/symbol along with its name, without additional research. The merger between Unicode and ISO 10646 caused a few character names in Unicode to be changed to match the 10646 names. May I know these letters/symbols with names please?
Tulasi PS: Thanks Doug, especially for posting the links From: Doug Ewell d...@ewellic.org mailto:d...@ewellic.org Date: Sun, 27 Jun 2010 16:09:41 -0600 Subject: Re: Latin Script To: Unicode Mailing List unicode@unicode.org mailto:unicode@unicode.org Cc: Tulasi tulas...@gmail.com mailto:tulas...@gmail.com Tulasi tulasird at gmail dot com wrote: U+00AA FEMININE ORDINAL INDICATOR (which does not contain LATIN) is considered part of the Latin script, while U+271D LATIN CROSS (which does) is considered common to all scripts. Can you post both symbols please, thanks? I can point you to http://www.unicode.org/charts/PDF/U0080.pdf , which includes a glyph for U+00AA, and http://www.unicode.org/charts/PDF/U2700.pdf , which includes a glyph for U+271D. I don't think it's necessary to post these glyphs to the public list. Trying to know who among ISO and Unicode first created the names' list for Latin-script is not an indication of obsession :-') So among Unicode and ISO/IEC, who first created ISO/IEC 8859-1 ISO/IEC 8859-2 letters/symbols names with each name with LATIN in it? Most of the characters in the various parts of ISO 8859 were originally standardized before Unicode or ISO 10646, so the names were probably either created by the ISO/IEC subcommittees responsible for those parts, or found in earlier standards and adopted as-is. The merger between Unicode and ISO 10646 caused a few character names in Unicode to be changed to match the 10646 names. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
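The two examples Doug cites can be verified with Python's standard unicodedata module, which exposes the character names (the Script property itself lives in the UCD's Scripts.txt, not in this module):

```python
import unicodedata

# U+00AA has Script=Latin although LATIN is absent from its name;
# U+271D has LATIN in its name although its Script is Common.
for cp in (0x00AA, 0x271D):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
```
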
Re: Generic Base Letter
The one argument that I find convincing is that too many implementations seem set to disallow generic combination, relying instead on fixed tables of known/permissible combinations. In that situation, a formally adopted character with the clearly stated semantic of 'is expected to actually render with ANY combining mark from ANY script' would have an advantage. List-based implementations would then know that this character is expected to be added to the rendering tables for all marks of all scripts. Until and unless that is done, it couldn't be used successfully in those environments, but if the proposers could get buy-in from a critical mass of vendors of such implementations, this problem could be overcome. Without such buy-in, by the way, I would be extremely wary of such a proposal, because the table-based nature of these implementations would prohibit the use of this new character in the intended way. A./
Re: Indian Rupee Sign to be chosen today
On 6/26/2010 5:41 PM, Doug Ewell wrote: Regarding the inability to distinguish 8859-15 heuristically from 8859-1, I understand the problem when there are no tags or other hints, or for cases like Windows-1252 text declared to be 8859-1, but it seems unlikely to me that there is much text encoded in 8859-1 (or Windows-1252) that is tagged as 8859-15. I would think in a case like that, it might make sense to trust the tag. I suspect the problem of unreliable declarations is greater for most other tuples of (declared-encoding, actual-encoding). Doug, this is an interesting concept, i.e. that the reliability of the tag being correct might well depend on the value of the tag. I wonder whether that type of probability is being considered at all when making the decision to trust auto-recognition over tag value. A./
Re: Latin Script
On 6/17/2010 7:24 PM, Tulasi wrote: What is equivalent ISO/IEC ISO/IEC what? There are hundreds of ISO/IEC standards, of which dozens are character encoding standards. for U+0278 LATIN SMALL LETTER PHI (ɸ)? Or do Unicode ISO/IEC use different number name for same letter/symbol? ISO/IEC 10646 uses the same number and name as Unicode for this. A./
Re: Writing a proposal for an unusual script: SignWriting
On 6/14/2010 1:18 PM, Mark E. Shoulson wrote: On 06/14/2010 02:15 PM, Asmus Freytag wrote: On 6/14/2010 9:21 AM, Stephen Slevinski wrote: Plain text SignWriting should be able to write actual sign language, such as hello world. You could equally well insist that it should be possible to express the opening bar of twinkle, twinkle little star in plain text, or write the square root of the inverse of a plus b in plain text. In both cases, you would be disappointed and find that a markup language is required, such as MathML, although specifically for math, it is possible to devise an extremely lightweight markup language that comes close to plain text. It is all too tempting and too easy for discussions of Why X Should be Encoded in Unicode to devolve into Why X is So Incredibly Useful. In this case, I don't think that's the point. Correct, we were not discussing that question. Unlike some other proposals, I think it is clear (to me, anyway) that SignWriting has a fairly solid user base and also an important use (transcribing signed languages, which don't really have too many other ways of being transcribed. Things like HamNoSys are also not encoded yet). Mark (Davis) raised the good point that this needs to be substantiated - for now, for the purposes of this discussion, I have taken the above as a given. Here, the question is more a matter of given that SignWriting is nifty, does it qualify as plain text? That is the central question. Or even Does the way SignWriting does its thing map well to the way Unicode does things? I tried to explain that these are nearly equivalent. A practical definition of plain text could be: text encoded as a stream of Unicode characters, with no other information. However, there are other definitions of plain text based on the ideal concept of the thing, and the two don't overlap 100%. Both are useful. If it does not (and cannot be made to do so), then no matter how useful SignWriting is, it may simply not be encodable.
It's not because it doesn't deserve to be, and yes, that would really be a bummer because it would relegate signed languages to second-class, but Unicode has its limitations, and SignWriting may well be beyond its capabilities. That's where my insistent questions about a layered system come in. One where the elements (symbols) are encoded in Unicode, but where some or all the details of their relation are encoded in a higher level protocol. I suspect that the XML attempts that exist do not implement a correct layering, that is, they probably encode the identity of the symbols not as character codes but as named entities. That would explain why Steve said same data, only more complex. (That said, I find myself thinking that it *should* be possible to align Unicode and SignWriting. But I recognize that it might not be.) As long as the position of the proponents is that all fine details of formatting and layout must be carried in the character encoding level, I'm not hopeful. Not all streams of concrete small integers are ipso facto plain text, even though you can map these integers to the private use space. I guess you would need to establish a distinct and independent meaning for each code-point, which would have to be something more specific than ...and then you give the x-coordinate. Generic placement operators I could possibly fathom, since they serve to linearize the text - an analogy would be the Ideographic Description Symbols that allow description of a two dimensional layout. But the IDS stop short of trying to express the subtle modifications that arise out of the context and placement of the elements in the final ideograph. For that you have to turn to another source, in this case a font. For the future, I am considering a browser plugin that will detect and render SignWriting character data. A regular expression could scrape the appropriate PUA characters. Another regular expression could validate that the characters represent valid structures.
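The scraping step Steve describes might look roughly like this. The Plane-15 range below is purely hypothetical, chosen only for illustration, and is not an actual SignWriting PUA assignment:

```python
import re

# Hypothetical: assume SignWriting symbols sit in a Plane-15 PUA block.
SIGN_RUN = re.compile(r"[\U000F0000-\U000F8FFF]+")

sample = "before \U000F0001\U000F0002 after"
runs = SIGN_RUN.findall(sample)
print(runs)
```

A second pattern, or a small state machine, would then check each extracted run against the structural rules (symbol, placement, sign boundary) before handing it to the renderer.
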
Then the SignWriting display could be built using individual symbols, completed signs, or entire columns. In other words, a layout engine. Is there such a thing as SignWriting without a layout engine? I guess the same question could be asked about musical notation (though I think it probably could have been coded as plain text; see also http://abcnotation.com/ for a very powerful musical notation using only ASCII, but decidedly *not* plain-text in nature). The point is, because one already requires a layout engine (or browser plug-in), one might as well use something like MathML in conjunction with standard character codes for the basic symbols. If SignWriting cannot be successfully used except with 2 fonts, then I see little need for standardizing the code. What you describe is a private use scheme, even though the private group may have many members. I'm not sure I agree with this. Just because only two fonts are out
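The "regular expression could scrape the appropriate PUA characters" idea from the thread above can be sketched as follows. This is a hypothetical illustration only: the code point range U+E000..U+F8FF is simply the BMP Private Use Area, and the function name is invented here; neither reflects the code points or API of any actual SignWriting implementation.

```python
import re

# Find maximal runs of BMP Private Use Area characters (U+E000..U+F8FF),
# the kind of range a PUA-based SignWriting scheme might occupy.
PUA_RUN = re.compile(r"[\uE000-\uF8FF]+")

def find_pua_runs(text):
    """Return each maximal run of PUA characters embedded in text."""
    return [m.group(0) for m in PUA_RUN.finditer(text)]

# PUA runs interleaved with ordinary text are picked out cleanly.
sample = "abc" + "\uE000\uE001" + "def" + "\uE123"
print(find_pua_runs(sample))  # two runs: one of length 2, one of length 1
```

A second expression, as Steve suggests, could then check that each extracted run forms a structurally valid sign; that validation logic is not sketched here.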
Re: Tamil u,uu matra consonants - Orthographic variation
Can we stop double posting on the Unicode and Unicore lists? People on the unicode list cannot reply to people on the other list, and vice versa (unless they happen to be members of both lists). Thanks. A./
Re: Questionable lines on LineBreakTest.txt
On 6/7/2010 4:26 PM, Masaaki Shibata wrote: I'm studying UAX #14 (5.2.0) and testing my code against LineBreakTest.txt. I found that some test cases in this file seem to contradict the rules in the document. For example, LB25 explicitly prohibits breaking between CP and PO, while LineBreakTest.txt says ÷ [0.2] RIGHT PARENTHESIS (CP) ÷ [999.0] PERCENT SIGN (PO) ÷ [0.3] (l. 1137). I'm not a Unicode expert; which rules lead to a result like this? Did I miss any important descriptions in the document? Probably not. The test file has been known to be wrong before. The spec clearly states that breaks are only allowed if there are spaces, as in: CP SP+ ÷ PO So this line in the test file appears incorrect. A./
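The point under discussion can be illustrated with a toy checker. This is a deliberate simplification covering only the one pair at issue: under UAX #14's LB25, a break between CP (close parenthesis) and PO (postfix numeric, e.g. "%") is prohibited unless spaces intervene. A real implementation must handle all line-break classes and rules; the function name here is invented.

```python
# Toy two-class line-break checker for the CP/PO case from LB25.
# Not a UAX #14 implementation; it only models the pair discussed above.
def break_allowed(prev_class, spaces_between, next_class):
    if prev_class == "CP" and next_class == "PO":
        # Direct CP x PO is forbidden; intervening spaces enable a break.
        return spaces_between > 0
    # All other pairs are out of scope for this sketch.
    return True

print(break_allowed("CP", 0, "PO"))  # False: ")%" must not be broken
print(break_allowed("CP", 1, "PO"))  # True: ") %" may break after the space
```

On this reading, the quoted test-file line, which allows a break between CP and PO with no space between them, contradicts the rule.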
Re: Least used parts of BMP.
On 6/4/2010 8:34 AM, Mark Davis ☕ wrote: In a compression format, that doesn't matter; you can't expect random access, nor many of the other features of UTF-8. The minimal expectation for these kinds of simple compression is that when you write a string with a particular /write/ method, and then read it back with the corresponding /read/ method, you get exactly the original string contents back, and you consume exactly as many bytes as you had written. There are really no other guarantees. Actually, SCSU makes an additional guarantee, which is that you can edit the compressed string. In other words, you can insert a substring such that the new string remains a valid compressed string and the parts preceding and following the insertion, when read, match the corresponding portion of the original after decoding. I remember that this was an important design criterion for the precursor RCSU. Their implementation required the ability to deliver a patch to a compressed string, something that isn't possible with many other compression formats. So there is a sliding scale in features, each compression method being designed to address the specific requirements of a given application. A./ Mark — Il meglio è l'inimico del bene (The best is the enemy of the good) — On Fri, Jun 4, 2010 at 06:35, Otto Stolz otto.st...@uni-konstanz.de mailto:otto.st...@uni-konstanz.de wrote: Hello, On 2010-06-03 07:07, Kannan Goundan wrote: This is currently what I do (I was referring to this as the compact UTF-8-like encoding). The one difference is that I put all the marker bits in the first byte (instead of in the high bit of every byte): 0xxx 10xx xyyy 110x xxyy yzzz The problem with this encoding is that the trailing bytes are not clearly marked: they may start with any of '0', '10', or '110'; only '111' would mark a byte unambiguously as a trailing one. 
In contrast, in UTF-8 every single byte carries a marker that unambiguously identifies it as either a single ASCII byte, a starting byte, or a continuation byte; hence you do not have to go back to the beginning of the whole data stream to recognize, and decode, a group of bytes. Best wishes, Otto Stolz
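The self-synchronization property Otto describes can be demonstrated in a few lines: every UTF-8 byte announces its own role from its top bits alone, so a decoder can classify any byte in isolation. The classifier function name is ours.

```python
# Classify a single UTF-8 byte by its high bits, with no context needed:
#   0xxxxxxx -> single ASCII byte
#   10xxxxxx -> continuation (trailing) byte
#   11xxxxxx -> lead byte of a multi-byte sequence
def utf8_byte_role(b):
    if b < 0x80:
        return "ASCII"
    if b < 0xC0:
        return "continuation"
    return "lead"

data = "é".encode("utf-8")  # b'\xc3\xa9'
print([utf8_byte_role(b) for b in data])  # ['lead', 'continuation']
```

In the scheme Kannan describes, this per-byte classification is impossible, because a trailing byte may begin with '0', '10', or '110', exactly like a standalone byte or a lead byte.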
Re: Greek letter LAMDA?
On 6/1/2010 6:04 PM, Mark Crispin wrote: I don't think that the unicode list should be used for the type of questions that have polluted it recently. That list (unicode@unicode.org) is open for general questions. It has no formal standing as far as the business of the Consortium is concerned, and many core UTC members are NOT on this list, because it attracts general questions etc. A./ PS: and if you've forgotten, one does need to subscribe to the list in order to post, so it already fits your definition of members-only.
Re: Least used parts of BMP.
On 6/1/2010 8:04 PM, Kannan Goundan wrote: I'm trying to come up with a compact encoding for Unicode strings for data serialization purposes. The goals are fast read/write and small size. Why not use SCSU? You get the small size, and the encoder/decoder aren't that complicated. You get the additional advantage that many years in the future, if data serialized with your scheme are found on an old hard disk, someone has a chance to read them, because SCSU is well documented (see UTS #6). A./
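The core idea behind SCSU's compactness can be illustrated with a one-window toy. This is emphatically not SCSU (UTS #6 defines multiple dynamic windows, tag bytes to switch between them, and a Unicode mode); it only shows why text in a small alphabet compresses to roughly one byte per character. The window origin and function names are assumptions of this sketch.

```python
# One-window toy: ASCII passes through as bytes 0x00..0x7F, and one
# 128-character "window" (here anchored at U+0380, covering Greek)
# maps to bytes 0x80..0xFF. Real SCSU switches windows dynamically.
WINDOW = 0x0380  # window origin (illustrative choice: Greek block)

def toy_compress(s):
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)                    # ASCII pass-through
        elif WINDOW <= cp < WINDOW + 0x80:
            out.append(0x80 + (cp - WINDOW))  # one byte per window char
        else:
            raise ValueError("outside the toy's single window")
    return bytes(out)

def toy_decompress(b):
    return "".join(chr(x) if x < 0x80 else chr(WINDOW + x - 0x80) for x in b)

s = "abc αβγ"
enc = toy_compress(s)
print(len(enc), len(s.encode("utf-8")))  # 7 vs 10 bytes
assert toy_decompress(enc) == s
```

Mixed ASCII-plus-one-script text, the common case Asmus mentions, stays at one byte per character, which is where SCSU's size win comes from.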
Re: Greek letter LAMDA?
On 6/2/2010 11:46 AM, Jonathan Rosenne wrote: Although this mail was not addressed to me, I did read it. Sue me. The terms of use for the Unicode mail list essentially state that these types of boilerplate are null and void as far as Unicode is concerned. You will find the following in http://www.unicode.org/policies/mail_policy.html Disclaimer: E-mail submitted to any of our e-mail lists which contains disclaimers of confidentiality or reservation of copyright, or similar, will be treated as if these disclaimers were not present, and neither the Consortium nor the users of our e-mail lists shall be liable for their use of the information in the e-mail under this policy. It is up to the submitter to ensure that no confidential or otherwise restricted information is sent to any e-mail list. As you can see, they have no grounds to sue you. :) A./ Jony -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of John Dlugosz Sent: Wednesday, June 02, 2010 5:03 PM Cc: unicode@unicode.org Subject: RE: Greek letter "LAMDA"? Robert Abel noted: Note that as of 1993, the only "LAMDA" or "LAMBDA" characters in the standard were: 039B;GREEK CAPITAL LETTER LAMDA;Lu;0;L;;;;;N;GREEK CAPITAL LETTER LAMBDA;;;03BB; 03BB;GREEK SMALL LETTER LAMDA;Ll;0;L;;;;;N;GREEK SMALL LETTER LAMBDA;;039B;;039B 019B;LATIN SMALL LETTER LAMBDA WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER BARRED LAMBDA;;;; So why was 019B spelled differently from the other two, originally? TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ GS: TRAD) of three operating subsidiaries, TradeStation Securities, Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies, Inc., a trading software and subscription company, and TradeStation Europe Limited, a United Kingdom, FSA-authorized introducing brokerage firm. None of these companies provides trading or investment advice, recommendations or endorsements of any kind. 
The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.
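The records quoted in John's message are lines from UnicodeData.txt, whose semicolon-separated fields include both the current character name and the old Unicode 1.0 name. A minimal parse of the fields relevant to his question (the field layout follows the Unicode Character Database documentation; the helper name is ours, not an official API):

```python
# Parse one UnicodeData.txt record into the fields relevant here.
def parse_unicodedata_line(line):
    fields = line.split(";")
    return {
        "code": fields[0],           # code point, hex
        "name": fields[1],           # current (immutable) character name
        "category": fields[2],       # general category, e.g. Lu/Ll
        "unicode1_name": fields[10], # old Unicode 1.0 name, if different
    }

rec = parse_unicodedata_line(
    "039B;GREEK CAPITAL LETTER LAMDA;Lu;0;L;;;;;N;"
    "GREEK CAPITAL LETTER LAMBDA;;;03BB;"
)
# The modern name drops the B; the Unicode 1.0 name kept it.
print(rec["name"])           # GREEK CAPITAL LETTER LAMDA
print(rec["unicode1_name"])  # GREEK CAPITAL LETTER LAMBDA
```

Field 10 is exactly where the quoted "GREEK CAPITAL LETTER LAMBDA" spelling lives, which makes the LAMDA/LAMBDA divergence between name generations easy to spot mechanically.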
Re: Greek letter LAMDA?
On 6/2/2010 3:28 PM, John Dlugosz wrote: If anyone can “null and void” it, I wonder why companies bother to put such things in people’s outgoing mail. I would have thought they could come up with a proper netiquette version, but they just don’t care. These things are bogus, because they get appended automatically to all messages leaving certain mailers, independent of the nature of the message. I wouldn't be surprised if they are hard to enforce, but I'm not a lawyer. The Unicode list can certainly set its own conditions for participation, and because you have to sign up, I'd rate the chance that Unicode can enforce its rules on participants rather high. Therefore, anyone sending messages with funny legal mumbo-jumbo is put on notice beforehand that it will not be respected. If they go ahead and send it anyway, that's their choice, but they'd have a tough time arguing that they could have a reasonable expectation that it would be honored. So, I think, in the case of a mail list like this, you can actually get away with declaring these things null and void. Cheers, A./ PS: we should stop with this topic, because that's not what this list is for.
Re: Least used parts of BMP.
SCSU is a pass-through for ASCII, plus it handles the common mix of ASCII plus 96 local characters (Latin-1, Greek, Cyrillic, Thai, etc.) really fast. Go look at the sample code. If you take that as a starting point for optimization, I think you'll be fine.
Re: Greek letter LAMDA?
On 6/1/2010 1:37 PM, John Dlugosz wrote: Why does the code chart call the plain Greek letter (upper and lower case) “LAMDA” rather than “LAMBDA”? The latter is used in other places where a glyph is based on the lambda, e.g. “U+019B LATIN SMALL LETTER LAMBDA WITH STROKE” Names sometimes don't use the best spellings, and because they can't be changed, any spelling issues discovered after the first encoding can't be fixed. Make sure you don't use the Standard as a spelling reference. A./
Re: Greek letter LAMDA?
On 6/1/2010 4:14 PM, Mark Crispin wrote: Is it really necessary to have this sort of pedagogical discussion on the Unicode list? Is this character name misspelled? Is Unicode a for-profit company? Who owns the Unicode font? etc. etc. Perhaps we need to have a unicode-qu...@unicode.org for novices to ask questions, and make this list members-only? There is a members-only list for in-depth technical discussions. Are you a member? A./
Re: Unicode Inc
On 5/31/2010 12:33 PM, Tulasi wrote: Thanks Mark for posting the links! My posting was based on http://www.unicode.org/consortium/directors.html where at the bottom it said Unicode Inc. Looks like the elected members from the consortium http://www.unicode.org/consortium/consort.html form Unicode Inc. Am I correct? Not really. The members of the consortium are other organizations, usually corporations. Each organization is represented by people (delegates). The delegates are not members of the consortium, but merely people who represent each member organization. Each organization normally gets one vote, even though most send two delegates. In this link http://www.unicode.org/consortium/consort.html it looks like there are fewer than 100 members in the consortium. How many members currently do you have who can vote to elect? Link http://www.unicode.org/consortium/consort.html says the Unicode consortium is a non-profit organization. Recently I purchased Windows 7 from Microsoft Corporation. This product has Unicode fonts for a number of languages. But Microsoft Corporation is for profit. So it looks like Unicode Inc is for profit through its elected officials, but the Unicode consortium is non-profit. Am I correct? No. Unicode, Inc. (The Unicode Consortium) is a non-profit organization. That means it must meet certain legal requirements and restrictions in how it is funded and operated. The same requirements do not apply to its membership. Specifically, both for-profit and not-for-profit organizations may be members of the Consortium. There is nothing unusual about the fact that the for-profit status of the members is unrelated to the non-profit nature of the Consortium. That's essentially the case for all non-profit organizations. I still do not understand: What is the role of this Director exactly? I think you are asking very basic questions about how a non-profit corporation is organized. 
Rather than continuing this discussion in great detail here on a list that is intended for character encoding questions, you might start by reading up on basic background, for example in the Wikipedia: http://en.wikipedia.org/wiki/Non-profit_organization and http://en.wikipedia.org/wiki/Board_of_Directors A./ Respectfully, Tulasi From: Mark Davis m...@macchiato.com Date: Fri, 28 May 2010 09:14:00 -0700 Subject: Re: Unicode Inc To: Tulasi tulas...@gmail.com Cc: Unicode Discussion unicode@unicode.org See http://www.unicode.org/consortium/consort.html. The consortium is constituted according to its bylaws: http://unicode.org/consortium/unicode-bylaws.html Roughly, it is constituted by its membership: http://www.unicode.org/consortium/memblogo.html, which elects the directors yearly. The officers report to the directors, and are responsible for the running of the consortium. The technical work is delegated to the technical committees, which operate according to their procedures. The background of the officers and directors can be found on http://www.unicode.org/consortium/directors.html. For a historical view, see http://www.unicode.org/history/boardmembers.html. Mark On Thu, May 27, 2010 at 17:32, Tulasi tulas...@gmail.com wrote: I am new to this group. I am browsing http://www.unicode.org/consortium/directors.html It looks like Unicode Inc is formed by Google, Inc., Microsoft Corporation, IBM Corporation, and Apple. Have I understood correctly? Also it looks like Unicode Incorporated has one director to represent the whole of Asia, while Asia has more languages than any continent. What is the role of this Director exactly? Respectfully, Tulasi
Re: IS UNICODE a STANDARD ?
On 5/31/2010 2:12 PM, V. M. Kumaraswamy wrote: Hello all, Just a clarification on UNICODE. Is UNICODE a STANDARD Yes, Unicode (The Unicode Standard) is indeed a standard. And no, the use of ALL CAPS is discouraged. The proper spelling is Unicode. that needs to be followed by all COUNTRIES ? There's no requirement for anyone to be conformant to the Unicode Standard. However, if you decide to claim conformance, there are specific requirements that you must meet, and they are defined in the Standard. Is UNICODE a CONSORTIUM to make certain guidelines that need to be followed for CERTAIN CHARACTERISTICS ? Yes, Unicode (The Unicode Consortium) is indeed a consortium. If you just use Unicode as a shorthand, you need to rely on the context of your communication to allow readers to understand whether you mean the Standard or the Consortium. The Unicode Consortium is the publisher of The Unicode Standard as well as several other technical standards. As with the Unicode Standard, there is no requirement that you support these standards. But if you decide to claim conformance to any of them, there are specific requirements that you must meet. Hope this makes the situation clearer. A./ This is just to get some input from all of you. Thanks Sincerely V. M. Kumaraswamy