Re: Missing geometric shapes
On 11/9/2012 1:26 AM, Jean-François Colson wrote:
> For a five-level rating, ○ ◔ ◑ ◕ ● could do the job.

Yes, it's possible to use other sets of symbols to indicate a rating, but in such uses Unicode would not encode the semantics of "rating" but those of "star". The deeper semantics are a matter of convention, not unlike the question of whether y is a vowel or a consonant (yo-yo), which is settled by convention between writer and reader.

A./
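As an aside, the convention is trivial to operationalize once the symbols exist; here is a minimal Python sketch (the function name and the 0-4 scale are invented for the illustration) that maps a five-level rating onto the circles quoted above:

    # Minimal sketch: a five-level rating rendered with U+25CB, U+25D4,
    # U+25D1, U+25D5, U+25CF. The "rating" meaning is pure convention
    # between writer and reader; Unicode encodes only the shapes.
    LEVELS = "○◔◑◕●"

    def rating_symbol(level: int) -> str:
        if not 0 <= level <= 4:
            raise ValueError("level must be in 0..4")
        return LEVELS[level]

    print(" ".join(rating_symbol(i) for i in range(5)))  # ○ ◔ ◑ ◕ ●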
Re: Missing geometric shapes
On 11/9/2012 5:53 PM, Philippe Verdy wrote:
> Why then stars? Any symbol, even any Unicode letter, could be repeated and half-filled.

There's nothing magical about limiting the half-filled geometrical shapes to the current (haphazard) set. If half-filled stars can be documented, they are legitimate targets for encoding. If someone later documents half-filled pentagons, again, the case would be decided on the merits. I really hate the speculation on this list about notational conventions -- the rule should be: if notational conventions exist, and can be documented, the characters needed for them should be eligible to be considered.

> Even logos (I've seen Apple logos used this way)

Logos are ineligible for other reasons, and that puts them out of discussion here.

> or pictograms (I've seen …

Most of these graphics are simply used in repetition. Only shapes that lend themselves to being half-filled will show up in use. Now, Unicode has recently introduced the innovation of formally encoding variation sequences for emoji-style symbols - expressing a desire to explicitly represent a unification of certain basic shapes with precisely equivalent fancy renditions of the same. In the current instance that could mean (by extension) that at some point various fancy renditions of stars are officially unified with the plain stars (by adding a similar variation sequence). Fancy star symbols that I have seen include those that are colored instead of black, or that have a colored background (on a per-symbol basis, not a text background like highlighting).

> Even today, using the existing Unicode WHITE STAR character allows styling it to render an empty, full, or partially filled star.

There's clear precedent that Unicode views white/black/partially filled as a distinction on the character level (this is definitely the case for several types of geometrical symbols - witness circles and squares). Using styles to achieve that effect is possible (lots of things are possible), but it would be a violation of the character/glyph model to achieve such a distinction by style when it is present on the character level. The precedent here clearly speaks in favor of recognizing half-filled stars, likewise, as a distinction on the character level.

> If you start encoding a document using uncommon characters, automated Braille or aural readers won't know what to do with them...

I think this argument is a red herring.

> For me all the graphical substitutions of numeric figures are NOT plain text, they are presentational features for visual rendering, ...

The fact that you can think of a series of symbols as representing their count doesn't make a series of symbols merely a numeric representation of that count. But even if one were to take this view: Unicode painstakingly encodes characters for the different representations of digits, instead of relying merely on styles and glyphs to handle the representation of numbers. So, you see, even here the precedent goes the other way.

A./
Re: The rules of encoding (from Re: Missing geometric shapes)
On 11/9/2012 7:14 PM, Philippe Verdy wrote: 2012/11/9 Asmus Freytag:
>> Actually, there are certain instances where characters are encoded based on expected usage. Currency symbols are a well-known case for that, but there have been instances of phonetic characters encoded in order to facilitate the creation and publication of certain databases for specialists, without burdening them with instant obsolescence (if they had used PUA characters).
> But work is still being performed to implement the characters and start using them massively, even if they are not encoded.

I think this entire line of discussion is rather drifting into irrelevant details. Yes, I agree that it should matter whether serious resources have been committed in support of a new symbol or new piece of notation. That forms part of the evidence that marks some of these exceptional cases as viable standardized characters - despite the lack of prior, widespread use. That, somehow, was my point.

However, I find it pointless to speculate about the details. Exceptions are exceptions, and the most important issue is to reserve the flexibility to deal with them when they arise. After they have arisen, they are best dealt with on a case-by-case basis (or, in the case of currency symbols, we now have an entire category for which there is consensus that it merits exceptional treatment).

A./
Re: Missing geometric shapes
On 11/11/2012 2:08 PM, Doug Ewell wrote:
> Personal opinions follow. It looks like the only actual use case we have, exemplified by the xkcd strip, is for a star with the left half black and the right half white. There *might* also be a case for the left-white, right-black star.

Precedent is for encoding these in pairs, and if there were any doubts about the wisdom of this, Simon Montagu's mail illustrates the bidi ramifications (thanks to Frédéric Grosshans for the reminder). So, let's not prevaricate any longer and admit we have a prima facie use case for the pair.

> Everything else, including one-quarter and three-quarter stars, rendering tomatoes or doughnuts or film reels as glyph variants of stars, facilitating a right-to-left rating system for Arabic- or Hebrew-speaking environments, or turning Unicode into a standard for rating systems in general, is a complete flight of fancy.

Flights of fancy, indeed. I couldn't have said it better.

> I think in this case, as in many others, one introductory, exploratory proposal would be worth ten thousand speculative mailing-list posts.

You said it.

A./
Re: Missing geometric shapes
On 11/11/2012 4:50 PM, Philippe Verdy wrote: 2012/11/12 Kent Karlsson:
>>> rendering tomatoes or doughnuts or film reels as glyph variants of stars,
>> They should certainly **NOT** be treated as glyph variants of stars! Ever!
> Who said that? NOT me. If you think so, this is a misinterpretation of what I said.

You wrote so many things that it's impossible to be sure what you said. :)

A./
Re: Missing geometric shapes
On 11/11/2012 8:47 PM, Philippe Verdy wrote:
> No, I was clear throughout, using the same arguments, that encoding things for the purpose of representing empty, full, or half filled, as if it were a numeric gauge, was a bad idea.

Trying to encode a gauge is indeed a losing proposition.

> When I spoke about the various representations of gauges (including with photos) it was just to demonstrate that this is a domain where designers and authors are extremely creative, and there's absolutely no standard way of doing things right, as each representation is a pure local decision.

However, there's no argument that stars are used as symbols, including half-filled ones. Stars are part of our family of geometrical shapes, and those shapes also have many members that are partially filled. There's no reason to pass judgment on why people might be using stars.

> Just consider the case of the classification of hotels and campsites: they are just given an integer number of stars, and whether these stars are white (hollow/transparent filling), black (completely filled), multicolor, or even half filled does not change the classification.

And this is where the discussion leaves the plane of encoding and veers into the realm of orthography. Orthography, loosely understood in its wider sense, is the realm of conventional use of written symbols. Orthographies associate conventional meaning with symbols and sequences of symbols (not just letters and words, but also punctuation marks, etc.). Unicode's role has to be strictly limited to providing the building blocks.

> Now if you think about half-filled stars, there is also the case of half-cut stars (left side or right side shown) to likewise represent half units. Or stars with only 1 to 4 branches filled, with variation in the position where branches are cut: in the middle of a branch, creating a thinner triangle, or between branches (which are kept as complete diamonds extending up to the center). Variations as well in the number of branches of the star itself.

Correct - there are many designs for stars, and variations of those designs. And also correct, there is at some point a limit where you don't need a standardized encoding for all of these, because at some point things will get so specialized that few users will be able to benefit from this standardization. However, the half-filled, five-pointed stars are garden-variety type symbols, and, as I keep pointing out, they absolutely fall within the scope of geometrical symbols for which there is ample precedent supporting both plain-text usage as well as a standardized encoding. The suggested characters (they haven't actually been formally proposed yet) would in no way push the envelope.

(skipping over lots of text that I think is not very relevant)

> We should only encode characters that users would reliably draw manually using a quill or roller pen, independently of color, or of the width of the tool used to draw strokes, or possibly to fill them; basic orientation of glyphs, however, will be a candidate if its variation in the same text orientation is significant (this includes mirrored or upside-down characters, or significant changes of size and position relative to the baseline). Some exceptions are given to math symbols (including letter-like ones), which are encoded specifically with their math semantics for use in math, but not for general-purpose text.

This is an entirely novel theory of encoding, and one, that I would like to point out, is very much your personal view.
It does not have a foundation (or echo, or equivalent) in anything that really defines how encoding is done for the Unicode standard. A./
Re: Missing geometric shapes
On 11/11/2012 9:26 PM, Philippe Verdy wrote: 2012/11/12 Asmus Freytag:
>> However, the half-filled, five-pointed stars are garden-variety type symbols, and, as I keep pointing out, they absolutely fall within the scope of geometrical symbols for which there is ample precedent supporting both plain-text usage as well as a standardized encoding.
> I oppose your argument of "garden-variety type symbols" because consistency of this usage with a defined pattern is not demonstrated, including in the precise domain where they are found.

None of the geometric symbols have a precise domain where they are used. Typical for these symbols is that they have a wide variety of uses, and therefore any encoding that tries to tie these characters to only some specific usage is doomed to fail. That does not mean that it's not important to show that there is at least one usage that is consistent with plain text.

>> The suggested characters (they haven't actually been formally proposed yet) would in no way push the envelope.
> [1] We should only encode characters that users would reliably draw manually using a quill or roller pen, independently of color, or of the width of the tool used to draw strokes, or possibly to fill them; basic orientation of glyphs, however, will be a candidate if its variation in the same text orientation is significant (this includes mirrored or upside-down characters, or significant changes of size and position relative to the baseline).
> [2] Some exceptions are given to math symbols (including letter-like ones), which are encoded specifically with their math semantics for use in math, but not for general-purpose text.
>> This is an entirely novel theory of encoding, and one, that I would like to point out, is very much your personal view. It does not have a foundation (or echo, or equivalent) in anything that really defines how encoding is done for the Unicode standard.
> [1] The first part is a good real-life expression of what is meant by "abstract character" and the fact that we don't encode glyphs. So this is not so much a novelty (it is stated in the standard that we don't encode glyphs but only abstract characters, independently of orthogonal styles and tools used to render them).

This is not how abstract characters are defined.

> [2] The second part is the expression of the exceptions that have been made ONLY because there REALLY was a well-defined pattern of usage where the additional meaning of a precise style is consistent (and really HAD TO be)... allowing then these exceptions (the other exceptions have been for interoperability with older character sets for terminals that had almost no graphic capabilities). So this is also not a novelty. For now we lack the evidence of a consistent meaning in any given domain (not too specialized to a single source at a single place for this consistency).

This whole thing comes down to a misunderstanding of "semantic" in the context of the statement that abstract characters represent a semantic over a presentational aspect. The semantic of a letter 'a' is its a-ness - in contrast to all the other letters of the same script. The semantic of an integral sign is "integral sign", in contrast to the other mathematical operators. (If there were two alternate notations for integral, then Unicode would not encode the concept of "integral" but the several concrete symbols used to denote that concept - see, for example, ELEMENT OF, where there is such variation.) The semantic of a FULL STOP is to be a dot on the line.
It can represent many different concepts (from sentence period to abbreviation to domain-name separator to decimal mark), but all of these are a matter of convention, external to the encoding standard. Finally, the semantic of a geometrical shape is, in essence, the shape - how it is used in the context of text will give it additional meaning, but those meanings are not what is standardized.

A./
Re: Missing geometric shapes
On 11/12/2012 10:13 AM, Philippe Verdy wrote: 2012/11/12 Asmus Freytag:
>> That does not mean that it's not important to show that there is at least one usage that is consistent with plain text.
> That's exactly what I meant. There must be at least one precise domain where this usage is consistent.

No, there's no need for usage to be consistent. The only requirement is that it occurs. Unicode is not designed to be in the business of what people write, only in the business of enumerating the basic elements (written signs) needed for that communication.

In some cases, a wide variety of shapes will be understood to represent a single written sign, with the alternation being stylistic. That's the case you have with letters and fonts. In other cases, it is not possible to decide - a priori, or reliably, or ever - what variance in shape can legitimately happen under the umbrella of a single written sign (as conventionally understood). At some point, all you have to go on is the shape itself.

Whether an arrow is barbed or not, single- or double-stroked, filled or outlined makes no difference in its basic identification as "arrow", and no difference at all when it is used merely to point. However, different contexts (mathematics, for one) have ascribed conventional meanings to some of the various appearances. In order to make the case for encoding them, the primary task is to show that they can and will be used in contrast. If that can be shown, the details of what each style represents are of lesser importance. Those details come into play when the use is not so much one that is in contrast with the generic usage, but one where a convention arbitrarily requires a specific shape - and followers of that convention will not recognize a generic substitute as being the particular written sign in question.

A./

> I certainly did NOT mean ONE AND ONLY ONE. So all the rest about the (for example) use of the full stop for various purposes is not relevant: at least some of these uses are consistent in their domain. But for now we've not seen any for the half stars, and I don't know why you think they will be more important to encode than the various other representations of ratings, or similar concepts like gauges, which, in all their many variants, largely overwhelm the particular cases where a half star MAY very infrequently be used without any consistency, as if it were a sort of standard (the purpose of encoding in Unicode is to endorse such an existing standard or norm, either national or international, or one adopted by a measurable community over some large enough period, and not in isolated documents, whatever their medium, electronic or physical).
Re: Caret
On 11/12/2012 1:27 PM, Khaled Hosny wrote:
> I'm not sure where you are getting your statistics, but I have to deal with all those "rare" and "extremely rare" situations all day.

Khaled, don't mind Philippe - his experience is a bit on the theoretical end.

A./
Re: Caret
On 11/12/2012 7:13 AM, David Starner wrote: On Mon, Nov 12, 2012 at 4:39 AM, Julian Bradfield wrote:
>> Again, it depends. A user-oriented editor will treat é as a single unit anyway, for text manipulations. In my programmer-oriented editor, when the cursor is on e or ́, the two code points are displayed separately instead of combined, so again there is no ambiguity.
> What do non-English-speaking programmers do? It seems that if I spoke good Hindi or Arabic and little to no English, it would be deeply frustrating to try and use comments and strings in such an editor.

As a programmer, you do want to be able to edit *and view* strings as sequences of code units. Doing so only in the context of binary memory dumps gets tedious. (In English, this means, for example, being able to view whitespace easily - a task that too many editors make hard.)

For typing comments and strings, the display would not be an issue, because any partial characters would be handled the same way as in regular word processing. Editing the middle of a word might be different, but smarter editors could turn that feature off for comments. For strings, it's something you'd want more often - it depends a bit on what programs you are writing. When inspecting strings, I certainly would want to be able to distinguish between a precomposed and a decomposed e-acute, and whether I know English has nothing to do with it.

A./
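For concreteness, a minimal Python sketch of the distinction at issue - precomposed vs. decomposed e-acute render identically but differ as code point sequences (only the standard unicodedata module is used):

    import unicodedata

    nfc = "\u00E9"    # é precomposed: LATIN SMALL LETTER E WITH ACUTE
    nfd = "e\u0301"   # é decomposed: e + COMBINING ACUTE ACCENT

    print(nfc == nfd)                                # False: different code points
    print(unicodedata.normalize("NFD", nfc) == nfd)  # True: canonically equivalent
    print([f"U+{ord(c):04X}" for c in nfd])          # ['U+0065', 'U+0301']

An editor that shows the underlying code points makes exactly this difference visible.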
Re: Missing geometric shapes
In the business of character encoding, it's not helpful to try to construct algorithmic rules that lead from one set of conditions to the state of "encoded". It just doesn't work that way. What does work is to think of factors, or criteria, that you can use in weighing a question. Certain factors weigh in favor of encoding, others don't (or have large negative weights - logos currently have infinite negative weights :) ). Many of these criteria managed to get written down in the Policies and Procedures document and have been helping Unicode and WG2 decide encoding questions. Others are still mainly present in the collective consciousness of the encoding committee. Such is life.

What's not helpful is for outside observers to propound theories of encoding that are seemingly based on more algorithmic foundations, or that embody more rigid or formulaic requirements for this, that, and the other thing. It's not that meeting certain requirements isn't helpful in advancing the case for encoding a character or symbol, but rather that it works only by increasing the weight in favor, not by flipping a switch up or down. It's really important not to mischaracterize the nature of the character encoding business in this way.

That's all I want to contribute to the current thread.

A./
Re: xkcd: LTR
On 11/27/2012 5:39 AM, Masatoshi Kimura wrote:
> (2012/11/27 20:27), Philippe Verdy wrote: …
> Could you please stop spreading an unfounded rumor such as "Firefox is wrong because it ignores the lack of an HTML5 prolog"?

Getting Philippe to stop spreading unfounded anything is a near-impossible task. :)

A./
Re: UTF-8 ill-formed question
On 12/11/2012 11:50 AM, vanis...@boil.afraid.org wrote:
>> From: James Lin
>> Hi, does anyone know why ill-formed sequences occur in UTF-8? Besides the fact that they don't follow the pattern of UTF-8 byte sequences, I am just wondering how or why. If I have the code point U+4E8C (二), in UTF-8 it's E4 BA 8C, while in UTF-16 it's 4E8C. Where does this BA come from? thanks -James
> Each of the UTF encodings represents the binary data in different ways, so we need to break the scalar value, U+4E8C, into its binary representation before we proceed: 4E8C = 0100 1110 1000 1100. Then we need to look up the rules for UTF-8. They state that code points between U+0800 and U+FFFF are encoded with three bytes, of the form 1110xxxx 10xxxxxx 10xxxxxx. So, plugging in our data (split 4 + 6 + 6 bits):
>
>       4    E    8    C
>       0100 1110 1000 1100
>         //       \\
>     + 1110xxxx 10xxxxxx 10xxxxxx
>     = 11100100 10111010 10001100
>       E   4    B   A    8   C
>
> -Van Anderson

Nice!

A./

PS: I fixed a missing \
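The same derivation as a runnable check - a minimal Python sketch that hand-assembles the three-byte form and compares it with a built-in encoder (the function name is invented for the example):

    def utf8_three_bytes(cp: int) -> bytes:
        # Hand-encode a scalar value in U+0800..U+FFFF as three UTF-8 bytes.
        # (Surrogates, U+D800..U+DFFF, are not scalar values.)
        assert 0x0800 <= cp <= 0xFFFF and not 0xD800 <= cp <= 0xDFFF
        b1 = 0xE0 | (cp >> 12)           # 1110xxxx: top 4 bits
        b2 = 0x80 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
        b3 = 0x80 | (cp & 0x3F)          # 10xxxxxx: low 6 bits
        return bytes([b1, b2, b3])

    print(utf8_three_bytes(0x4E8C).hex())   # e4ba8c
    print("二".encode("utf-8").hex())        # e4ba8c - the same answer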
Re: wrongly identified geometric shape
In relating the size of different series of geometric shapes to each other, the relevant aspect is not the height of the ink but the area, in my opinion. I'm currently not able to take the time to sift through the various documents and propose a resolution, but I would like to make sure that this point is not lost.

A diamond of the same height as a square becomes, effectively, an inscribed diamond. When you compare the areas, the difference is 50% (!). Seen in running text (and not next to each other), the diamond will look smaller, even though it has the same height. For the other shapes, the same effect exists, but it is not always as severe. Whether one matches the size of a hexagon with that of a pentagon by height or by area, for example, may not result in an obvious and observable difference in the impression of their relative size.

Ideally, as an author or font designer, I would aim for a set of symbols that have the same optical weight, or impression of weight. Shapes that are more compact might be allowed to have a little more area, because they might otherwise look short. But in this balancing act, I would expect the most functional (and pleasing) choices for mathematical use to be those where the shapes end up rather closely (but perhaps deliberately not perfectly) matched in area.

As to whether the exact size progression within each series is best realized as a geometric, linear, or some other progression, I can't suggest a definite answer right now. In terms of text sizes for CSS, the concept of fixed ratios seems to be prevalent. Ideally we would have some more input from mathematical typesetters and font designers. Whatever the progression ends up being, it would require that all steps can be distinguished in on-screen viewing at some point size (and a traditional, not bleeding-edge, set of DPI values).

A./

On 12/17/2012 3:37 PM, Michel Suignard wrote:
> Philip, it would have helped if you had updated your critique of N4115 to the current proposed code points. The updated version is N4384 (L2/12-368). The number of characters proposed and their allocation have changed, although the status of the geometric shapes has not changed much. I spent some time analyzing your documents, and I can see you are trying to harmonize the size of the diamond and the square shapes by applying the concept that the length of a side should dictate the 'size', not the ink height. By doing so you force the rule found at the small sizes onto the larger sizes, which makes you deviate from the current TR25 recommendation; basically you are sizing down all the squares to match the diamonds. For example, the regular-size square side would now be slightly above half the EM box side, which is what a medium square is today. And at the end you still have to add a new XL size which is not part of TR25.
> I also looked at the current font implementations of squares, and they are all over the place in relative sizes, but all have bigger sizes than what you propose. By far the most consistent set is the Wingdings set, but there are so many size intercorrelations in geometric shapes that I can't just put them in the charts. What I have found consistently among implemented fonts is a large gap between 'small' and 'very small', which reinforces my introduction of 'slightly small'. As long as we don't try to force the diamond scaling onto the square scaling, I don't see an issue with the current schema. The name 'slightly small' is not exactly pretty, but we are running out of adjectives here.
> If you had made the same arguments a year or more ago, it would have been easier to influence the content of amendment 1; now it is quite late. Geometric shapes representation is always subjective, and various schemas can be used. The one used in N4115 does not try to merge the size scales of squares and diamonds (I don't think there was a mandate to do so). Another goal was to take into consideration existing practice among math fonts. None of that is cast in stone, and I am sure we will see more fine-tuning when math fonts implement the full set of these geometric shapes. The mapping of Wingdings/Webdings into Unicode is not frozen, and TR25 is still a work in progress. Always open to civilized discussion. Using terms such as 'idiot' and 'the arithmetic involved shouldn't challenge the average 12-year old' will guarantee no answer on my part in the future. Best regards, Michel
>
> On 2012/Dec/08 02:34, Michel Suignard wrote, quoting philip chastney:
>> anybody converting a document currently using Wingding fonts to one using Unicode values and Unicode fonts instead, using the transliteration proposed in N4384, will find their squares somewhat diminished in size (in this case, by one third) this is …
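The arithmetic behind the "50% (!)" figure in the message above is quickly verified; a small sketch in plain geometry (nothing assumed beyond the shapes' bounding boxes):

    import math

    s = 1.0                    # side of the upright square = common ink height
    square_area = s * s
    diamond_area = s * s / 2   # rhombus: (d1 * d2) / 2, with both diagonals = s

    print(diamond_area / square_area)  # 0.5 - half the area at equal height
    # For equal area, the diamond's height must grow by a factor of sqrt(2),
    # which is exactly when it starts to bleed out of the EM box.
    print(s * math.sqrt(2))            # ~1.414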
Re: wrongly identified geometric shape
On 12/17/2012 10:55 PM, Michel Suignard wrote:
> Asmus, TR25 today takes an intermediate approach (ref page 19): the diamond exceeds the height, but its sides are smaller than those of the 'equivalent' square.

Which is what I suggested below.

> In fact, at smaller sizes there are equivalences between the sides of diamonds and squares, but at larger sizes the square sides become increasingly larger than the diamond sides. At some point you can't fit the diamond into the EM box without bleeding over the edges, which is not acceptable.

In a true mathematical font you have very tall glyphs that are not appropriate for a mere symbol or dingbat font. Just consider the integral signs.

> If you take the ink-area rule to the letter, you would have to decrease significantly the ink for the larger squares to match the diamond ink for the same 'size'. Again, if you look at the current version of TR25 (page 20), the ink for the diamonds is noticeably smaller than the ink for the same-'sized' squares in the context of large sizes. In an ideal world we would define all geometric shapes consistently, but they were created in an ad hoc manner, and it becomes increasingly difficult to define consistent and uniform rules without creating regression issues. It is inherently difficult to use the same scale progression between shapes that fit nicely in the EM box (squares and circles) and spiky shapes (diamond, lozenge). Added to that, the STIX fonts, which are a common implementation of math symbols, tend to size squares and circles even larger than the Unicode charts. So any effort to size down these shapes (implied by an alignment with the diamond sizes) would run opposite to current practice.

I think trying to solve this on the character encoding level, without double-checking with the wider mathematical/typographic community, is a mistake. We did some outreach when we came up with the specifications in the original TR#25, but it may well be that this could use some updating based on new input, new experience, and the new characters. STIX is an important element for this, but perhaps not the only one - if you know you are using the STIX fonts, you can adjust your styles or glyph (character) selection to tweak the outcome, something that isn't an option for a generic prescription.

A./
Re: Character name translations
On 12/20/2012 2:52 AM, Martinho Fernandes wrote:
> Hello, I was wondering if there is a list of character names translated into other languages somewhere. Is there?

A French list was created, and for a while maintained, with funding from the Canadian government. It covered the complete list of Unicode names for the version of Unicode at the time. It was hosted at the time on the Unicode site - there were issues because it's no longer fully up to date. I don't know the current status. There was a subset list of names, based on a much earlier version of the Standard, in Swedish. I have no idea where that is accessible, if anywhere. There have been efforts at a Japanese translation of the text of the standard; I have no idea whether that contains translated names for characters.

For many scripts, the character names consist of a prefix identifying the script, a designator that distinguishes a basic classification such as capital/small letters, vowels, consonants, and a part that is often some transliteration of the character. After translating the script name and the few words for these designators, what remains is the selection of an appropriate transliteration scheme for that script in the target language. For most of these elements, existing translations should exist and be easily accessible from the usual dictionaries and online resources, except perhaps for the script names. Punctuation marks and symbols tend to have more detailed names and present more issues to a translator.

In all the translated lists that I have seen, it has been customary to use all uppercase letters, but to allow the use of accented characters - essentially replacing the notion of A-Z with something like the basic alphabet for the given language. Some languages may require certain punctuation marks, in addition to hyphen, because these marks form part of the words used for traditional names of characters. In many instances, translators have chosen to provide a new name for a character in the target language, usually based on a common name, or in analogy to other names in that language, rather than to translate the English name word for word.

It is unclear whether all languages benefit from an effort to translate all character names in the Standard, but having a cross-reference of character codes to local names for widely used characters (or those of regional importance) seems a worthy goal.

Character names serve two purposes, which are sometimes at odds. One is to simply act as formal identifiers that are more or less mnemonic (which the hex codes are not). The other is as an aid in identifying a character, for look-up or selection. For the latter case, the formal names can be insufficient, because at times they are very arbitrary and don't represent the most common name, or because there isn't a single, common name for the character. The French translation therefore wasn't limited to the character names; it translated the full character names list (what is used to print the code charts) with all the alternate descriptions (aliases) and annotations for the characters. Once you do that, it's clear that the work is indeed useful to ordinary users, because you enable them to search for a character by some word in their own language, and it is no longer a question whether you are translating pure identifiers.

A./
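To make that decomposition concrete, a toy Python sketch: unicodedata supplies the formal names; the translation table and its German renderings are invented for the illustration and ignore the word-order questions a real translation would face:

    import unicodedata

    # Hypothetical renderings - not any official translation.
    PARTS = {
        "LATIN": "LATEINISCHER",
        "CAPITAL LETTER": "GROSSBUCHSTABE",
        "SMALL LETTER": "KLEINBUCHSTABE",
    }

    def translate_name(ch: str) -> str:
        name = unicodedata.name(ch)
        for en, target in PARTS.items():
            name = name.replace(en, target)
        return name  # the transliterated part ("A") passes through unchanged

    print(translate_name("a"))  # LATEINISCHER KLEINBUCHSTABE A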
Re: Character name translations
On 12/20/2012 7:26 AM, Leif Halvard Silli wrote: Andreas Prilop, Thu, 20 Dec 2012 15:41:28 +0100 (CET):
>> On Thu, 20 Dec 2012, Jukka K. Korpela wrote: http://www.ling.helsinki.fi/filt/info/mes2/
>> Unicode names have certain restrictions (capital ASCII letters, etc.). This Finnish list even uses non-ASCII characters but sticks to capital letters. Why no small letters, if non-ASCII letters are allowed? Which characters could be used for a Russian translation? Cyrillic letters? Only capital letters? If so - why?
> My impression is that Unicode character names are limited - in order of priority - to: 1. language (en-US), 2. character set (US-ASCII), 3. uppercase; that is: the language, then letters + digits + some punctuation, then UPPERCASE. What is the basis for the choice of uppercase? The probable answer might be that it sticks out. It makes the name appear as code rather than ordinary words (which could otherwise lead to mistakes: is it a word or a code?). The same way of thinking, *plus* a desire to look like Unicode, could justify why translations into e.g. Finnish and Russian would apply the same rules.

If you take the Unicode character names in the context of OTHER information about the character, as presented in the Unicode character names list (code charts) for example, then being able to distinguish the formal names (UPPER CASE) from informal aliases (lower/mixed case) is very handy.

In my other message, I made clear that I think translations of just the names are a lot less useful than a translation of the full information presented in the code charts, which includes block (and therefore script) names, annotations, and a listing of alternate names by which these characters are known to ordinary users. If your language uses a bicameral script, then the easiest way is to follow the same (or analogous) typographical conventions as the original text.

A./

PS: Some languages use punctuation in forming words. If avoiding such use would make the names appear artificially restricted, such use might be allowed in addition to HYPHEN-MINUS for such a language.

PPS: Ideally, the translated character names obey the same uniqueness under analogous "loose matching" rules as the original character names, and where formally published by a standards organization as a 'national' version of 10646, one would expect similar guarantees of name stability.
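For the PPS, a simplified fold in the spirit of the loose matching rule (UAX44-LM2 ignores case, whitespace, underscores, and medial hyphens; the real rule has an exception for HANGUL JUNGSEONG O-E, omitted here). A translated name list would want uniqueness under an analogous fold:

    import re

    def loose_key(name: str) -> str:
        # Simplified: drop spaces, hyphens, and underscores, then fold case.
        return re.sub(r"[-_\s]+", "", name).upper()

    assert loose_key("ZERO WIDTH SPACE") == loose_key("zero-width space") \
           == loose_key("Zero_Width_Space")
    print(loose_key("ZERO WIDTH SPACE"))  # ZEROWIDTHSPACE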
Re: Character name translations
On 12/20/2012 2:36 PM, Jukka K. Korpela wrote: 2012-12-20 14:13, David Starner wrote:
>>> It may be useful to try to agree on official or semi-official names for characters in a language. Such a list hardly needs to cover all of the over 100,000 Unicode characters.
>> Why not? Why should an English speaker sticking an arbitrary character into a character map program get a name for it, but a non-English speaker not?
> For most characters, a "translated" name would be arbitrary. I would compare this to the names of biological species. Most species lack names in most languages, and when names exist, they are often vaguely and inconsistently used.

But when real people, not biologists, want to look up information, they have precisely two choices: they can look at a visual index (for species that can be arranged visually), or they can look up the scientific name for the species based on the only thing they know: the local popular name.

> That's why people use scientific (Linnaean) names. We use common names for common animals, but it just would not make sense to assign a name to the millions of insect species in each human language. The scientific name is a crucial key to information. With Unicode characters, both the number and the name act as such keys, though the name is usually descriptive of meaning, too.

Unlike species, all characters of living scripts have popular local names in at least one language other than English. It may not be desirable to blindly translate ALL such names into ALL languages, but major languages (not only English) may be used by people who are familiar with, or study, many other languages and scripts. For those languages, their community of scholars represents another set of users who benefit from translated names. Finally, for arcane scripts there's usually an easily translatable part of the character name (think of LATIN SMALL LETTER) and an arbitrary part of the name (e.g. A) which comes from a transliteration scheme, a catalog number, or the like. If a language doesn't have a unique transliteration scheme for a particular script, the choices are to either use the one present in the Unicode Standard, or to use one from another, culturally more relevant language (e.g. a French-based instead of an English-based transliteration).

>>> So Unicode names should not be translated at all, any more than you translate General Category values, for example.
>> Why wouldn't you?
> Because those values are identifiers.

No - names have multiple uses, especially if you take the formal name as one in a series of aliases for each character. That's why it's often more useful to think of translations of the full code charts and character index, instead of just the formal names. (The latter, by themselves, are not so useful.)

>> There's an argument that they're generally useful for programmers only, and programming often requires English knowledge, but if I were explaining the character categories in Esperanto, I would certainly say that Sm is "matematikaj simboloj" or "Simbolo Matematika", not act like "Symbol, Math" should have any importance to my audience.
> We can and often should *explain* the meanings of identifiers in different languages, but that's different from naming things. The value "Sm" has a technical meaning, and it is not identical with the common-language expression "mathematical symbol" or its variants, though rather close.

The linguistic content of the short labels is indeed limited; however, I can see good reasons to provide alternate abbreviations for characters, e.g.
for ZWSP or WJ, because these terms are used in places where they do not act as identifiers. A./
Re: Character name translations
On 12/20/2012 5:12 PM, Philippe Verdy wrote:
> Given the form of these names in the UCD, most of them could be translated automatically using a common dictionary and resolving some terminology that is approximate in Unicode.

If you mean by that: can usefully be done with a translation memory, I'd agree - but nothing fully automatic, I'm afraid.

> But translated names should not be capitalized and not restricted to plain ASCII (including in US English; that is not really the human language used in the standard names of the UCD, but names in a computer language, like in the default C locale).

Why? Why not? The reason existing translations used uppercase is that they were trying to translate existing documents (e.g. code charts) where that convention was used. If you merely create some DB for other purposes, by all means use what's appropriate. There's just no one-size-fits-all, one way or the other.

A./
Re: When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs
On 12/23/2012 3:55 PM, Joó Ádám wrote:
> Roger, thank you for sharing this excerpt, I truly enjoyed it. You drew my attention to a book I should definitely have a look at.

It's definitely a nice way to introduce people to this book, or remind them of it. I'm sure some misguided publishers would like it if one had to get permission even to quote the endorsements from the cover text, but I find that attitude silly and, frankly, counter-productive. If anything, the combination of this particular excerpt and source should help generate more interest in obtaining the book among people who don't have a copy yet. It was not as if Roger gave away the plot, or pulled out the only memorable part of the book.

> I must agree with Karl: I was surprised by Jukka's reaction, since this kind of quotation is both legally and ethically unquestionable here, in the very center of Europe...

Glad to hear that. I also agree with the points that Karl had raised.

> I am not willing to be silent when what I perceive to be bullying is expressed on this list. This should be a safe place for any newbie to post. I found an unwarranted aggressiveness in Jukka's response to Roger's apparently well-intentioned post. unicode.org is based in the USA. As another poster said, this quotation would be considered fair use under USA law. Quoting like this is extremely common in USA writings. The post uses only US-ASCII. I'm sure that Jukka knows that US-ASCII does not have an EM dash. The standard I was taught in school (in the USA) was to represent an EM dash in such situations precisely as the original post does, as a sequence of two hyphen-minuses. I do not believe that either the EM dash nor the miscapitalization of a word constitutes distorting the text, and I find it difficult to believe that Jukka really does either. Therefore I believe that Jukka was not being honest in his response to the post; it appears to me that he concealed the real reason he objects to it. I could be wrong, and perhaps there are cultural differences between the USA and Finland that are being unconsciously expressed here. But I can tell you that as a native USA English speaker, I found nothing wrong with the original post. And I found Jukka's response objectionable.

There are cultural differences in the way users on online forums (and lists) correct each other's spelling and punctuation. I'm thinking of a few examples, but even the more relaxed ones will try to encourage some minimal standards, like avoiding ALL CAPS, reducing the use of totally random spelling, or introducing a minimal number of paragraph breaks. Some of these things just grate on people's ears, and there's always someone who says "enough" and posts some suggestions for the newbie. Usually, the tone of such messages fits the style of the list. There are cultural differences there as well (culture of the group, that is).

The fact that Roger's post was a quote wasn't clear to me until I reached the attribution at the very end. Had I stopped reading half-way through, I would have attributed the clever words to him. I'm sure that was not his intention. Because of that, though, there's a reason that the reader may find the use of the quote subconsciously more questionable than if Roger had given the source up front (and perhaps included a sentence or two of his own on whether he can recommend the book and why).

Sometimes a post can rub somebody the wrong way - something that may have less to do with what's in the post than with the state the reader is in when he comes across it.

A./
Re: Interoperability is getting better ... What does that mean?
On 12/30/2012 1:22 PM, Costello, Roger L. wrote:
> Hi Folks, I have heard it stated that, in the context of character encoding and decoding, "interoperability is getting better." Do you have data to back up the assertion that interoperability is getting better?

The number of times that I receive e-mail or open web sites in other languages or scripts WITHOUT seeing garbled characters or boxes has definitely increased for me. That would be my personal observation. More people are sending me material in other scripts and languages, whether on this list or via social media. Interoperability, as measured in those terms, has clearly improved as well; again, as experienced personally.

I still see the occasional garbled characters, most often because of a Latin-1/Latin-15 mismatch with UTF-8. Interoperability is not perfect. There's also no real reason to continue to create material in those 8-bit sets, especially if the data is mislabeled as UTF-8 (or sometimes vice versa). In my experience, the rate of incidence for these appears to be going down as well, but I'm personally not running an actual count. I can imagine that there are places (and software configurations) that expose some users to higher rates of incidence than I am experiencing.

Rather than dissecting general statements such as whether "interoperability is getting better" or not, it seems more productive to address specific shortcomings of particular content providers or tools. In the final analysis, what counts is whether users can send and receive text with the lowest possible rate of problems - and if that requires a transition away from certain legacy practices, it would be important to focus the energies on making sure that such a transition takes place.

A./
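The mismatch mentioned above is easy to reproduce; a minimal Python sketch of both directions of the mislabeling:

    text = "café"

    # UTF-8 bytes wrongly decoded as Latin-1: the tell-tale mojibake.
    print(text.encode("utf-8").decode("latin-1"))   # cafÃ©

    # Latin-1 bytes wrongly decoded as UTF-8: usually an outright error.
    try:
        (text + "!").encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        print("not valid UTF-8")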
Re: Interoperability is getting better ... What does that mean?
On 12/30/2012 3:19 PM, Leif Halvard Silli wrote:
> My feeling is that interoperability is getting better everywhere. But one field which lags behind is e-mail. Especially Web archives of e-mail (for instance, take the WHATwg.org web archive). And also some e-mail programs fail to default to UTF-8.

Archiving seems to occasionally destroy whatever settings made the original work. I have seen that not only with e-mail, but also with forums that have a separate archive format. Time to get those tools to move to UTF-8.

A./
Re: Interoperability is getting better ... What does that mean?
On 12/31/2012 3:27 AM, Leif Halvard Silli wrote (replying to Asmus Freytag, Sun, 30 Dec 2012 17:05:56 -0800):
> The Web archive for this very list needs a fix as well …

The way to formally request any action by the Unicode Consortium is via the contact form (found on the home page).

A./
Basic Latin
On 1/1/2013 3:53 PM, Naena Guru wrote:
> (By the way, Unicode is quietly suppressing the Basic Latin block by removing it from the Latin group at the top of the code block page (http://www.unicode.org/charts/) and hiding it under different names in the lower part of the page.)

I don't know what you mean here; you get it by clicking on the header "Latin" at the very top of the Latin group. The word "basic" was deemed redundant in the index (a choice that you can argue about forever - if space wasn't at a premium on that page, it might have been an easy decision to add an alias).

A./
Re: Terminology: does the term codepoint apply to non-Unicode character sets?
On 1/1/2013 12:43 PM, Costello, Roger L. wrote:
> Hi Folks, does the term "codepoint" apply to non-Unicode character sets? For example, are there codepoints in iso-8859-1? In Windows-1252? /Roger

The short answer is yes. The term "code point" was in use for locations in IBM code pages long before Unicode was created; in the context of other standards, slightly different terms were in use, such as "code location". (Windows-1252, while created by Microsoft, was registered in the IBM code page collection at the time, which assigned to it the number 1252, so the use of "code point" for that character set is definitely an extension of the earlier usage.)

It's worthwhile, if you operate in the context of some other standard, to make sure you follow the terminology as defined there; but for general use, the word "code point" is not tied to or reserved for Unicode (just make sure you are clear about which character set you are talking about). Both spellings, with and without the intervening space, can be found, but Unicode uses the term only without the space.

A./
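A small Python illustration of the two senses side by side: a code point as a position in a legacy code page, versus the Unicode code point of the same character (cp1252 is Python's codec name for Windows-1252):

    euro = "\u20ac"  # €

    print(euro.encode("cp1252").hex())  # 80     - its position in Windows-1252
    print(f"U+{ord(euro):04X}")         # U+20AC - its Unicode code point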
Re: Terminology: does the term codepoint apply to non-Unicode character sets?
On 1/2/2013 9:00 AM, Doug Ewell wrote: Asmus wrote:
>> Both spellings, with and without the intervening space, can be found, but Unicode uses the term only without the space.
> This didn't sound right to me, so I checked the Glossary, and it lists the term as two words with a space. http://www.unicode.org/glossary/#code_point
> -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org | @DougEwell

OK. There are a few terms where Unicode doesn't use the space. I could have sworn this was one of them; looks like I got it backwards.

A./
Re: Basic Latin
On 1/2/2013 3:26 PM, Jukka K. Korpela wrote: 2013-01-03 0:22, Markus Scherer wrote:
>> The page has been modified to add an alias for Basic Latin (ASCII) under the Latin heading.
> I can see that, but I don't think it's an improvement. It puts the Latin script in a special status.

The special status results from the fact that nearly all other scripts don't use the word "Basic" but have a block whose name is equal to the name of the script. The other exception is that this block happens to be the most looked-up block, so a small change accommodates many users. The purpose of the index page is to allow people to find what they are looking for, and when they are looking for "Basic Latin" because of the block name, they should not be required to do mental gymnastics to puzzle out where that block might be hidden.

> And it makes both "Latin" and "Basic Latin (ASCII)" links to the same page, violating fundamental accessibility principles: duplicate links should be avoided, and when they can't be avoided, they should have exactly the same link texts.

Nice principle, but utterly misapplied. Look at any book index and you will find the same page (even passage) indexed under multiple terms - as appropriate. And, if you look at the page source for the chart index, you will find that there are already several links to the same page in other instances. So this change is not some kind of dramatic departure. The original design was created the way it was based on considerations like the ones you raise here. Over time, evidence piled up that this was creating a usability problem. That has been fixed, so now we can all move along; nothing to see here.

A./
Re: holes (unassigned code points) in the code charts
On 1/4/2013 2:36 AM, Stephan Stiller wrote:
> All, there are plenty of unassigned code points within blocks that are in use; these often come at the end of a block, but there are plenty of holes as well. I have a cluster of interrelated questions:
> 1. What sorts of reasons are there (or have there been) for leaving holes? Code page conversion and changes to casing by simple arithmetic? What else?

There are a number of reasons why a code chart may not be contiguous, besides the reasons you give. Sometimes a character gets removed from the draft at the last minute; in those cases, a hole may be left. In general, the possible reasons for leaving a hole cannot be enumerated in a fixed list. It's more of a case-by-case thing.

> 1.1 The rationale for particular holes is not documented in the code charts I looked at; is there documentation? (Yes, in some instances the answer can be guessed.)

In general, no. Sometimes there's an explanation in the text.

> 1.2 How is the number of holes determined? It seems like multiples of 16 are used for block sizes merely for practical reasons.

Blocks end on a value ending in F in hexadecimal notation.

> 2. I notice that ranges are often used to describe where scripts are found. Do holes have properties? Are there other block-related policies that give holes a certain semantics?

There are default values for some properties that can be applied to unassigned code points, in order to let an algorithm do its best with as-yet-unassigned characters (so that if a new character is created, the algorithm doesn't necessarily have to be reimplemented but still gives good results). There's no distinction between holes and other unassigned code points.

> 2.1 If not, how likely is it that Unicode assigns script-external characters to holes?

It's generally not desirable, but there's no firm policy that blocks must have a single script value (and in fact, no such restriction holds in existing blocks).

> 2.2 If yes, how does the number of assigned code points differ, if holes that are assumed to be filled only by certain types of characters are counted?

???

> 2.2.1 Would this make much of a difference wrt the question (this comes up from time to time, it seems) of how much of Unicode will eventually fill up?

If strong technical reasons exist for placing a character into the BMP, there will be a temptation to fill a hole if the BMP is otherwise full. Likewise, many, many years (decades) from now, similar pressure might exist should the rest of the code space become filled. However, the most likely scenario is that Unicode will continue for an indefinite period with sufficient open space (and the occasional hole).

> 3. Have there been mistakes wrt hole assignment? Stephan

Unicode doesn't make mistakes. :)

A./
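The point about default property values can be checked directly; a minimal Python sketch (the code points shown as unassigned were holes in the Greek and Coptic block as of this writing - holes do get filled over time):

    import unicodedata

    # U+0377 is assigned (GREEK SMALL LETTER PAMPHYLIAN DIGAMMA);
    # U+0378 and U+0380 are unassigned "holes" in the same block.
    for cp in (0x0377, 0x0378, 0x0380):
        print(f"U+{cp:04X}: {unicodedata.category(chr(cp))}")
    # U+0377: Ll
    # U+0378: Cn  <- General_Category defaults to Cn for unassigned code points
    # U+0380: Cn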
Re: Is that character U+A7AC LATIN CAPITAL LETTER SCRIPT G ?
On 1/10/2013 2:08 AM, Otto Stolz wrote: Hello, on 09/01/2013 18:07, Frédéric Grosshans wrote: Yes, but I actually don't know. I'd really like to have some idea of those old printing techniques, but I fear we're drifting to off-topic subjects... On 2013-01-09 at 18:16, Frédéric Grosshans wrote: Actually, the preceding tool combined with http://en.wikipedia.org/wiki/Mimeograph would be my best (uninformed) guess. I’d rather guess he used this technique: http://en.wikipedia.org/wiki/Dry_transfer. I have used it myself, in the 70s, to insert all those Greek symbols into the formulae in my Dipl.-Phys. thesis. It renders much clearer glyphs than the mimeograph technique. Best wishes, Otto Stolz Letraset (the market leader at the time) was indeed widely used by the 70s, but was it available as early as the date of the manuscript? The hallmark is absolutely identical letter shapes, but with a strong likelihood of small positioning errors (in both axes, and in rotation). The latter should show up on careful examination. Sometimes a letter could tear, or the thin foil could fold or crease upon transfer. Usually, in a careful production, one would redo the letter, but sometimes such small imperfections survive - and they look very different from defects in other forms of typography. A./
Re: Is that character U+A7AC LATIN CAPITAL LETTER SCRIPT G ?
On 1/10/2013 5:21 AM, Frédéric Grosshans wrote: On 10/01/2013 11:08, Otto Stolz wrote: Hello, on 09/01/2013 18:07, Frédéric Grosshans wrote: Yes, but I actually don't know. I'd really like to have some idea of those old printing techniques, but I fear we're drifting to off-topic subjects... On 2013-01-09 at 18:16, Frédéric Grosshans wrote: Actually, the preceding tool combined with http://en.wikipedia.org/wiki/Mimeograph would be my best (uninformed) guess. I’d rather guess he used this technique: http://en.wikipedia.org/wiki/Dry_transfer. I have used it myself, in the 70s, to insert all those Greek symbols into the formulae in my Dipl.-Phys. thesis. It renders much clearer glyphs than the mimeograph technique. I don't think so, because it is a 'real book' ( http://books.google.fr/books/about/La_th%C3%A9orie_des_particules_de_spin_1_2.html?id=3qzvMAAJredir_esc=y ), which was printed in enough copies to be available six decades later in several libraries and on sale on the internet for a reasonable price. The dry-transfer technique does not seem suited to such a publication. One would apply the dry transfer to the original typescript. The book itself would then be printed by some photo-mechanical means (e.g. PMT). I was involved in some print publication in the early eighties where the original was created using a variation of a photo-typesetting machine which, however, just created a single column of text. The output from that was pasted up (together with graphics) and then transferred photo-mechanically onto a drum for offset printing. Something analogous could easily have been done to a high-quality typescript with Letraset for the special characters. The fact that the book uses a typewriter-like font for the running text seems to hint at that. (Some later typewriter ribbons used a technique similar to dry transfer, unlike the inked ribbons for which early typewriters are known.) I don't remember ever learning the proper terms for all of these things, but it should be easy to find those buried in Wikipedia somewhere. A./ Frédéric
Re: help with an unknown character
http://ts2.mm.bing.net/th?id=H.4791646751032057pid=1.7w=176h=155c=7rs=1 ?? Relation? Visual or otherwise. Pun? (Note the similarity :widder: :wider:) Just thinking out loud. A./
Re: help with an unknown character
On 1/16/2013 5:35 PM, Philippe Verdy wrote: Fair enough. It's not a problem to ask the question, Is this a candidate for encoding? It becomes a problem when the poster assumes, because that blob appeared in such-and-so location, that it MUST be a candidate for encoding, and no level of argument about the character/glyph model, or the need to interchange the blob, or anything else, will change that person's mind. Was there any sign of such an assumption in the original question sent by Elbrecht? He just asks for help, nothing else. He does not request a new encoding. He just speaks about something he found for which there's no easy mapping to Unicode. Where Philippe is right, he is right. Yes, there are a few very obstinate individuals, but they are well known. However, it seems that frequent interaction with them has given the list an allergic sensitization. That is unfortunate. It should be possible to come to the list, even if one is convinced the sign, symbol or letter is new to Unicode. I would even claim that most people who post here are discouraged by the negative reaction anyway, and never file a submission - even if their case has merits. Heck, even the obstinate ones don't always get around to filing a submission :) The proper role of this list is to offer discussion, background and advice - it's not the ruling body; final determination of what is or is not a valid character belongs to the proper committee, such as the UTC. Something that occasionally gets forgotten.
Re: Spiral symbol
On 1/21/2013 4:11 PM, Andrés Sanhueza wrote: Hello. I have wondered if it may be a good idea to make a proposal for a spiral character, basically because I believe it is the only major symbol recurrently used to represent swearing in comics that's missing from Unicode. If it should come to a proposal, I can help out with one or two citations of the use of this symbol for that purpose, in contexts that are not that different from other lettering in the same sources - no more so than emoji are from regular words. A./ Most of the time it is replaced with the more common at sign (@), but still an actual one may be good. Not sure yet if there's enough documentation. Some emoji representations display the CYCLONE character (U+1F300) as one, yet I don't think that fits as a better replacement. Andrés Sanhueza
Re: End of story character
On Thu, 24 Jan 2013 20:05:41 -0300 Andrés Sanhueza peroyomasli...@gmail.com wrote: Do you think that an end of story symbol may be feasible/useful? My position is that the attempt to encode semantics that are defined on the whole-text level is a mistake. In fact, it is a common mistake that keeps surfacing in proposals or tentative proposals. When Unicode encodes semantics, it's on the level of individual symbols. If there were a recognized notation that defined an end of text symbol, then you could encode that in Unicode, and expect it to be rendered with ordinary stylistic variations (governed by font selection - with the font not selected just for that symbol, but once, for all aspects of that notation). Such a use would then be analogous to something like the integral sign, which has a (small) range of customary and conventional shapes, e.g. upright or slanted, bulky or slender, which fall into what anyone would consider stylistic variations. The precise variation is usually selected by choice of font, not just for the integral, but for a whole set of other mathematical symbols as well (the full notation, in fact). Placing a symbol of some sort at the end of a text is a fairly widespread convention, but there is no agreement on any set or range of customary shapes for that purpose. In a way, that makes this convention less a notation and more something different. In some ways it's more similar to the way that languages may agree on representing the concept house as a word, albeit with completely different sets of shapes (house, Haus, hus, maison etc.). For languages, those representations would be called spellings, and I think that's the appropriate concept as well for the end of story convention. Rather than conceiving of it as a single character with a range of glyphs, it's a convention on the whole-text level that is customarily expressed by different spellings (choice of abstract or pictorial symbol). Just as Unicode does not unify spellings, the different choices of symbol for end of story should remain disunified. Each user of the convention decides on an appropriate character or symbol for the purpose. (Another analogy would be list item markers, which are equally not unified into a generic control code with glyph variants, but are separate characters.) Because the semantic of the convention is not directly represented / representable on the encoding level, there's also no need to encode multiple characters of different shapes such as end of story-1, end of story-2, etc. Instead, like the use of . or , for the decimal point, the semantic of end of story comes from context. Whenever a symbol is placed consistently at the end of every story in a collection, that symbol acquires the end of story semantics. There are cases where Unicode has duplicated characters (using the same shape) based on which convention they happen to be used with. All these duplications are problematic in many contexts, however well intentioned they may have been. These cases make poor precedents and must be properly understood as the exceptions they are. The general encoding principle in Unicode remains that Unicode does not encode spelling - which means that symbols and other characters can be put into new contexts and there acquire new semantics for the human reader - without requiring the addition of dedicated code points. With this, we can turn back to the original question. Should an end of story character be encoded? The answer must be negative.
However, if particular shapes have been in widespread enough use for that purpose, but are not yet encoded in Unicode as symbols in their own right, then encoding such symbols for general use would be appropriate. Some of the fancier symbols used for end of story, on the other hand, might be better implemented as private use characters - for example, corporate logos at the end of magazine articles. A./
Re: End of story character
On 1/25/2013 6:52 AM, Joó Ádám wrote: Asmus, I would be happy to hear your opinion on my question, in the context of which I may not have been clear that my intent is not to propose a general character for all uses as an end-of-story sign, but a well-defined symbol based on both shape and usage pattern (a perfect filled square, appropriately sized based on x-height or whatnot, used as an end-of-story sign). The name may well be something more visually descriptive, not necessarily END OF STORY. Á Such a character would be a geometrical symbol. X-HEIGHT SQUARE ON BASELINE might be a descriptive name to distinguish it from other small square symbols that might happen to be in the standard already. Alternatively it might be considered a punctuation character, but the symbol is so generic that giving it punctuation semantics seems debatable. But I wouldn't exclude that option. Naming it end of story would imply that it is the only such character, so perhaps end of story square would be more suitable. I am a strong proponent of not unifying geometric shapes (or certain punctuation marks) merely on the ink part of their shape, while disregarding vertical or horizontal placement. Instead, such placement can be significant, and if there is evidence that it relates to differences in usage, I tend to support that as evidence for disunification. A./
Re: End of story character
On 1/25/2013 7:44 AM, Mark E. Shoulson wrote: On 01/25/2013 08:12 AM, Joó Ádám wrote: I don’t know of its use outside of Hungary, but here, as the quote from Halmos suggests, the tombstone is traditionally used in print magazines as end of story. We have adopted it on the web at the Weblabor magazine, where it stands at the end of all blog posts, so the reader knows whether it is worth opening the story on its own, or whether the excerpt on the front page was the whole story. We had a problem with U+220E END OF PROOF though, as in most fonts it is a rectangle, while in traditional use it is almost always a perfect square. So we decided to use U+25A0 BLACK SQUARE instead, which has its own problem, since it really is oversized for this usage, so we had to mark it up and scale it down. Most of the times I've seen it, it's actually some form of a logo of the magazine in question, or at least a square with the magazine's initial(s) in it. Those all seem to be specialized forms of END OF PROOF to me. It fits the semantics too; a black block at the end of the article. If some magazines use squarer blocks and some more rectangular, that's glyph variation. A good start at a counterexample might be a math journal that uses different-shaped blocks at the ends of its proofs and articles. Still might just be different fonts, but it does start to address it at least. As I point out in another post, the comparison to other conventions points to things like list bullets. Clearly, almost any character (or image) can be used as a list bullet. There simply is not a universal list bullet character, although BULLET is a very common character for that purpose. It would be a mistake, in my view, to conceptualize the use, say, of a square bullet as merely a glyph variant of such a universal bullet character. The correct view, in my opinion, is to see these as different spellings of the same general concept, a concept that is therefore not directly expressed on the level of character semantics (just as many other conventions that use characters are not represented directly on the character level - they merely use characters by some sort of convention). End of story markers can also be decorative. A boating magazine might use an anchor or a sailboat silhouette, for example, both representable by existing characters. As a result, the task should reduce to identifying whether there are generically usable symbols that are deployed as end-of-story markers. If these aren't encoded, they could be (make that: should be), while idiosyncratic symbols should probably not be encoded - and either represented as PUA codes or directly as inline images in rich text. As for representing the end of story semantic in a parseable way, that would be the domain of XML or similar structural markup, it would seem to me. Just because we speak of character semantics doesn't mean that all semantic aspects of a document need to be expressed on that level. A./
Re: Long-term archiving of electronic text documents
On 1/28/2013 5:12 AM, Martinho Fernandes wrote: Similarly, there could be a type of pdf document where the text within the pdf document were stored in UTF-64 format. FWIW, there is already a PDF variant designed for long-term archiving, known as PDF/A. You may want to look into that. Good point. Also, and that is a reply to William's original suggestion, please note that each new format that is introduced, especially on the character code level, means adding a circle to the big Venn diagram - dividing ALL tools into a huge number of those that cannot handle the format and an initially minuscule number that can. The small spot in the diagram reserved for tools that can handle ALL formats (or at least those desired for archiving) will correspondingly shrink. As a result, instead of improving the survivability of electronic text documents, you've found a way to make it less reliable - any given combination of recovery tool and format may not work, and the more combinations there are, the lower the probability of success. A./
Re: Long-term archiving of electronic text documents
On 1/28/2013 4:30 AM, William_J_G Overington wrote: The idea is that there would be an additional UTF format, perhaps UTF-64, so that each character would be expressed in UTF-64 notation using 64 bits, thus providing error checking and correction facilities at a character level. I think this proposal is a few weeks early, and that it should be resubmitted on the proper date, but as UTF-256 - for greater redundancy. UTF-256 allows each hex digit of UTF-32 to be expressed as an ASCII hex digit (characters 0-9 and A-F encoded as bytes 0x30-0x39 and 0x41-0x46). This leaves two bits per hex digit unused, which could be utilized for bit-level error correction, or you could go to UTF-512 and encode each code point twice. The possibilities are endless. A./
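For the record, the scheme described is trivially implementable; here is a minimal sketch, taken as literally as the tongue-in-cheek description allows. The function names are mine, and nothing here is a real or proposed UTF.

    def utf256_encode(text: str) -> bytes:
        # Each code point becomes the eight ASCII hex digits of its
        # UTF-32 value (so 'A', U+0041, becomes b'00000041').
        return b''.join(format(ord(c), '08X').encode('ascii') for c in text)

    def utf256_decode(data: bytes) -> str:
        # Eight hex digits per code point; no error correction attempted.
        return ''.join(chr(int(data[i:i + 8], 16)) for i in range(0, len(data), 8))

    assert utf256_decode(utf256_encode('Aß😀')) == 'Aß😀'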
Re: External Link (Was: Spiral symbol)
Mark, in my view, the key aspect of the notice cited by Debbie is the rejection of an external link semantic, which would act as a kind of generic code and could be rendered in many different ways. Instead, the notice leaves open a request to standardize a particular shape, which could then be used as an external link symbol by anyone wishing to use that particular shape for that purpose. I happen to believe that the UTC got that one right, but I do see room for encoding a particular shape, if there's a user community behind it - whether based on passive evidence or, preferably in my view, active support. Passive evidence is usually the preferred method for support, but in this case you may well run into a chicken-and-egg problem, unless you can find, say, a significant set of PDF documents where actual glyphs were used. Active community support might be tricky because, unlike currency symbols or mathematical notation, it's not clear what constitutes a representative user community. However, if a community could be found to whom the preservation of this symbol matters when documents are converted to plain text, then that should help the case. The fact that this keeps bubbling up is, to me, a sign that the notion that this ought to be a character is widespread - that certainly satisfies one of the necessary conditions, but as the UTC notice shows, it's not a sufficient one. A./ On 1/31/2013 3:53 PM, Deborah W. Anderson wrote: Mark, The External Link symbol has been proposed*, you are correct, but it was rejected by the UTC. See the Notice of Non-Approval, dated 06 June 2012: http://www.unicode.org/alloc/nonapprovals.html Debbie Anderson *L2/06-268, L2/12-143, L2/12-169 -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Mark E. Shoulson Sent: Wednesday, January 30, 2013 5:27 PM To: unicode@unicode.org Subject: External Link (Was: Spiral symbol) I found myself the other day looking once again for the character representation of the external link sign so prevalent on Wikipedia and Mathworld and other sites. There has got to be enough evidence for encoding something like this. And I've seen a proposal for it too! http://www.unicode.org/review/pr-101.html and the proposal itself at http://www.unicode.org/review/pr-101-06268-ext-link.pdf and proposed by our own Karl Pentzlin back in 2006. What has happened with it since? Still in review? I don't see it on the Pipeline page. Can we revive this proposal, if indeed it needs reviving? I think this character needs encoding. ~mark
Re: External Link (Was: Spiral symbol)
On 1/31/2013 5:55 PM, Mark E. Shoulson wrote: So if a generic external link symbol isn't acceptable, I definitely see reason for at least the adoption of box-with-arrow, possibly *called* EXTERNAL LINK or something. Make that: possibly aliased or annotated as one of the symbols used to indicate an external link. A./
Re: Word reversal from Adobe to Word
How come I'm not surprised to see the problem traced to an RTF format incompatibility. Trying to figure out which parts of the RTF spec to support, and when, is nearly impossible... A./ On 2/7/2013 8:08 AM, Murray Sargent wrote: If you include a {\fonttbl...} entry that defines \f0 as a Hebrew font, Word displays it correctly. For example, include {\fonttbl{\f0\fswiss\fcharset177 Arial;}} as in {\rtf1{\fonttbl{\f0\fswiss\fcharset177 Arial;}} \pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE} } This displays as קודמ Murray -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Dreiheller, Albrecht Sent: Thursday, February 7, 2013 7:33 AM To: Raymond Mercier; unicode@unicode.org Subject: RE: Word reversal from Adobe to Word Raymond, If I have a Hebrew text displayed in Adobe Acrobat, I can select part of it and paste it into Word. The trouble is that while individual characters are correctly displayed, the order is reversed. Thus if I have in Acrobat קודמ (meaning 'prior'), when pasted into Word I get םדוק The Windows clipboard is a multi-channel medium, i.e. several different data formats may be supplied at the same time by the sending application. The receiving application may choose one of these formats. Using a clipboard debugging tool, I see that Word fills up to 18 formats, like 000D Unicode Text (10 Bytes), C090 Rich Text Format (5815 Bytes), C10E HTML Format (3641 Bytes), whereas Adobe fills only 6 formats, e.g. 000D Unicode Text (11 Bytes), C090 Rich Text Format (178 Bytes). In both cases, the Unicode Text format contains the sequence U+05E7, U+05D5, U+05D3, U+05DE in logical order. When paste is used in Word, a high-level format is preferred by default, so I suppose the RTF format is the problem here. Word creates an RTF sequence like {\ltrch\fcs1 \af220\afs40\alang1033 \rtlch\fcs0 \f220\fs40\lang1037 \langnp1033\langfenp2052\insrsid13502069\charrsid6162033\'f7\'e5\'e3\'ee}} N.B. \'f7\'e5\'e3\'ee is the CP1255 byte sequence for the Hebrew word above. Adobe produces this RTF sequence: \pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE} which is the right character sequence, but seems to be misunderstood by Word. A solution is to use the Word command Paste contents ... (it might be necessary to add it with Customize), and then choose unformatted Unicode text from the format list. Albrecht.
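As a quick way to verify Murray's fix, a sketch: write his corrected RTF (verbatim from his message) to a file and open it in Word. In RTF, \uN gives a code point in signed decimal, with the following \'XX byte serving as a fallback for non-Unicode readers; the file name below is arbitrary.

    # Assemble Murray's corrected RTF, with the \fonttbl entry that
    # declares \f0 with a Hebrew-capable charset, and save it for testing.
    rtf = (r"{\rtf1{\fonttbl{\f0\fswiss\fcharset177 Arial;}}"
           r" \pard\plain\ql\f0\fs20"
           r" {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE} }")
    with open('hebrew-test.rtf', 'w', encoding='ascii') as f:
        f.write(rtf)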
Re: s-j combination in Unicode?
On 2/13/2013 1:59 PM, Andries Brouwer wrote: [Concerning the g-slash, r-slash, eth-slash symbols, they can be coded using U+0337 as g̷ r̷ ð̷. Unicode generally does not decompose slashed symbols - so for example, o-slash does not have a decomposition using U+0337. The UTC may not feel bound by this as a precedent, but it would mean that such encoding could definitely be proposed, and probably should be, to get any decision to decompose these explicitly on the record. A./
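The asymmetry described here is easy to see with Python's unicodedata - a sketch, using o-slash versus the mathematical negation slash as examples:

    import unicodedata

    # o-slash (U+00F8) is atomic: no canonical decomposition at all,
    # so o + COMBINING SHORT SOLIDUS OVERLAY remains a distinct string.
    assert unicodedata.decomposition('\u00F8') == ''
    assert unicodedata.normalize('NFC', 'o\u0337') != '\u00F8'

    # By contrast, math negation does compose: = + U+0338 yields U+2260.
    assert unicodedata.normalize('NFC', '=\u0338') == '\u2260'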
Re: s-j combination in Unicode?
On 2/13/2013 1:24 PM, Stephan Stiller wrote: It looks like something that has not been encoded. What is the reason for not having a true combining grapheme joiner, one that overlays graphemes? Or a code point that instructs that the preceding (or following, I guess) code point should be printed at this position but otherwise be treated as having zero width? The reason is that Unicode is not a text layout language. A./
Re: s-j combination in Unicode?
On 2/13/2013 2:58 PM, Buck Golemon wrote: On Wed, Feb 13, 2013 at 2:30 PM, Asmus Freytag asm...@ix.netcom.com wrote: On 2/13/2013 1:24 PM, Stephan Stiller wrote: It looks like something that has not been encoded. What is the reason for not having a true combining grapheme joiner, one that overlays graphemes? Or a code point that instructs that the preceding (or following, I guess) code point should be printed at this position but otherwise be treated as having zero width? The reason is that Unicode is not a text layout language. A./ That addresses his second question, but not the first. Actually --- not. It is intended to address the entire quoted section. A grapheme combining character would only be usable if a normalized combined character was also defined, along with the mapping between the combined characters and the un-combined characters with combiner. Where do you get that? In other words, adding such a thing wouldn't solve the problem you've posed (adding a combined sj character) since combining characters are (as I understand it) intended to be ephemeral and only fully combined characters are intended for communications. That understanding of combining characters does not seem to be backed up by anything in the standard. A./
Re: s-j combination in Unicode?
On 2/13/2013 2:56 PM, Leo Broukhis wrote: On Wed, Feb 13, 2013 at 11:31 AM, Andries Brouwer a...@win.tue.nl wrote: I wondered how to code an s-j overstrike combination in Unicode. I'd write s ZWJ j and use a font that has the appropriate ligature. These features in Unicode aren't intended as just hacks to get the right appearance. The idea is that you can encode the intention of the author more directly. Unless the overstruck sj form happens to be nothing more than a fancy presentation of an otherwise normal s, j sequence. A ZWJ doesn't let you indicate whether you want an overstruck form or some other fused form; that choice would reside in the font - making the solution font-dependent - which doesn't quite seem the correct approach. Otherwise, why not use the BS control code? In the old days of teletypes that would nicely produce this overstruck effect. No need to define another format character if all you want to do is restore the semantics of that old control character. A./
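To make the two positions concrete, here is a sketch of the sequence Leo suggests; whether it renders as an overstruck ligature, some other fused form, or just plain sj depends entirely on the font - which is exactly the objection above.

    # s + ZERO WIDTH JOINER + j: a request for a more connected rendering,
    # with the actual shape left to the font's ligature tables.
    seq = 's\u200Dj'
    print(len(seq))                    # 3 code points
    print(seq.replace('\u200D', ''))   # 'sj' - the likely fallback rendering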
Re: s-j combination in Unicode?
On 2/13/2013 6:00 PM, Leo Broukhis wrote: Everything dialectology-related is a fancy presentation of the phoneme attribute markup. Well, that's one view. A./ Leo On Wed, Feb 13, 2013 at 5:51 PM, Asmus Freytag asm...@ix.netcom.com wrote: On 2/13/2013 2:56 PM, Leo Broukhis wrote: On Wed, Feb 13, 2013 at 11:31 AM, Andries Brouwer a...@win.tue.nl wrote: I wondered how to code an s-j overstrike combination in Unicode. I'd write s ZWJ j and use a font that has the appropriate ligature. These features in Unicode aren't intended as just hacks to get the right appearance. The idea is that you can encode the intention of the author more directly. Unless the overstruck sj form happens to be nothing more than a fancy presentation of an otherwise normal s, j sequence. A ZWJ doesn't let you indicate whether you want an overstruck form or some other fused form; that choice would reside in the font - making the solution font-dependent - which doesn't quite seem the correct approach. Otherwise, why not use the BS control code? In the old days of teletypes that would nicely produce this overstruck effect. No need to define another format character if all you want to do is restore the semantics of that old control character. A./
Re: s-j combination in Unicode?
On 2/14/2013 5:38 AM, Andries Brouwer wrote: I asked: I wondered how to code an s-j overstrike combination and learned from Karl Pentzlin about n3555.pdf, where Michael Everson proposes U+1E0A2 LATIN SMALL LETTER ESJ (and many other characters). This document is from 2008. What is the status? From the document record, it seems that http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4081.pdf now replaces 3555, but the newer document contains only a subset of the characters. Doc. 3555 was considered during meeting 53 of ISO/IEC JTC1/SC2/WG2 but only reached the state where there was a request for feedback. Without digging deeper, it appears as if the repertoire that contains the proposed overstrike was not followed up, while the work concentrated on Teuthonista. (See the mention of N3555 in http://std.dkuug.dk/JTC1/SC2/WG2/docs/n3703-AI.pdf) Therefore, to get these letters encoded would require resubmitting the sections from 3555 that contain them and restarting the discussion in UTC and WG2. But I'm sure you'll eventually hear from a direct participant. On Wed, Feb 13, 2013 at 02:24:12PM -0800, Asmus Freytag wrote: On 2/13/2013 1:59 PM, Andries Brouwer wrote: [Concerning the g-slash, r-slash, eth-slash symbols, they can be coded using U+0337 as g̷ r̷ ð̷. Unicode generally does not decompose slashed symbols - so, for example, o-slash does not have a decomposition using U+0337. The UTC may not feel bound by this as a precedent, but it would mean that such encoding could definitely be proposed, and probably should be, to get any decision to decompose these explicitly on the record. Yes, o-slash is not decomposed, so it is different from o followed by U+0337. But otherwise: are the characters with names starting with COMBINING not intended to be used as combining diacriticals? Wouldn't use such as the above be precisely as intended? Some of the slashes are used; for example, 0338 is used with mathematical symbols for denoting negation. It is just that o-slash, the most widely used representative of the *letters*, was never decomposed, so to start now would make the treatment of letters uneven. [However, n3555.pdf also contains U+1E067 LATIN SMALL LETTER ETH WITH STROKE U+1E06E LATIN SMALL LETTER G WITH DIAGONAL STROKE U+1E096 LATIN SMALL LETTER R WITH DIAGONAL STROKE and, e.g., U+1E0AE LATIN SMALL LETTER NASAL Y for y with ogonek. At first sight I do not see the a-ring-ogonek here. Does it occur elsewhere?] You could try to search for it by constructing the likely character name by analogy with existing characters. A./ Andries
Re: s-j combination in Unicode?
On 2/15/2013 11:59 PM, Andries Brouwer wrote: On Fri, Feb 15, 2013 at 10:56:17PM -0600, Ben Scarborough wrote: On Feb 16, 2013 02:13, Andries Brouwer wrote: The fragment of text I showed was not from dialectology, but just from a novel written in Elfdalian. The symbols are meant to be those of ordinary orthography. Does that mean there's also a capital S-J? Probably, in entirely capitalized text. At sentence start I see capitalized I-ogonek, O-ogonek, U-ogonek, Å-ogonek in ordinary text. I have only seen the s-j following d or t, not word-initially. Andries That would make it analogous, in a way, to German ß. The minute things show up in real orthographies, the pressure to handle ALL CAPS exists. The wider use an orthography has, the stronger that pressure is, of course. A./
Re: s-j combination in Unicode?
On 2/16/2013 1:38 AM, Stephan Stiller wrote: That would make it analogous in a way to German ß. The minute things show up in real orthographies the pressure to handle ALL CAPS exists. The question then is whether you'll find SJ or an overlaid S/J. Or how a Swede would instinctively handle this, in the absence of an example of a consistently applied rule. There's a question, first, of whether there's a difference between s+j and simple sj. Is it just to mark a different pronunciation of what would be sj in standard Swedish, or are the two contrasting in Elfdalian as well? I suspect that the fallback would be SJ, if nothing else is available, but currently, anybody using s+j would use private fonts, and thus there's not necessarily a need for a fallback. This is different from German usage, where telegraphs and typewriters were instrumental in creating and cementing the need for a fallback. The German-style fallback is painful enough as it is; let's make sure it's not Unicode creating the bottleneck. (By the way, for those finding the German rule to write SS unsatisfactory: It's hard to come by an actual minimal pair. MASSE - mass or measurements? See, not hard at all. With the new orthography, ss vs. ß affects the pronunciation of the preceding vowel. It's irritating to see SS because you have to override that rule when you know that the word in lowercase was pronounced differently. And, as Andreas has painstakingly done, you can collect a nearly infinite array of examples where users, in rule-bound Germany(!), simply continue to ignore that rule. A./ PS: And it's not like capitalization is otherwise invertible – the capitalization bits contain information as well, after all.) Besides the point a bit. Even though it's true that mixed case carries information that's lost in all upper or all lowercase, the issue is a bit different, as it is not focused on one letter.
Re: s-j combination in Unicode?
On 2/16/2013 7:04 AM, Andries Brouwer wrote: [BTW Is the fact that o-slash is not decomposed not entirely analogous to the fact that i is not decomposed? I would say that neither gives an indication of how symbols involving a combining dot or combining slash are handled in general.] Why don't you just take the precedent for what it is and make your proposal accordingly? Some decisions that went into Unicode could perhaps have come out differently, but history says they didn't, and we are stuck with them. Changing horses in mid-stream helps nobody. A./
Re: s-j combination in Unicode?
On 2/16/2013 7:04 AM, Andries Brouwer wrote: I found Diauni.ttf at http://www.thesauruslex.com/typo/dialekt.htm (swedish) http://www.thesauruslex.com/typo/engdial.htm (english) It has landmålsalfabetet at E100-E197 (lower case only) and s-j at E19F, S-J at E1A5, with Y-ogonek, Å-ogonek, G-slash, R-slash, Ð-slash nearby. So you have evidence that the uppercase form is implemented, if not yet a citation of actual use. Since the latter is expected to be rare, I personally would be comfortable with making a code point for it, so that fonts like this, which are actually used, can be mapped to Unicode w/o forcing people into weird fallbacks over a rare character. A./
Re: s-j combination in Unicode?
On 2/16/2013 10:48 AM, Stephan Stiller wrote: the issue is a bit different, as not focused on one letter While we're splitting hairs: Word- or larger-level all-caps /does/ normally make a one-letter difference. When we undo all-caps, one can /normally/ lowercase everything of the word except the first letter. The capitalization bit of that one letter is sometimes unclear. And usually not totally sense-destroying to a human reader with context available. But these fallbacks allow clearly misspelled words to appear, not just miscapitalized ones. That's huge. A./
Re: s-j combination in Unicode?
On 2/16/2013 10:48 AM, Stephan Stiller wrote: the issue is a bit different, as not focused on one letter While we're splitting hairs: Word- or larger-level all-caps /does/ normally make a one-letter difference. When we undo all-caps, one can /normally/ lowercase everything of the word except the first letter. The capitalization bit of that one letter is sometimes unclear. Sorry, not what I meant. It can hit any letter of the alphabet. The ß issue hits only one specific letter. A./
Re: German »ß«
On 2/16/2013 12:06 PM, Philippe Verdy wrote: 2013/2/16 Stephan Stiller stephan.stil...@gmail.com: Of course in my worldview, all-caps writing is deprecated :-) This is a presentation style which makes words more readable in some conditions, notably on plates displayed on roads (cities are extremely rarely written in lowercase, as this is more difficult to read from far away when driving). Capitals anyway do not exclude preserving distinctions (so there's a capital Eszett which preserves the distinction with SS, and accents are still present, even if they are difficult to distinguish from far away on roads). This may be a French thing. A./ For the US, see the discussion here: http://www.studio360.org/2011/jan/21/design-real-world/ For Germany, look at http://www.ace-online.de/fileadmin/user_uploads/Der_Club/Presse-Archiv/Bilder/Verkehr/Autobahn/Autobahn_01.jpg or google Autobahnschilder for more. PPS: Sweden has quite a bit of UPPERCASE, but seems to use mixed case for some purposes (such as legends on warning signs and minor destinations on road signs). Deprecation only concerns long texts, presented in multiline paragraphs, for which capitals make the text less easy to read.
Re: s-j combination in Unicode?
On 2/16/2013 9:55 PM, Stephan Stiller wrote: from earlier: Otto Scholz Oops, sorry. Otto Stolz. And usually not totally sense-destroying to a human reader with context available. But these fallbacks allow clearly misspelled words to appear, not just miscapitalized ones. That's huge. I'm all for a capital version of ß and other such letters, but you may be talking in extremes too much. Never! ;) Actually, the question that started this particular discussion is most likely moot, because Andries has located, at a minimum, an existing font implementation of capital S+J. That seems to indicate that, again at the bare minimum, there are other people who think that SJ is not the way to render this. A./
Re: s-j combination in Unicode?
On 2/17/2013 12:30 AM, Stephan Stiller wrote: But I have to ask one more thing: Since the latter is expected to be rare, I personally would be comfortable with making a code point for it, so that fonts like this, which are actually used, can be mapped to Unicode w/o forcing people into weird fallbacks over a rare character. Why would that be so? I thought your normal way of doing things is to require attestation of a particular usage. If a character is more frequent, it's more likely we're convinced of its being used in a particular way. Law of diminishing returns. I think it's a waste of everybody's time to even contemplate forcing fallback transformations (which are a pain to program) when a perfectly straightforward capital form can be deduced, and has been deduced (at least by font creators - we don't know what user requests they based their work on). Casing irregularities are expensive compared to adding a code point for a rare character. A./
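The German ß makes a handy sketch of why such casing irregularities are expensive: full uppercasing changes the length of the string, and the round trip is lossy (Python shown; the behavior follows Unicode's full case mappings).

    # Length-changing, non-invertible fallback casing:
    s = 'straße'
    print(s.upper())            # 'STRASSE' - 7 characters from 6
    print(s.upper().lower())    # 'strasse' - the ß did not survive
    # A dedicated capital form avoids the lossy fallback:
    print('\u1E9E'.lower())     # 'ß' (U+1E9E LATIN CAPITAL LETTER SHARP S)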
Re: German »ß«
On 2/16/2013 11:19 PM, Julian Bradfield wrote: On 2013-02-17, Philippe Verdy verd...@wanadoo.fr wrote: True lowercase letters are causing problems on road sign indicators on roads with high speed : they are hard to read and if the driver has to look at them for one more second, he does not look at the road. AS I SAID, empirical evaluation by those who had good cause to care about the issue indicates the opposite, that people take longer to read all caps (as is also the case in normal text). This evaluation was done specifically for high speed roads. It included live testing on one motorway. Would not be the first time that Mr. Verdy's statements are in an interesting relation to empirically determined results. :) A./
Re: New Canonical Decompositions to Non-Starters
On 2/17/2013 8:20 AM, Richard Wordingham wrote: Is there any guarantee that U+E4567 will not have a canonical decomposition mapping to U+0F73 TIBETAN VOWEL SIGN II, U+E4568? If so, where is it published? I thought we had guarantees that new canonical decompositions to non-starters would not be created (to U+0F71, U+0F72, U+E4568 in this case), but I cannot find it. This conceivable decomposition mapping appears to wriggle through a loophole because U+0F73 is a starter, i.e. has canonical combining class 0. Richard. Let me see whether I follow that. If you encode a new character, it can have a decomposition only if that decomposition also contains at least one new character. Otherwise, you might have existing data that contains that decomposition but wasn't previously normalizable with NFC (and now would be). Now, does it make a difference whether that required new character in the decomposition is the first or the second? (Remember, all decompositions are defined to be pairs, except when they are singletons. If a one-to-many mapping is desired, enough intermediate, partially composed characters must exist to allow this longer mapping to be represented as a chain of simpler mappings.) And if it does, can one point to a stability guarantee where that is expressed? Is that what you are asking? A./
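For reference, the U+0F73 situation Richard alludes to is easy to inspect; a sketch with Python's unicodedata:

    import unicodedata

    # U+0F73 has canonical combining class 0 (a starter) but decomposes
    # into a pair of non-starters, so it is excluded from composition:
    print(unicodedata.decomposition('\u0F73'))  # '0F71 0F72'
    print(unicodedata.combining('\u0F73'))      # 0
    print(unicodedata.combining('\u0F71'))      # 129 - a non-starter
    # NFC therefore decomposes U+0F73 rather than composing the pair:
    print(unicodedata.normalize('NFC', '\u0F73') == '\u0F71\u0F72')  # True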
Re: German »ß«
Nice collection of links here, Neil. A./ On 2/17/2013 10:52 AM, Neil Harris wrote: On 17/02/13 10:48, Philippe Verdy wrote: I was not citing empirical results but things that are regulated by legislation. And your existing empirical results are just informal tests ignoring important parts of the population of drivers... Here are some excellent articles about the evidence-based approach that led to the development of current road signage in the United States. http://www.nytimes.com/2007/08/12/magazine/12fonts-t.html?pagewanted=1_r=0 and this, on research on the legibility of mixed/lower case vs. ALL CAPS: http://www.microsoft.com/typography/ctfonts/WordRecognition.aspx Regarding Clearview and older drivers, this: http://deldot.gov/information/pubs_forms/manuals/de_mutcd/pdf/20080731061923147.pdf is particularly interesting: the take-home quote is this: The greatest improvement in legibility distance afforded by Clearview was realized by older drivers when viewed under headlamp illumination during nighttime conditions (an increase in legibility distance of between 6.0 percent and 6.8 percent) -- Neil
Re: Private Use Area
On 2/18/2013 5:43 AM, Erkki I Kolehmainen wrote: This looks quite clear to me. If I create something and somebody else uses my creation in the intended context, he agrees to my definition. His agreement is private, outside the standard, since the same code points may represent a multitude of different meanings. It may also be the result of a negotiating process within a special purpose user group. William, when you write a standard, you can't avoid the use of technical terms. One of those is the meaning of private used here, as Erkki has so ably explained. You had written: ... about ... private agreement. That is, I feel, a somewhat unfortunate way of explaining the situation. You do not need the agreement of anybody to define your assignments in the Private Use Area. Certainly, if someone then wants to use the font and access an alternate glyph, then he or she needs to go along with what you have assigned in order to use the font. To me, that sounds like following the documentation of the font rather than being an agreement. In order to interpret the characters - Unicode's term for any operation other than blind transactions, like copying a string - you have to follow some definition of which code point goes with which encoded character. You correctly note that a font, in a way, provides a private specification (private as seen from the point of view of the original standard, which remains ignorant of it). No user can correctly use your font without that specification, whether you make it available as a document or whether the user reverse-engineers it by looking at the font in an editor and recognizing the shapes. By agreeing to follow your specification (and not someone else's) your user now has a private agreement with you. Simple as that. Matters may seem more complex because most software supports, to a degree, a generic treatment of private use characters for the purpose of rendering only. That is, if the rendering requirements consist of left-to-right layout of boxes, with their width defined in the font, then any font using your private assignment of shapes can be rendered without the software needing to be modified. That default treatment is, of course, very useful, but it is not required by the Unicode Standard. In fact, it's a (usually implicit) private specification by the maker of the software, and by designing a font that takes advantage of it, you are now in a private agreement with the software maker. Note that the default treatment for sorting, capitalization, and a host of other functions is not going to work for you, or most users for that matter, because, unlike the case of fonts, there's no widely supported data format for communicating the details of how to interpret a character outside rendering. A./
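To illustrate that default treatment, here is a sketch of what generic software can know about a Private Use code point without any private agreement (Python's unicodedata shown):

    import unicodedata

    c = '\uE000'  # first code point of the primary Private Use Area
    print(unicodedata.category(c))           # 'Co' - private use, nothing more
    print(c.upper() == c, c.lower() == c)    # True True - no case mappings
    print(unicodedata.name(c, '<no name>'))  # '<no name>' - no character name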
Re: German »ß«
Toll, eine dreisprachige Nachricht! Wer macht weiter? [Great, a trilingual message! Who's next?] A./ On 2/18/2013 10:25 PM, Charlie Ruland wrote: Ne vous moquez pas de monsieur Verdy : il s’agit là du dernier des Mohicans polymathes ! ☺ [Don't make fun of Mr. Verdy: he is the last of the polymath Mohicans!] Charlie Op zondag 17 februari 2013 schreef Asmus Freytag: [On Sunday, 17 February 2013, Asmus Freytag wrote:] Would not be the first time that Mr. Verdy's statements are in an interesting relation to empirically determined results. :) A./
Re: Capitalization in German
On 2/19/2013 9:35 AM, Leif Halvard Silli wrote: Werner LEMBERG, Tue, 19 Feb 2013 10:48:52 +0100 (CET): Otto Stolz wrote: Here is a minimal pair to illustrate that point: Er hat in Moskau liebe Genossen. Er hat in Moskau Liebe genossen. which translates to: In Moscow, he has dear comrades. In Moscow, he has enjoyed love. A classic joke is this pair of newspaper headlines: Der Gefangene floh Der gefangene Floh which translates to The Prisoner Escaped The Caught Flea And in this case, the prosody in German is *exactly* the same. So in this case, the imaginary newspapers made use of written forms that they perhaps would not have used orally, if instead of newspapers they had been radio channels. The general subject here is the fact that “outer” things, such as the (effect of the) “look” of the language, affect the “inner” things, namely how we use the language. In the earlier posts on the readability of road signs there was a link to a paper that reported a research result that is interesting here. People read more slowly when a written form has a non-standard pronunciation (even for well-known words) and faster when it has a standard pronunciation (even for unknown words). The example given was hint/rint vs. pint. Interesting, that. Also, ransom-note capitalization is the hardest to read of all forms of capitalization. Take that, CamelCase :) A./
Re: Private Use Area
On 2/19/2013 2:26 PM, Andries Brouwer wrote: On Tue, Feb 19, 2013 at 09:55:09AM +0100, Elbrecht wrote: The academic TITUS project occupied U+E000 thru U+EFFF of the Private Use Area ... The primary Private Use Area is U+E000 .. U+F8FF today. However, Unicode 1.0 defined a Private Use Area U+E800 .. U+FDFF. The Linux keyboard driver uses the Private Use Area as it was at the time of Unicode 1.0 for internal purposes, and assumes that Unicode characters have different values. Since Unicode changed its mind, this is no longer true. The very early Unicode versions aren't compatible with later versions (for more than one reason, but ultimately the cause was changes forced upon the design by various parties). If something claims conformance to Unicode 1.0 at this stage, it should be investigated as to whether it isn't overdue for an update... A./
Re: Rendering Raised FULL STOP between Digits
Richard, the situation with the raised decimal point is a mess in Unicode. I know that Mark thinks we have too many dots, but the reason this case is a mess is that the unification with U+002E is both unworkable in practice and counter to precedent. The precedent in Unicode is to separately encode characters when they have different appearance, except if, fundamentally, it's the same character and the difference in appearance can be determined unambiguously by context. There are two primary kinds of context that Unicode admits here. One is based on surrounding text (such as positional forms of Arabic letters). The other is overall stylistic context, such as a font choice (such as upright vs. slanted integral symbols). When the appearance of a character is different based on the author's intent, and two (or more) different appearances can occur in the same document with different significance, then the usual response by Unicode has been to encode explicit characters. (The phonetic character blocks are full of examples of this, like the lower case a without hook or the g with hook, both of which need to be distinguishable from other forms of these letters in phonetics.) So, if a British document can use both inline dots and raised dots, then you can't assign a single font to cover both. Well, the thought was, software might recognize the numeric context. However, as you've pointed out, section numbers are numeric and do not have the raised dot. In fact, as far as such documents are concerned, the raised dot itself can be used by the reader to distinguish decimal numbers from other uses of numbers separated by dots (something not possible in other languages that lack this convention). So, on the face of it, the choice to unify the raised decimal dot with 002E violates the encoding model, by pushing semantic distinctions into some kind of markup. On top of that, it's not really practical to expect to have to mark up either all decimal numbers or all section numbers with separate styles or font bindings. That's something not required anywhere else. So far, that's bad enough. Next, you have the issue that Unicode refused (quite properly) to encode a generic decimal separator character, the appearance of which was supposed to vary on external context (like locale or a document-global style). This suggestion had been intended to allow numerical expressions to be cut and pasted between documents in different languages with all numbers formatted correctly w/o further editing. That is, the same character would appear as either comma or period (or raised period) depending on context. I wrote that I agreed with the choice not to encode such a special character for that purpose. However, by not encoding a character for the raised decimal point, Unicode did an about-face and made 002E a limited-purpose version of a decimal separator. Suddenly, there is a character that is supposed to have different appearance based on context - on the line for US documents, off the line for British documents. This directly violates the precedent established by the refusal to encode the generic decimal separator. What can be done? I believe the Unicode Standard should be fixed by explicitly removing all suggestions in the text that the raised decimal point is unified with 002E. Second, the standard should be amended by identifying which character is to be used instead for this purpose. It might be something like 00B7.
In that case, 00B7 would have to have properties that effectively produce the correct result in numeric context, while leaving non-numeric context unchanged. I believe that is entirely possible, and non-disruptive, insofar as numeric use of 00B7 does not exist for any purpose other than showing a raised decimal point (I suspect there are documents in the wild that already use this character for that purpose). If that alternative is deemed not acceptable, the only remaining choice would be to add a new character. (I would recommend that only as the last resort). A./
Re: Rendering Raised FULL STOP between Digits
On 3/9/2013 1:51 PM, Jukka K. Korpela wrote: 2013-03-09 21:30, Asmus Freytag wrote: I believe the Unicode Standard should be fixed by explicitly removing all suggestions in the text that the raised decimal point is unified with 002E. That would be a good move if agreement can be found on the recommended coding of the middle dot. Second, the standard should be amended by identifying which character is to be used instead for this purpose. It might be something like 00B7. There are several reasons why that would be a bad move. First, 00B7 is a seriously overloaded character already. As is 002E. Overloading characters is not ipso facto a bad thing. The standing precedent in Unicode recognizes the need to primarily support rendering differences that cannot be determined absent markup. Only in very limited situations are characters of identical rendering behavior encoded separately on the basis of properties alone. The most common case of this exception is the dual coding of non-breaking characters (space, dash, etc.). A special exception for bidi properties exists for Arabic digits. However, many characters, like dashes and dots, have multiple uses to the human writer and reader, and despite some differences in processing (line breaking etc.) the general approach is to overload the character and let humans (and software) disambiguate it from context - which at least humans can do, as long as it renders properly. (The latter is the reason, in my view, why Unicode tends to disunify primarily for rendering.) Second, it’s a middle dot, which may differ from a raised dot. Mixed-language documents may well contain both British number notations and occurrences of middle dot in various contexts, and it should be possible to make them appear as different. I would agree with that concern if you could demonstrate, with the usual evidence, that there is a distinction. Note that 8859-1 contains 00B7 at B7, and this will have been used by anyone needing a raised dot and not having a font that magically supplies one from context. (As James and Richard have pointed out, that kind of font technology does not exist, and there seems to be no interest by vendors in supplying it - hence underscoring the need for a different character.) Due to another unfortunate unification (or semi-unification), 0387 (Greek ano teleia) has been defined as canonically equivalent to 00B7, with the note “00B7 is the preferred character”. This means that glyph design for 00B7 needs to take this into account, even though Greek ano teleia isn’t really a middle dot (rather, an upper dot, appearing roughly at the x-height of a font, rather than at half of x-height, which is the natural position for a middle dot). This appears to be another possible mistake. However, the Greek script does provide a context which could be used to select the ano teleia appearance and properties (unless you tell me that the character appears in Greek surrounded by non-Greek alphabet characters). The code chart comment on 002E (full stop) says: “may be rendered as a raised decimal point in old style numbers”. But checking a few fonts that use the OpenType feature for old style numbers (onum), I was unable to find any that has such a glyph selectable that way. Yes, this comment makes no sense. It was a pious wish by the character encoders during the early days of Unicode. It's not been picked up by anyone in 20 years, so far as we know, which means it should be recognized for what it is: an evolutionary dead branch which needs to be trimmed.
I wonder what character and techniques British publishers use to produce notations with a raised dot. Is it 002E, with typographic tools used to raise it, or is it 00B7? I agree, data would help settle this. Richard? I believe that is entirely possible, and non-disruptive, insofar as numeric use of 00B7 does not exist for any purpose other than showing a raised decimal point I’m afraid there is mathematical use of 00B7. It is tempting to use it as a multiplication dot (as in 2 · 2, meaning the same as 2 × 2), especially if you are limited to using the ISO Latin 1 repertoire or you find 00B7 essentially simpler to type than 22C5 (dot operator). Standards have been vague or ignorant of the issue (now ISO 80000-2 explicitly defines the multiplication dot as 22C5, but I wonder how many people know about this). For mathematical notation, the mathematical publishers are well organized and have agreements on how to handle issues like that (hence the ISO standard). The fact that some individual authors might have used 00B7 as a fallback (or out of ignorance) is not really relevant here. For rendering it's not an issue, and for automatic parsing it's like any other typo. Especially if the middle dot is used as a multiplication symbol without spaces around it, confusion would be guaranteed. Human readers don't read the code points. If that alternative is deemed not acceptable, the only remaining choice would be to add a new character.
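Incidentally, the canonical equivalence mentioned above is a singleton, which makes the constraint on 0387 easy to demonstrate - a sketch with Python's unicodedata:

    import unicodedata

    # U+0387 GREEK ANO TELEIA decomposes (as a singleton) to U+00B7,
    # so every normalization form folds the two together - normalized
    # text cannot keep them apart for glyph-design purposes.
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
        assert unicodedata.normalize(form, '\u0387') == '\u00B7'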
Re: Rendering Raised FULL STOP between Digits
On 3/9/2013 3:41 PM, Philippe Verdy wrote: 2013/3/9 Asmus Freytag asm...@ix.netcom.com: This appears to be another possible mistake. However, the Greek script does provide a context which could be used to select the ano teleia appearance and properties (unless you tell me that the character appears in Greek surrounded by non-Greek alphabet characters). And even this basic rule will be defeated in math formulas, where MIDDLE DOT 00B7 has been used as a common multiplication operator, along with numbers and variables named after Greek letters. Of course Unicode now has distinctive symbols for maths, but that's another story. Right, because 22C5 exists for that purpose. There's no reliable way to contextually infer an ano teleia rendering (at the middle of the x-height instead of the middle of the M-height, the intended rendering for 00B7, which also works when the middle dot is used as an appended diacritic after the letter L/l in Catalan) where it would break the common appearance between digits with the intended meaning of a multiplication sign, for texts that are not encoded using math operators but legacy Greek letters and 00B7.
Re: Rendering Raised FULL STOP between Digits
On 3/9/2013 5:30 PM, Richard Wordingham wrote: On Sat, 09 Mar 2013 14:41:11 -0800 Asmus Freytag asm...@ix.netcom.com wrote: On 3/9/2013 1:51 PM, Jukka K. Korpela wrote: 2013-03-09 21:30, Asmus Freytag wrote: I wonder what character and techniques British publishers use to produce notations with a raised dot. Is it 002E, with typographic tools used to raise it, or is it 00B7? I agree, data would help settle this. Richard? I'm not in the publishing business, but here's what I know. The general feeling seems to be that computers don't do proper decimal points, and so the raised decimal point is dropping out of use. In so far as character coding is involved, the raised decimal point seems to be produced using U+00B7, and I was taken aback by the statement that that was not the correct character. This would not be the first instance of new writing/printing/processing technology feeding back onto how people write, lay out text, or even sort. Whenever new technology becomes pervasive but doesn't support certain features, it can create pressure to remove them. 'The Lancet' reportedly insists on the use of the raised decimal point (http://www.download.thelancet.com/flatcontentassets/authors/artwork-guidelines.pdf) and gives the instructions 'Type decimal points midline (ie, 23·4, not 23.4). To create a midline decimal on a PC: hold down ALT key and type 0183 on the number pad, or on a Mac: ALT shift 9.' On Windows, that gives U+00B7 MIDDLE DOT. That's sensible advice, in a way, because B7 is in 8859-1 and therefore supported in a huge variety of fonts; for practical purposes, the coverage among non-decorative text fonts is pretty near universal. I've googled for advice on how to produce the raised decimal point. Apart from suggestions to use a character picker (generally implying U+00B9), recte: 00B7, the only other method I've seen is a TeX package called 'decimal'. It appears to render '.' as the (raised) decimal point and '\.' as the full stop. That's the closest I've found to raising a full stop. Well, in TeX, you can attach style or markup to any input character and there's no explicit reference to any character encoding, because ultimately, TeX output gets resolved to a combination of glyphs plus positions (that is, you can directly raise or lower any glyph using a TeX macro, without the need for font support). Because of that, TeX fonts don't technically need separate glyphs for dots at different vertical positions relative to the baseline. Regular fonts might reuse the actual sequence of instructions for drawing the dot, but would still expose separate glyph records containing the different positions. Back in May 1999, John Cowan said on this list 'That is the British decimal-point convention. It can be represented in Unicode plain text with U+00B7 MIDDLE DOT', and no one contradicted him in the thread. Looks like the community voted not to accept the Unicode recommendation for using formatting magic on 002E, so this reinforces the call to remove such recommendations as misleading and contrary to accepted practice. A./
Re: Rendering Raised FULL STOP between Digits
Richard has given some cogent arguments below. Another counterexample is the use of : to form abbreviations in Swedish. (It's inserted in the word to replace the elided part.) In that use, this punctuation character is suddenly part of a word. To handle the full set of general cases, word recognition has to be plenty smart (and context- or environment-sensitive). The basic, untailored default word breaking algorithm will only ever do the plain vanilla cases right. Basing decisions about encoding of characters on the failings of such simple-minded algorithms is really a non-starter. (The few existing exceptions just prove the rule.) A./ On 3/9/2013 6:52 PM, Richard Wordingham wrote: On Sat, 09 Mar 2013 16:21:17 -0700 Karl Williamson pub...@khwilliamson.com wrote: Rendering is not the only consideration. Processing textual content for 0387 is broken because it is considered to be an ID_Continue character, whereas its Greek usage is equivalent to the English semicolon, something that would never occur in the middle of a word nor an identifier. ID_Continue is for processing things like variable names. How does allowing U+0387 in variable names cause problems in the processing of text? How would ID_Continue allow you to process English «foc’s’le» or «co-operate»? The default word boundary determination has been tailored to give you the right results, and should work for Greek unless you are working with scripta continua, in which case you have massive problems regardless. Note also that word boundary determination is intended to be tailorable, which would allow one to exclude U+00B7 and U+0387 from words or deal with miscoded accents and breathings physically at the start of a word beginning with a capitalised vowel. One should also be able to tailor it to deal with word-final apostrophes - though doing that in the CLDR style could be computationally excessive if the text may contain quoting apostrophes. One might even tailor it to allow Greek «ὅ,τι», depending on whether one wishes to count it as a word. Richard.
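For anyone who wants to see what default (or locale-tailored) word breaking actually does with such cases, a quick sketch - assuming the PyICU bindings are installed, and with illustrative test strings only:

    # Print how ICU's word BreakIterator segments a few awkward cases.
    from icu import BreakIterator, Locale

    bi = BreakIterator.createWordInstance(Locale('sv'))
    for text in ["S:t Eriksgatan", "foc's'le", "co-operate"]:
        bi.setText(text)
        start = bi.first()
        segments = []
        for end in bi:
            segments.append(text[start:end])
            start = end
        print(text, '->', segments)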
Re: Rendering Raised FULL STOP between Digits
On 3/9/2013 6:01 PM, Stephan Stiller wrote: 'The Lancet' reportedly insists on the use of the raised decimal point (http://www.download.thelancet.com/flatcontentassets/authors/artwork-guidelines.pdf) and gives the instructions 'Type decimal points midline (ie, 23·4, not 23.4). To create a midline decimal on a PC: hold down ALT key and type 0183 on the number pad, or on a Mac: ALT shift 9.' On Windows, that gives U+00B7 MIDDLE DOT. And in this linked-to document it's raised to only what appears to be half the x-height; I'd raise a multiplicative dot to half the M-height. Philippe's post just now might relate to that in some way. Math operators are usually aligned on the math center line, wherever that happens to be. However, for fully correct math layout, to require math mode (i.e. global markup selecting math layout) is an appropriate restriction, and some minor infidelities in pure plain text rendering of math are therefore tolerable. Mathematical layout has all sorts of little idiosyncratic rules about spacing etc. that are subtly different from regular text, even though many characters can occur in both environments. That's why high-fidelity math layout needs to first identify those areas of a document where math layout rules apply. In TeX that's handled by using $ as a delimiter; in other environments other conventions (including out-of-band styling) are used. A./
Re: Rendering Raised FULL STOP between Digits
On 3/9/2013 5:47 PM, Philippe Verdy wrote: 2013/3/10 Asmus Freytag asm...@ix.netcom.com: On 3/9/2013 3:41 PM, Philippe Verdy wrote: 2013/3/9 Asmus Freytag asm...@ix.netcom.com: This appears to be another possible mistake. However, the Greek script does provide a context which could be used to select the ano teleia appearance and properties (unless you tell me that the character appears in Greek surrounded by non-Greek alphabet characters). And even this basic rule will be defeated in maths formulas where the MIDDLE DOT 00B7 has been used as a common multiplication operator, along with numbers and variables named after Greek letters. Of course Unicode now has distinctive symbols for maths, but that's another story. Right, because 22C5 exists for that purpose. But still, all the other related symbols are multipurpose and cannot be fixed. They are still usable, including in maths contexts, even if their rendering is not always adequate for maths (where they may become confusable). But these other characters should not need to take maths into consideration, so the MIDDLE DOT 00B7 should still behave correctly in Catalan as a diacritic and as a punctuation mark, and should remain: - between the middle of the M-height and the middle of the x-height (for correct display after l/L, or as punctuation); - but not at the middle of the math line like 22C5 (along with the mathematical MINUS SIGN and PLUS SIGN, the same center used as well for the x-shaped multiplication sign, the division sign... all these maths symbols having stricter presentation constraints). Nothing prevents a mathematical layout program from fine-tuning the display of 00B7 if used as a raised decimal point. (See other post.) A./
Re: Are there any pre-Unicode 5.2 applications still in existence?
On 3/13/2013 10:25 PM, Peter Constable wrote: I would be inclined to assume that there are Unicode 1.1 apps loitering about. What marks an implementation as version X.y? If the implementation doesn't support any processing of characters for which there is a mandatory conformance requirement (such as normalization or bidi), then this is difficult indeed. Even then, implementations are free to handle only a partial repertoire and still claim conformance to a given version. (This subsetting may not be permitted for some required operations.) That said, there are some specific incompatibilities in character assignment for Unicode 1.1 and earlier, which would allow one to detect a Unicode 1.1 implementation (e.g. of Korean) if it indeed implemented the older character assignments for those cases. A Unicode implementation that passively accepts a character stream and does nothing other than ringing a bell upon accepting a U+0007 character would be trivially conformant to *any* version of the Unicode Standard. How would we assign this one a version number? Is it a Unicode 1.0? Or a Unicode 6.3? Or some random version number corresponding to the latest version of the Unicode Standard that happened to be published at the time the application was designed? compiled? released? A./ Peter -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Richard Wordingham Sent: March 8, 2013 1:42 PM To: unicode@unicode.org Subject: Re: Are there any pre-Unicode 5.2 applications still in existence? On Fri, 8 Mar 2013 15:54:57 + Costello, Roger L. coste...@mitre.org wrote: Are there any pre-Unicode 5.2 applications still in existence? Strange question! Unicode 5.2 was released in 2009. Consequently, on the Ubuntu release I'm running, all characters new in Unicode 5.2 are compared equal (and that nearly bit me - fortunately, the C locale was good enough for my purpose). The MS Office I have at home on my Windows 7 machine is Office XP (i.e. 2002), and at work we use MS Office 2007 on Windows XP. I suppose it's possible that these versions have been upgraded to a more recent version of Unicode, but I suspect it's unlikely. Richard.
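For what it's worth, the bell-ringer described above is a handful of lines in any language; a Python sketch of that trivially conformant application:

    # Passively accept a character stream; the only processing performed
    # is ringing the terminal bell upon accepting U+0007.
    import sys

    for ch in sys.stdin.read():
        if ch == '\u0007':
            sys.stdout.write('\a')
            sys.stdout.flush()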
Re: Processing Digit Variants
On the basis of security considerations, it might be necessary to not allow variation selectors to salt strings for parsing. If the string cannot be rejected, then the proper thing might be to parse it as if the variation selectors were not present (on the basis that they do not affect semantics - by design - setting aside Han for the moment, where that story isn't totally clear). Similar considerations would apply to other invisible characters, like redundant directional marks, as well as joiners and non-joiners. Again, if their presence can't be used to reject a string, parsing needs to handle them properly, so that what the user sees is what actually gets parsed. A./ On 3/19/2013 1:45 PM, Richard Wordingham wrote: On Mon, 18 Mar 2013 17:28:30 -0700 Steven R. Loomis s...@icu-project.org wrote: On Monday, March 18, 2013, Richard Wordingham wrote: The issue is rather with emphatically plain text U+0031, U+FE0E, U+0032, U+FE0E. It's the same situation as something like an implementation of LDML number parsing. U+FE0E is not part of a number. I agree that the same arguments are applicable to both parsing and collating, though not necessarily with equal force. Formally, U+0031, U+FE0E, U+0032, U+FE0E seems to be just as much a number as U+FF11 FULLWIDTH DIGIT ONE, U+FF12 FULLWIDTH DIGIT TWO, which the current LDML semantics do treat on an even footing with 12. If the emoji digits had been encoded as new characters, ICU would support them without batting an eyelid. Because the difference does not merit full characterhood, they are encoded by a sequence rather than a single character. Remember, all that U+FE0E does is to request a particular glyph. In a sense, we have 20 new decimal digits, U+0030, U+FE0E to U+0039, U+FE0E and U+0030, U+FE0F to U+0039, U+FE0F. So, why do you consider U+0031, U+FE0E, U+0032, U+FE0E not to be a valid decimal number? 10ZWJ0ZWJ0 would be perfectly reasonable for text likely to be rendered by a cursive Latin font. Identifying such an edge case does not prove that numeric tailoring is broken. An 'edge case' is often just a case that shows that an algorithm that often works has not been thought through thoroughly. Now, as CLDR seems to value speed above perfect correctness, perhaps handling variation sequences will be rejected on that basis. All I was trying to find out on this list was whether U+0031, U+FE0E, U+0032, U+FE0E should be regarded as a proper number. Special characters intended for just one aspect of text processing should not affect other aspects. Unfortunately, a parametric tailoring to ignore irrelevant characters while complying with the UCA is not quite as simple as just ignoring them. The issues arise with the blocking of discontiguous contractions and the possibility that, for example, one might wish to collate character variants differently. On the other hand, ignoring variation selectors by default might be excusable, for they should not occur where they might block canonical reordering (antepenultimate paragraph of TUS 6.2.0 Section 16.4). Richard.
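A minimal sketch of the parse-as-if-absent behavior suggested at the top of this message (the helper name is invented, and only the BMP variation selectors VS1..VS16 plus a few joiners and directional marks are covered, for illustration):

    def strip_ignorables(s):
        # ZWNJ, ZWJ, LRM, RLM, plus VS1..VS16 (U+FE00..U+FE0F).
        ignorable = {'\u200c', '\u200d', '\u200e', '\u200f'}
        return ''.join(c for c in s
                       if c not in ignorable
                       and not ('\ufe00' <= c <= '\ufe0f'))

    print(int(strip_ignorables('1\ufe0e2\ufe0e')))  # -> 12
    print(int('\uff11\uff12'))  # fullwidth digits already parse -> 12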
Re: Rendering Raised FULL STOP between Digits
On 3/21/2013 4:22 PM, Philippe Verdy wrote: 2013/3/21 Richard Wordingham richard.wording...@ntlworld.com: Further, the code chart glyphs for the ANO TELEIA and the MIDDLE DOT differ, see attachment. If they are canonically equivalent, and one is a mandatory decomposition of the other, why do they have differing glyphs? Because the codepoints are usually associated with different fonts? For a more striking example, compare the code chart glyphs for U+2F831, U+2F832 and U+2F833, which are all canonically equivalent to U+537F. This is another good example where a semantic variation selector Philippe, let's not go there. Semantic selectors are pure pseudo-coding, because if the semantic differentiation is needed it is needed in plain text - and then it should be expressible in plain character codes. If you need to annotate text with the results of semantic analysis as performed by a human reader, then you either need XML, or some other format that can express that particular intent. Internal to your application you can design a lightweight markup format using noncharacters, if you wish, but for portability of this kind of information you would be best off going with something widely supported. The number of conventions that can be applicable to certain punctuation characters is truly staggering, and it seems unlikely that Unicode is the right place to a) discover all of them or b) standardize an expression for them. The problem is, even if you could encode some selectors for certain common cases, the scheme would not be extensible to capture other information that pre-processing (or user input) might have provided and which might be useful to carry around in certain implementations - I'm thinking here that the full spectrum of natural language analysis for word-types might be as interesting as certain individual characters. A./
Re: Rendering Raised FULL STOP between Digits
On 3/22/2013 4:02 AM, Philippe Verdy wrote: 2013/3/22 Asmus Freytag asm...@ix.netcom.com: Semantic selectors are pure pseudo-coding, because if the semantic differentiation is needed it is needed in plain text - and then it should be expressible in plain character codes. We don't disagree, that's exactly what I meant here: plain character codes for expressing the semantics, even if many renderers or collators will treat them as ignorable. No, we do disagree. The Unicode model is to put all information about character identity (what you call semantics) into the character code, not into a sequence of character and ignorable attributes. Separating the identity of the character into attributes would be something very novel and best not attempted. A./
Re: Rendering Raised FULL STOP between Digits
On 3/22/2013 4:08 AM, Philippe Verdy wrote: 2013/3/22 Asmus Freytag asm...@ix.netcom.com: If you need to annotate text with the results of semantic analysis as performed by a human reader, then you either need XML, or some other format that can express that particular intent. Absolutely NO. If this encodes semantics, this is part of plain text, I think we are on a different page here. In some ways the Unicode term semantics is very misleading in this context. What Unicode means by this fancy term is the character's identity - not its use. If you use a colon to mark abbreviation (as in Swedish) you are using a colon - the use may be very different from how a colon is used elsewhere, but it does not create a new character. Unicode does not encode the semantics of a sentence or word, but provides a string of characters of known identity that lets a human reader determine the semantics of that sentence or word as unambiguously as if that sentence had been reproduced by analog means - that is, in a nutshell, what Unicode attempts to do. and not part of an upper layer protocol. Notably these characters should be used to alter the default (ambiguous) character properties of the characters they modify, and notably to give them the semantics needed for existing Unicode algorithms (general categories: punctuation, diacritic; word-breaking properties...) Character properties define the *default* behavior of a given character. There are many examples, especially in the context of punctuation, where a character can have different uses. Each use may need a different treatment by readers (or algorithms). To handle some behaviors, you may need complex processing (natural language processing) that mimics what a human reader can do. There are a few exceptions where characters are disunified based on properties - the most principled of these involve properties that can't be modified, such as the bidi property. There are about a dozen characters that look entirely alike (by design and derivation) yet have been disunified based on bidi properties - because bidi properties cannot be overridden. There are a few other cases, usually where a character can be both letter and punctuation, where such disunifications were made based on overridable properties. Here the reason was that this distinction has such a wide reach (and had to be applied by many basic algorithms) that breaking the principle of single character identity can be justified. If a problem is sufficiently severe, then you'd possibly have justification to disunify. If not, then the answer would be outside the scope of character encoding. adding new variants of existing characters like what was done specifically for maths is not a stable long-term solution; solutions similar to variant selectors however are much more meaningful, and will allow for example to make the distinction between a MIDDLE DOT punctuation and an ANO TELEIA, and will also allow them to be rendered differently (even if there's no requirement to do so). This is absolutely not pseudo-coding. Pseudo-coding refers to making distinctions between characters not in their basic encoding, but by means of attributes such as the selectors you are suggesting.
Re: Rendering Raised FULL STOP between Digits
On 3/22/2013 4:16 AM, Philippe Verdy wrote: 2013/3/22 Asmus Freytag asm...@ix.netcom.com: The number of conventions that can be applicable to certain punctuation characters is truly staggering, and it seems unlikely that Unicode is the right place to a) discover all of them or b) standardize an expression for them. My intent is certainly not to discover and encode all of them. But existing characters are well known for having very common distinct semantics which merit separate encodings. This claim would have to be scrutinized, and, to be accepted, would require very detailed evidence. Also, on what principles would you base the requirement to make a distinction in encoding? And this includes notably their use as numeric grouping separators or decimal separators. Well, the standard currently rules that such use does not warrant separate encoding - and the standard has been consistent about that for the entire 20+ years of its existence. Further, all other character encoding standards have encoded these characters as unified with ordinary punctuation. This is very different from the ANO TELEIA discussion, where an argument could be made that *before* Unicode, the character occurred only in *specific* character sets - and that was a distinction that was lost when these character sets were mapped to Unicode. No such argument exists for either middle dot or raised decimal point (except insofar as you could possibly claim that the raised decimal point had never been encoded properly before, but then you'd have to show some evidence for that position). Such common semantic modifiers would be easier to support than encoding many new special variants of characters (which won't even be rendered by most applications, and thus won't be used). That might be the case - except that they would introduce a number of problems. Any modifier that has no appearance of its own can get separated from the base character during editing. The huge base of installed software is not prepared to handle an entirely different *kind* of character code, whereas support for simple character additions is something that will eventually percolate through most systems - that fact makes disunifications a much more straightforward process. Some examples: the invisible multiplication sign, the invisible function sign, Nah, these are not modifiers. They stand on their own. Their invisibility is not ideal, but not any worse than word joiner or zwsp. All of these characters are separators - with the difference that the nature of the separator was determined to be crucial enough to encode explicitly. (And of course, reasonable people can disagree on each case.) Note that Unicode cloned several characters based on their word-break (or non-break) behavior, which is not a novel idea (earlier character encodings did the same with no-break space). Already at that stage the train of having a word break attribute character (what you call a modifier) had left the station. The only way to handle these issues, for better or for worse, is by disunification (where that can be justified in exceptional circumstances). and even the Latin/Greek mathematical letter-symbols which were only encoded for encoding style differences which have occasional but rare semantic differences. For me, adding those variants was really pseudo-coding, breaking the fundamental encoding model, complicating the task for font creators and renderer designers, and greatly increasing the size and complexity of collation tables.
Many of these character variants could have been expressed as a base character and some modifier (whose distinct rendering was only optional), allowing a much easier integration and better use. Because of that the UCD is full of many added variants that are almost never used, and we have to live with encoded texts that persist in using ambiguous characters for the most common possible distinctions. No, for the math alphabetics you would have had to have a modifier that was *not* optional, breaking the variation selector model. There was certainly discussion of a combining bold or combining italic at the time. One of the major reasons this was rejected included the desire to prevent the creation of such operators that could be applied to *every* character in the standard. And, of course, the desire to allow ordinary software to do the right thing in displaying these - the whole infrastructure to handle such modifiers would have been lacking. Further, when you use an italic a in math, you do not need most (or all) software to be aware that this relates to an ordinary a in any way. It doesn't, really, except in text-to-speech conversion or similar, highly specialized tasks. So, unlike variation selectors, there would have been no benefit in using a modifier. A./
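The compatibility decompositions do keep the relationship recoverable for those specialized tasks; a quick check in Python:

    import unicodedata

    italic_a = '\U0001D44E'  # MATHEMATICAL ITALIC SMALL A
    print(unicodedata.decomposition(italic_a))      # -> '<font> 0061'
    print(unicodedata.normalize('NFKC', italic_a))  # -> 'a'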
Re: Rendering Raised FULL STOP between Digits
On 3/22/2013 12:08 PM, Karl Williamson wrote: On 03/21/2013 04:48 PM, Richard Wordingham wrote: For linguistic analysis, you need the normalisation appropriate to the task. Linguistic analysis (in general) being a hugely complex undertaking, mere normalization pales in comparison, so wrapping normalization into the processing isn't going to make it that much more complicated. This is a case where Unicode normalisation generally throws away information (namely, how the author views the characters), Canonical normalization is supposed to take care of distinctions that fit within the same view of the character by the author, and concerns principally distinctions that could be said to be artifacts of the encoding. The same is emphatically NOT true for COMPATIBILITY normalization. whereas in analysing Burmese you may want to ignore the order of non-interacting medial signs even though they have canonical combining class 0. I have found it useful to use a fake UnicodeData.txt to perform a non-Unicode normalisation using what were intended to be routines for performing Unicode normalisation. Fake decompositions are routinely added to the standard ones when generating the default collation weights for the Unicode Collation Algorithm - but there the results still comply with the principle of canonical equivalence. This description seems to capture an implementation technique that could be a shortcut - assuming that normalization wasn't a separate, up-front pass. Some algorithms may need to normalize data in ways that might make adding the standard Unicode normalization aspects into them attractive from a performance point of view (even if not from a maintenance point of view). However, distinguishing U+00B7 and U+0387 would fail spectacularly if the text had been converted to form NFC before you received it. That's a claim for which the evidence isn't yet solid; if it could be made solid, the claim would be very interesting. This is the first time I've heard someone suggest that one can tailor normalizations. Handling Greek shouldn't require having to fake UnicodeData.txt. And writing normalization code is complex and tricky, so people use pre-written code libraries to do this. What you're suggesting says that one can't use such a library as-is, but you would have to write your own. I suppose another option is to translate all the characters you care about into non-characters before calling the normalization library, and then translate back afterwards, and hope that the library doesn't use the same non-character(s) internally. Handling Greek in the context of run-of-the-mill algorithms should probably not be done by folding normalization into them (for the excellent reasons given). But for some performance-sensitive and rather complex types of detailed linguistic analysis I might accept the suggestion as a possible shortcut (over a two-pass process). Given the existence of such a shortcut, modifying the normalization part of the combined algorithm is an interesting suggestion as an implementation technique. Tunneling through an existing normalization library would be a hack, which should never be necessary except where normalization is broken (see compatibility Han characters). However, even if standard canonical decompositions can be mistaken, tunneling isn't really a fool-proof answer, because it assumes that data didn't get normalized en route.
There's nothing that reliably prevents that from happening in a distributed system (unless all the parts are under your tight control, which would seem to make it a distributed system in name only). And the question I have is: under what circumstances would better results be obtained by doing this normalization? I suspect that the answer is only for backward compatibility with code written before Unicode came into existence. If I'm right, then it would be better for most normalization routines to ignore/violate the Standard, and not do this normalization. Let's get back to the interesting question: Is it possible to correctly process text that uses 00B7 for ANO TELEIA, or is this fundamentally impossible? If so, under what scenario? A./
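The singleton canonical mapping at the heart of that question can be observed directly, for instance in Python:

    import unicodedata

    # U+0387 GREEK ANO TELEIA decomposes canonically to U+00B7 MIDDLE DOT,
    # so any normalization pass erases the code point distinction.
    print(unicodedata.normalize('NFC', '\u0387') == '\u00b7')  # True
    print(unicodedata.normalize('NFD', '\u0387') == '\u00b7')  # True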
Re: Rendering Raised FULL STOP between Digits
On 3/22/2013 6:17 PM, Richard Wordingham wrote: On Fri, 22 Mar 2013 18:01:14 -0700 Asmus Freytag asm...@ix.netcom.com wrote: On 03/21/2013 04:48 PM, Richard Wordingham wrote: However, distinguishing U+00B7 and U+0387 would fail spectacularly if the text had been converted to form NFC before you received it. That's a claim for which the evidence isn't yet solid; if it could be made solid, the claim would be very interesting. Distinguishing the character codes will fail trivially. The question is whether analysis or processing of the text will fail spectacularly. The latter is the true test of whether the unification is broken. I, like many others on the various lists, would like to see a concisely argued and well-documented case made for or against this, using real-world examples. A./ PS: If you quote selectively, my text doesn't make any sense. At this point in the message you dropped Karl's reply, which is what I am referring to next: However, even if standard canonical decompositions can be mistaken, tunneling isn't really a fool-proof answer, because it assumes that data didn't get normalized en route. Isn't that the key part of what I said above? No it isn't, and even if it was, I was not replying to your words here. A./ Richard.
Re: Rendering Raised FULL STOP between Digits
On 3/23/2013 4:55 AM, Michael Everson wrote: On 23 Mar 2013, at 01:01, Asmus Freytag asm...@ix.netcom.com wrote: Let's get back to the interesting question: Is it possible to correctly process text that uses 00B7 for ANO TELEIA, or is this fundamentally impossible? If so, under what scenario? It is possible to process text without Unicode at all, using sets and sets of 8-bit font-hack fonts. We all did it for years. A bit of a non-sequitur, in that whatever may have been done with 8-bit standards doesn't necessarily advance the discussion about how to do things in Unicode. Also, arguably not fully applicable, because the types of processing that could be done with those legacy sets exclude some important real-world scenarios that only Unicode enables... In Unicode, 00B7 and 0387 are canonically mapped, so making distinctions based on code point is not guaranteed to be portable. That's why I singled out 00B7 (not 0xB7, but U+00B7). The question was, given that: Is it possible to correctly process text that uses one and the same character code for ano teleia, middle dot, raised decimal point (and fourteen other uses), or is this fundamentally impossible? If so, under what scenario? I think handling a raised decimal dot is not any more difficult than recognizing when a period is a decimal point (there are some edge cases there that are challenging, but implementations have settled on using period, so that's a done deal). I don't know about the fourteen other uses, but there's been a lot of griping about ano teleia. (That's why I singled out that one, even though I know most of the griping took place in a parallel discussion on another list.) I think it would be useful to actually write down an overview of the recommended implementation approach for handling ALL the different uses of the middle dot and to make sure that what is recommended is not only theoretically possible, but acceptable and accepted(!) as best practice by implementers, users, and font designers alike. If such a document were to successfully cover all (widely-)known cases, it would make fine material for adding to the character description. If there are holes (things that can't be done - see the question) then it would form a basis on which the UTC could make some decisions on how to improve the standard. A./
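To make the shape of such an overview concrete, a deliberately naive dispatch sketch for U+00B7 - the heuristics here are illustrative only, and at least one known hole is flagged in the comments:

    import unicodedata

    def classify_middle_dot(text, i):
        """Guess the role of text[i] == U+00B7 from its neighbors."""
        prev = text[i - 1] if i > 0 else ''
        nxt = text[i + 1] if i + 1 < len(text) else ''
        if prev.isdigit() and nxt.isdigit():
            # Hole: could just as well be a multiplication dot in a formula.
            return 'raised decimal point?'
        if prev.lower() == 'l' and nxt.lower() == 'l':
            return 'Catalan punt volat'
        if prev and 'GREEK' in unicodedata.name(prev, ''):
            return 'ano teleia?'
        return 'unknown'

    print(classify_middle_dot('23\u00b74', 2))           # raised decimal point?
    print(classify_middle_dot('il\u00b7lusi\u00f3', 2))  # Catalan punt volat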
Re: Rendering Raised FULL STOP between Digits
The question is who would be able to take on the drafting of a document that explains the recommended usage of 00B7 for the various purposes (including recommended ways of getting the correct rendering and processing). ONLY by having such a document is it possible to be certain that the encoding (now or in future) will not become an obstacle to any of the several usage scenarios. At the moment, the statement that the existing encoding is actually implementable is something that must be considered unproven (enough issues have been pointed out for various elements of the unification already to allow such a conclusion). What we are not getting closer to is a rational understanding of how to improve this situation. Random addition of middle dot characters for some purpose is just as bad as pretending everything is fine with the status quo. I applaud any effort you can make to hold off such additions, but without addressing the larger question we are not getting to a place where we can be confident of what we have (or need). A./ On 3/27/2013 10:56 AM, Michel Suignard wrote: I think it would be useful to actually write down an overview of the recommended implementation approach for handling ALL the different uses of the middle dot and to make sure that what is recommended is not only theoretically possible, but acceptable and accepted(!) as best practice by implementers, users, and font designers alike. If such a document were to successfully cover all (widely-)known cases, it would make fine material for adding to the character description. If there are holes (things that can't be done - see the question) then it would form a basis on which the UTC could make some decisions on how to improve the standard. Needless to say, the project editor for 10646, who has been pushing the can down the road for EVER concerning the proposed 'A78F LATIN LETTER MIDDLE DOT', would appreciate such a production. As RichardW has shown before, it is just an additional use case among many other middle dot scenarios. It does not seem wise to finalize a decision concerning the encoding of another middle dot unless clarification is brought concerning the de facto unification of middle dot, ano teleia, and the British decimal point. If some dis-unification is considered, all these aspects should be taken into account. Michel
Re: Rendering Raised FULL STOP between Digits
On 3/27/2013 12:07 PM, Philippe Verdy wrote: 2013/3/27 Asmus Freytag asm...@ix.netcom.com: At the moment, the statement that the existing encoding is actually implementable is something that must be considered unproven (enough issues have been pointed out for various elements of the unification already to allow such a conclusion). What we are not getting closer to is a rational understanding of how to improve this situation. Random addition of middle dot characters for some purpose is just as bad as pretending everything is fine with the status quo. We are in fact not discussing random additions but want to handle correctly use cases that are in fact very frequently needed. Ah, what additions are you discussing? For example, the Catalan syllable breaker is not a random case; it is in fact highly used and needed as part of the standard orthography (and Catalan is not a minor language, we cannot just ignore it). Are you suggesting the addition of a character for it? There are very frequent uses of the dots and hyphens, which are too heavily overloaded in their original ASCII-only encoding; the same goes for apostrophes/quotes. This causes enough nightmares when trying to parse text, and it's unbelievable that there's no solution to augment the text with either distinct characters, or some variant selectors, or some other format controls to disambiguate these uses - a distinction which really bears on essential character properties (which have been in Unicode for a long time, like the general category). That's restating well-known issues. Thanks for agreeing. However, let's limit the discussion to dots, otherwise we'll never get any conclusion. For the dashes there are many explicit characters that were encoded already; the same for the quotes. In those cases, there is often a more readily discernible difference in appearance that made the decision to disunify somewhat easier. The situation for the middle dot is both less well understood and less well addressed. The solution based on an upper-layer protocol will not work (for example in filenames, in databases of toponyms, or in archived legal documents whose interpretation should not cause any trouble, including when these documents are converted or exported to many other formats). We are here exactly within the definition of linguistic rules for each language, some of them being highly standardized, and which would require a stricter, less ambiguous encoding. The time of ASCII-only is over. The UCS offers many new unused possibilities, as well as many existing technical solutions, which should not be based just on a heuristic (which will always break in many cases). Users want to be sure that their text will not be misinterpreted, or rendered in an ambiguous or wrong way. Again, a nice general statement. However, it lacks the kind of detail and documented evidence of particular usage that would bring us further at this point. Even if the solutions proposed seem novel, this should not block us. And even a novel solution can work in compatibility with the huge existing corpus of texts, which will remain ambiguous as they are. The novel encoding solution can perfectly well provide a fallback mechanism where it will adopt the old compatibility scheme (similar to ASCII). Of course, nothing will prevent anyone from using characters as they want in random cases, even if this breaks all commonly admitted properties and behaviors. My use of the word random was directed at piecemeal addition of characters. You are using it in a different sense.
But this should be distinguished from frequently used cases which have had rules formulated long ago in well-known languages (except that texts now have to live in an environment which is more and more multilingual, in which it's not possible to just infer which language to select in order to apply its well-known rules). We have no other solution than providing explicit hints in the encoded texts (and forgetting the time of ASCII-only, except in some technical domains like programming languages and transport/storage protocols, which have their own internal syntaxes and which do not really qualify as plain text). You've advocated hints or semantic selectors. While a feasible model in principle, I see the main issue in that it would create yet another type of encoding; this is especially troublesome in light of the precedent for quotes and dashes, where there was a careful addition of single-purpose (not overloaded) characters. Unless you can present a detailed analysis of the requirements which could be used to prove that ONLY such a novel coding construct can handle the needed rendering and processing tasks, I fear it would be difficult to get traction for such a proposal. But that brings me back to my original issue: nobody has done the necessary analysis of the requirements for all (or at least the major) use cases for a mid-level to raised-level dot and pinned down what is or isn't possible in software support (rendering
Re: If Unicode wants to show the Red Card to someone ...
On 4/1/2013 12:19 PM, Buck Golemon wrote: The only remaining question is whether the colors should be represented in the HSL or HSV color space. Go HSV http://www.hsv.de/news/!
Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)
On 4/22/2013 4:27 AM, Charlie Ruland ☘ wrote: * William_J_G Overington [2013/4/22]: [...] If the scope of Unicode becomes widened in this way, this will provide a basis upon which those people who so choose may research and develop localizable sentence technology with the knowledge that such research and development could, if successful, lead to encoding in plane 13 of the Unicode system. I don’t think your problem is “the scope of Unicode” but the size of the community that uses “localizable sentences.” The Unicode Consortium is prepared to encode all characters that can be shown to be in actual use. Please submit a formal proposal that can serve as a basis for further discussion of the topic. I'm afraid that any proposal submitted this way would just become the basis for a rejection with prejudice. Independent of the lack of technical merit of the proposal, the utter lack of support (or use) by any established community would make such a proposal a non-starter. In other words, can be shown to be in actual use is an important hurdle that this scheme, however dear to its inventor, cannot seem to pass. The bar would actually be a bit higher than you state it. The use has to be of a kind that benefits from standardization. Usually, that means that the use is wide-spread, or failing that, that the character(s) in question are essential elements of a script or notation that, while themselves perhaps rare, complete a repertoire that has sufficient established use. Characters invented for possible use (as in could become successful) simply don't pass that hurdle, even if, for example, the inventor were to publish documents using these characters. There are honest attempts, for example, to add new symbols to mathematical notation, which have to wait until there's evidence that they have become accepted by the community before they can be considered for encoding. Mr. Overington is quite aware of what would be the inevitable outcome of submitting an actual proposal; that's why he keeps raising this issue with some regularity here on the open list. A./
Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)
On 4/22/2013 12:35 PM, Stephan Stiller wrote: [Charlie Ruland:] The Unicode Consortium is prepared to encode all characters that can be shown to be in actual use. Are you sure there is a precedent for what is essentially markup for a system of (alpha)numerical IDs? You don't even have to look that far. These inventions utterly fail the actual use test, in the sense that I explained in my other message. I'm always suspicious if someone wants to discuss scope of the standard before demonstrating a compelling case on the merits of wide-spread actual use. A./
Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)
On 4/23/2013 3:00 AM, Philippe Verdy wrote: Do you realize the operating cost of any international standards committee, or of the maintenance and securing of an international registry? Who will pay? Currently we all are paying by having interminable discussions of half-baked ideas foisted onto us. There's a word for this. Time for this discussion to be dropped. A./
Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)
On 4/23/2013 2:01 AM, William_J_G Overington wrote: On Monday 22 April 2013, Asmus Freytag asm...@ix.netcom.com wrote: I'm always suspicious if someone wants to discuss scope of the standard before demonstrating a compelling case on the merits of wide-spread actual use. The reason that I want to discuss the scope is because there is uncertainty. I'm not going to engage on a scope discussion with you, even on this lovely list, without some shred of evidence that there is compelling need. Cheers, A./
Re: Suggestion for new dingbats/symbols
On 5/26/2013 3:15 PM, David Starner wrote: On Sun, May 26, 2013 at 12:40 PM, Andreas Stötzner a...@signographie.de wrote: One of the bodies in the world still ignorant of this fact to the very day is Unicode. Which I feel is a mess. Problems from Unicode generally come from two places: compatibility with non-Unicode data sets, and people with different goals working on it. Excellent insight. However, both come with the territory of designing a universal character encoding. With a mandate like that, it's difficult to leave any significant user population behind, which forces you both to include the superset of what went before and to encompass people with overlapping, but partially divergent goals. Unicode has some characteristics that emerged and took on added importance over time. These include a desire for longevity and stability, which, among other things, require that characters, once admitted, must be carried along forever - and that implies that one must be leery of anything that hasn't stood the test of time. Characters fall out of use in the real world all the time, but the ideal for Unicode is to include primarily those that have an ongoing use in archiving and historical study, which in the digital universe might include anything used on a wide enough scale. I sympathize with Andreas' take that the nature and development of modern pictographic writing are rather less well understood than they deserve, and that decisions about encoding are therefore made in partial ignorance of the facts. Solid scholarly study of the use of signs, symbols and pictographs might help - except that there seem to be no scholars who tackle these from an angle that would ultimately be useful for encoding. I don't believe that is merely a funding problem, but something more fundamental. A./ PS: German uses the same term, wissenschaftlich, for both scientific and scholarly approaches to knowledge. There are prefixes you can use to narrow things down, but in context, they are often dropped. This, in turn, can lead to confusion because the wrong choice can be made in translation. I don't think there's a natural science of character encoding, and I don't believe that Andreas was really claiming that. Still, there are ways of rigorously studying the phenomenon, an activity that would be considered scholarship.
Re: Suggestion for new dingbats/symbols
On 5/29/2013 1:39 AM, Andreas Stötzner wrote: On 29.05.2013 at 01:06, David Starner wrote: And what you'll run into is the fact that people don't agree that that belongs in Unicode. What Andreas was suggesting is rigorous study. I think that is a commendable suggestion. The more interesting question is what aspects such a study should encompass, what are to be its starting points, and what kind of conclusions should be possible after it is completed? With better facts in hand it will be much easier to double-check whether currently-held assumptions about their relevance for encoding hold up or need revisiting. Without facts, this kind of discussion just deals in pre-conceived notions, and therefore adds little value. A./
Re: Preconditions for changing a representative glyph?
On 5/29/2013 8:39 AM, Leo Broukhis wrote: I'd like to ask: what is supposed to be the trigger condition for the UTC to consider changing the representative glyph of your favorite symbol here to a novel design? The answer: the purpose of the representative glyph is not to track fashions in representation but to give an easily recognized orthodox shape. In the case of symbols, shape matters differently than for letters (where you have a word context that allows even decorative font shapes to be readable). For symbols, once you leave the canonical shape behind, there's always the argument that what you have is in fact a new symbol. There are some exceptions to this, where the notational aspect of symbol use is so strong that variations really function identically and can be unified without issues. This might be the case in your example. However, in general, I would dispute that this is true for non-notational symbols. In the case you give, the new design is clearly not the canonical shape, because it deliberately innovates. If it ever replaces the other sign in a majority of uses (not just in NYC) then perhaps updating the glyph might be appropriate. At this time, we are far from that point. A./
Re: Preconditions for changing a representative glyph?
On 5/29/2013 9:38 AM, Leo Broukhis wrote: On Wed, May 29, 2013 at 9:35 AM, Asmus Freytag asm...@ix.netcom.com wrote: In the case you give, the new design is clearly not the canonical shape, because it deliberately innovates. If it ever replaces the other sign in a majority of uses (not just in NYC) then perhaps updating the glyph might be appropriate. At this time, we are far from that point. That we are far from that point is clear to me; I was asking if there is a (semi-)formal definition of that point. What is a majority of uses? I think Michael's answer covered that. A./
Re: Preconditions for changing a representative glyph?
On 5/29/2013 9:53 AM, Manuel Strehl wrote: Out of curiosity, has it happened before that a glyph was updated (i.e., substantially changed) in the standard? Yes, Philippe gives some examples of typical situations. Representative glyphs are not immutable - what is immutable is the identity of the character that is encoded. A change in representative glyph that affects the perception of that identity in an adverse way must be avoided, but, in reverse, a glyph that leads to misidentification of a character can, and in typical situations also should, be corrected. For symbols, the identity of the character does not necessarily exist independently of its shape. Two similar shapes may exist where each is used only in some context, or where the usage contexts only partially overlap. If that is the case, it should be questioned whether this is really a matter of two representations of the same character, or whether it is a case of two characters that happen to be related. For letters, you have the word context that allows you to resolve the identity question. For symbols, there is no such single, overriding context. A./
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On 6/19/2013 6:36 AM, Michael Everson wrote: Only in text which has been decomposed. Not all text gets decomposed. All text may get decomposed without warning. As data is shipped around and processed in various parts of a distributed system, nobody can make any safe assumptions on the normalization state for their data. They may get composed, decomposed, or they may miraculously remain in whatever mixed normalization state they were created in. The point is, any technical argument or design decision that implies that one has control of the normalization state is ipso facto suspect. A./
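A two-line demonstration with a Latvian example (Python's unicodedata): the two spellings are canonically equivalent, and any NFC or NFD pass en route silently flips one into the other.

    import unicodedata

    precomposed = '\u0137'  # LATIN SMALL LETTER K WITH CEDILLA
    decomposed = 'k\u0327'  # k + COMBINING CEDILLA
    print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True
    print(unicodedata.normalize('NFD', precomposed) == decomposed)  # True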
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On 6/19/2013 6:36 AM, Michael Everson wrote: The issue of cedilla can easily be solved at a higher level, font technologies like OpenType can easily display glyphs in Latvian or Livonia and different glyphs for Marshallese. Only in environments which permit language tagging. I'd like Marshallese children to be able to write their language in filenames. Language tagging doesn't seem to be reliable enough to require its use in anything other than high-end typography. A./
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On 7/3/2013 2:04 AM, Michael Everson wrote: On 3 Jul 2013, at 09:52, Martin J. Dürst due...@it.aoyama.ac.jp wrote: Quite a few people might expect their Japanese filenames to appear with a Japanese font/with Japanese glyph variants, and their Chinese filenames to appear with a Chinese font/Chinese glyph variants. But that's never how this was planned, and that's not how it works today. Yeah, but CJK is a world of difference away from alphabets of 30-40 characters. That sounds dangerously close to special pleading. And it's a pretty easy guess that there are quite a few more users with Japanese and Chinese filenames in the same file system than users with Latvian and Marshallese filenames in the same file system, both because both Chinese and Japanese are used by many more people than Latvian or Marshallese and because China and Japan are much closer than Latvia and the Marshall Islands. I oppose language-tagging as a mechanism to fix the cock-up of slavishly following 8859 decomposition for cedilla and comma-below. Character encoding is the better way to deal with this. That's the more fundamental point. If comma below and cedilla are really fundamentally different marks, then treating them as such is a principled solution. However, the compromise sounds dangerously like it introduces another one of those irregularities that people will trip over in the future. A./ Michael Everson * http://www.evertype.com/
Re: Scalability of ScriptExtensions (was: RE: Borrowed Thai Punctuation in Tai Tham Text)
On 7/8/2013 1:35 PM, Whistler, Ken wrote: A much more productive approach, it seems to me, would be instead to try to establish information about various, identifiable typographical traditions for use of punctuation around the world, and then associate exemplar sets of punctuation used with those traditions. I would recommend that an approach like that be used behind the scenes to manage the update of the data file. We are stuck with a format that seemingly assumes that all characters are treated individually. However, I agree with you that this is not the case; instead, there are these sets of punctuation marks for certain typographical traditions. In addition, there are issues like the Dandas, where specific marks have been unified across a range of related scripts. A flexible way to pull this information together would be a UTN that tries to collect this information in human-readable, not machine-readable, form, with commentary and background. If the information in the UTN is considered solid, then it could be reflected, in a separate pass, in the existing property file. Because you would work on the basis of either typographical sets (or explicit encoding decisions) there would be less temptation to jiggle individual characters' property values. A./
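A sketch of that behind-the-scenes model - the tradition names and member characters below are illustrative placeholders, not proposed data: maintain sets per tradition, then derive the per-character file entries mechanically.

    # Map typographical traditions to shared punctuation, then invert into
    # per-character script extensions for the data file.
    TRADITIONS = {
        'Thai': {'\u0e4f', '\u0e5a'},  # placeholders, not proposed data
        'Tai Tham': {'\u0e4f'},
    }

    extensions = {}
    for tradition, chars in TRADITIONS.items():
        for ch in chars:
            extensions.setdefault(ch, set()).add(tradition)

    for ch in sorted(extensions):
        print(f'U+{ord(ch):04X}; {sorted(extensions[ch])}')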
Re: Scalability of ScriptExtensions
On 7/8/2013 8:15 PM, Richard Wordingham wrote: On Mon, 08 Jul 2013 14:42:15 -0700 Asmus Freytag asm...@ix.netcom.com wrote: We are stuck with a format that seemingly assumes that all characters are treated individually. However, I agree with you, that this is not the case, but instead, there are these sets of punctuation marks for certain typographical traditions. UCD files are intended for computer use. Are you proposing that text rendering systems try to identify the typographical 'tradition' in use? If not, the format seems appropriate for computer use. I'm suggesting that we change the model of how this particular file is maintained, not how the information in it is represented. That was implicit in the part of my reply that you deleted in your answer. In addition, there are issues like the Dandas, where specific marks have been unified across a range of related scripts. And effectively unrelated, like the Latin script. Richard.
Re: symbols/codepoints for necessity and possibility in modal logic
What is wrong with using DIAMOND OPERATOR? A./ On 7/18/2013 8:27 PM, Stephan Stiller wrote: Hi all, Modal logic uses a box and a diamond (this is how they're informally called) as operators (accepting one formula and returning another) to denote necessity and possibility, resp. Older texts might use the letters L and M (resp). Which Unicode codepoints do modal box and diamond correspond to? According to the charts, it seems like the box is ◻ (U+25FB) (is this definitive?), but what about the diamond? Unlike what one might glean from the charts, ⟠ (U+27E0) is afaiu /not/ normally used to denote possibility in the default† sense. Wiki's List of logic symbols article has something to say about this too, but I'm always cautious about information from there. Stephan † eg in the sense of λ푥 . ¬◻¬푥 with ◻ as used in say the axiom schema conventionally named *T* in modal logic
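For completeness, in LaTeX the usual spellings are \Box and \Diamond (from amssymb); the duality alluded to in the footnote, and axiom schema T, read:

    % requires \usepackage{amssymb}
    \Diamond p \equiv \lnot \Box \lnot p  % possibility as the dual of necessity
    \Box p \rightarrow p                  % axiom schema T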