Re: UTF-8 ill-formed question
Philippe Verdy wrote:

> If the purpose of this pocket conversion card is to be used for tutorial purpose,

It never was. It was a quick reference guide for experienced users who already understood the caveats. Not worth arguing further.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Re: UTF-8 ill-formed question
But the old Marco design at that time (2002) still ignored the Unicode UTF-8 conformance constraints, as demonstrated by its use of the obsolete "U-00n" notation (matching the obsolete ISO/IETF definition). If the purpose of this pocket conversion card is tutorial use, omitting the validity constraint is not very didactic, and it could continue to cause compatibility trouble if these rules are not shown and learnt, and are consequently ignored in applications.

Note that in my previous post, I dropped the extra leading zeroes in Marco's use of the obsolete "U-00n" notation of supplementary codepoints, but I forgot to change the "U-" prefix into "U+" for these supplementary code points. Sorry about that.

Of course there are better ways to present this card as something that will be printed (then placed under a reusable plastic cover, like an identity card or driving licence, at credit-card size to fit your jacket), using HTML or PDF instead of just this basic plain-text format. The usage instructions on the back side would also be clearer, and there would be additional visual hints to make it more obvious. And you would be less restricted in drawing the diagram, without the ugly box-drawing characters (only usable with monospaced fonts, which are ugly for presenting the instructions). The pocket card could also use background colors to better distinguish an all-white frame where you need to write something (better than using a dot) from what is fixed in the layout.

There are also other possible presentations, if printing a similar tool on cardboard: just use rotating wheels (one for VW, one for X, one for Y; you may ignore the Z wheel, which would display the same value in the input and in the output window) and a front masking card with windows showing the input and the result of the conversion! You don't need any pen; it's reusable, simpler and faster to use.
Re: wrongly identified geometric shape
On 2012/Dec/08 02:34, Michel Suignard wrote:

> From: philip chastney
>> anybody converting a document currently using Wingding fonts to one using
>> Unicode values and Unicode fonts instead, using the transliteration proposed
>> in N 4384, will find their squares somewhat diminished in size (in this
>> case, by one third)
>>
>> this is because the terminology used for "size" in N 4384 is at variance
>> with the terminology used heretofore in UTR 25
>
> No such thing as a Unicode font. We produce the charts using complicated
> size adjustment and 100s of fonts provided by various providers and then anyone
> is free to create their own.

I meant the term "Unicode Fonts" as used here: http://www.unicode.org/resources/fonts.html

> There is nothing normative about relative size. TR25 does some work at
> classifying these relative sizes and this is in fact explored in detail in
> section 5 of N4384 (that I wrote). N4384 aims at expanding the size set
> exposed in TR25 while staying compatible with its principle.

TUS does not list relative sizes among the normative behaviours, true, but anyone who draws U+2295 CIRCLED PLUS bigger than U+2A01 N-ARY CIRCLED PLUS OPERATOR is an idiot, and the font is not compliant with TUS, because the character identities have not been preserved

TUS does not dictate actual sizes, provided the specified relationship between glyph sizes is maintained, and that may perhaps be what you meant

> Some reality check with common Math fonts show that they tend to use larger
> size for their geometric shapes than what is presented in the current chart
> (and in TR25). In fact I am now working in harmonizing the rest of the chart
> geometric shapes with the Wingdings set and that may result in some size
> adjustment in future charts. I have been looking at the STIX fonts for
> example. This would in fact solve the concern expressed here by making 25FC
> and 25A0 a tad bigger.

size adjustment of one or two glyphs in an actual font is not an encoding issue

the original msg gave just one example of the sort of anomaly that results from the introduction, in N 4115, of two entirely unnecessary distinctions

the story so far is given in
  www.chastney.com/~philip/shapes/slightly_small_%28revised%29.pdf
  www.chastney.com/~philip/shapes/size_9_centered.pdf
  www.chastney.com/~philip/shapes/N4115_an_alternative_encoding.pdf

the arithmetic involved shouldn't challenge the average 12-year old but, because it's unlikely anybody will bother working through it all, check out the last page of "N4115_an_alternative_encoding", which shows how Wingdings shapes can and do, already, fit harmoniously with Table 2.5 from UTR 25 and (assuming "extra large" is not intended to be a graduated size) does so without needing to expand the size set exposed in UTR 25

this is because the graduation of sizes has a number of implicit constraints:
  (i) the "small" size needs to be big enough to be visible at small point sizes;
  (ii) the "large" size must be less than the font's body height;
  (iii) the difference between adjacent sizes needs to be discernible at, say, 12pt.

this leaves the font designer with just 3 degrees of freedom:
  -- the size of the start point
  -- the size of the end point
  -- the transition from one size to another,
other sizes being obtained by interpolation or extrapolation

if
  (iv) the "very small" size is somewhere round about the width of a vertical stem, and
  (v) the "regular" size is somewhere about caps height,
there's just the transition function to be decided

the transition function might consist only of a number of different sized steps, but add in the observations that
  (vi) the transition function might as well be smooth, and
  (vii) given the preponderance of small sizes, a geometric progression works well,
there isn't a lot left to do, in the way of design

a font like STIX, which uses a number of different sized steps, will necessarily (because of the implicit constraints) be within a few percentage points of a GP

/phil chastney
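
For illustration only, here is a minimal Python sketch of a geometric progression of sizes between a start point and an end point, as described in the message above. The function name `geometric_sizes` and every number in it (80 and 700, in thousandths of an em, and 4 steps) are hypothetical values chosen for the example; they are not taken from UTR 25, N 4115, STIX, or any other font.

```python
def geometric_sizes(start, end, steps):
    """Return steps + 1 sizes from start to end with a constant ratio
    between neighbouring sizes (a geometric progression)."""
    ratio = (end / start) ** (1.0 / steps)
    return [start * ratio ** i for i in range(steps + 1)]


# Hypothetical values in 1/1000 em: a stem-width start point and a
# caps-height end point, with three interpolated sizes in between.
sizes = geometric_sizes(80.0, 700.0, 4)
for size in sizes:
    print(round(size))
```

With a constant ratio, the small sizes sit closer together in absolute terms than the large ones, which is the property the message appeals to when it mentions the preponderance of small sizes.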
Re: UTF-8 ill-formed question
I remember Marco's original post in 2002. His intent was to give people with an actual U+ code point that needed converting, like James Lin ten years later, a quick way to do so without getting immersed in all the bit-shifting math.

If this were a routine being run by a computer, or a tutorial on UTF-8, I would agree that it should have taken loose surrogates into account. But it's not. It's just a quick manual reference guide, and loose surrogates are 0.0001% of the real-world problem for users like James.

While I note that Philippe's amended version seems straightforward and in keeping with Marco's original intent (short and simple), I'd like to suggest that neither Marco for creating the original guide, nor anyone else for doing up UTF-16 and UTF-32 versions, nor Otto for reposting them on the list this week, need to be beaten up any further over this edge case.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Re: UTF-8 ill-formed question
OK then here is the minor change for UTF-8's MPE, including the extra row for strict conformance. This includes the stripping of non-standard leading zeroes in U+n notations for code points. (Yes, this is a derived work. I still credit him, but don't want to assume any additional copyright; it indicates that this is a modified version and not the original, and I assume that he published it under a compatible licence that allowed you to republish its 1.1 version on this list.)

The extra row is very similar in its conversion mechanism, except that it explicitly states the valid codepoints that can be safely converted. And some abbreviations in the description or in one column header are now fully expanded for clarity, but beside this, the text is identical:

Side 1 (print and cut out):

╔════════╦═══════╤══════════════════════════════╗
║U+0000  ║ yy zz │Cima's                        ║
║U+007F  ║ ▼  ▼  │  UTF-8 Magic Pocket Encoder  ║
║    YZ  ║ .  .  │  Vers. 1.1.1, 16 Dec. 2012   ║
╠════════╫───────┼───────┐               ╔══════╣
║U+0080  ║ 3x xy │ 2y zz │ Derived from  ║ Hex= ║
║U+07FF  ║ 3. .. │ 2. ▼  │   Vers. 1.1   ║ Base ║
║   XYZ  ║ .  .  │ .  .  │ 30 June 2004  ║  -4  ║
╠════════╫───────┼───────┼───────┐       ║ 0=00 ║
║U+0800  ║ 32 ww │ 2x xy │ 2y zz │ M.C.  ║ 1=01 ║
║U+D7FF  ║ ▼  ▼  │ 2. .. │ 2. ▼  │       ║ 2=02 ║
║  WXYZ  ║ E  .  │ .  .  │ .  .  │       ║ 3=03 ║
╠════════╫───────┼───────┼───────┤       ║ 4=10 ║
║U+E000  ║ 32 3w │ 2x xy │ 2y zz │       ║ 5=11 ║
║U+FFFF  ║ ▼  ▼  │ 2. .. │ 2. ▼  │       ║ 6=12 ║
║  WXYZ  ║ E  .  │ .  .  │ .  .  │       ║ 7=13 ║
╠════════╫───────┼───────┼───────┼───────╢ 8=20 ║
║ U-10000║ 33 0v │ 2v ww │ 2x xy │ 2y zz ║ 9=21 ║
║ U-FFFFF║ ▼  0. │ 2. ▼  │ 2. .. │ 2. ▼  ║ A=22 ║
║   VWXYZ║ F  .  │ .  .  │ .  .  │ .  .  ║ B=23 ║
╠════════╫───────┼───────┼───────┼───────╢ C=30 ║
║U-100000║ 33 10 │ 20 ww │ 2x xy │ 2y zz ║ D=31 ║
║U-10FFFF║ ▼  1. │ 2. ▼  │ 2. .. │ 2. ▼  ║ E=32 ║
║    WXYZ║ F  4  │ 8  .  │ .  .  │ .  .  ║ F=33 ║
╚════════╩═══════╧═══════╧═══════╧═══════╩══════╝

Side 2 (print, cut out, and glue on back of side 1):

╔═══════════════════════════════════════════════════╗
║ Cima's UTF-8 Magic Pocket Encoder - User's Manual║
║ (version 1.1.1, 16 Dec. 2012 - modified from the  ║
║ original version 1.1, 2004, by Marco Cimarosti)   ║
║                                                   ║
║ - Left column: min and max Unicode scalar values: ║
║   pick the row that applies to the code point you ║
║   want to convert to UTF-8. Letters V..Z mark the ║
║   hexadecimal digits that have to be processed.   ║
║ - Right column: hexadecimal to base-4 table.      ║
║ - Central columns: work area to compute each of   ║
║   the 1 to 4 octets that constitute valid UTF-8   ║
║   octet sequences.                                ║
║                                                   ║
║ Convert each digit marked by V..Z from hexadecimal║
║ to base-4. Write base-4 digits on the dots placed ║
║ under letters v..z (two base-4 digits per hex.    ║
║ digit). Convert 2-digit base-4 number to hex.     ║
║ digits and write them on the dots on the line.    ║
║ That is your UTF-8 sequence in hexadecimal.       ║
║ ▼ Triangular arrow heads show passages that may   ║
║   be skipped, either because the digit is         ║
║   hard-coded, or because it may be copied directly║
║   from the scalar value.                          ║
╚═══════════════════════════════════════════════════╝
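
The row selection and bit regrouping that the card performs by hand can be sketched in Python. This is only an illustrative sketch, not Marco Cimarosti's method or any library's implementation: the names `ROWS`, `LEAD`, `encode_utf8` and `base4` are made up for this example. The six ranges mirror the six rows of the card, with the surrogate range U+D800..U+DFFF falling into the gap between the third and fourth rows.

```python
# Rows of the card: (first, last, number of octets). Nothing covers
# U+D800..U+DFFF, so surrogates are rejected rather than converted.
ROWS = [
    (0x0000, 0x007F, 1),
    (0x0080, 0x07FF, 2),
    (0x0800, 0xD7FF, 3),
    (0xE000, 0xFFFF, 3),
    (0x10000, 0xFFFFF, 4),
    (0x100000, 0x10FFFF, 4),
]

# Fixed high bits of the lead octet for each sequence length.
LEAD = {1: 0x00, 2: 0xC0, 3: 0xE0, 4: 0xF0}


def encode_utf8(cp):
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    for first, last, length in ROWS:
        if first <= cp <= last:
            break
    else:
        raise ValueError("not a Unicode scalar value: %X" % cp)
    octets = []
    for _ in range(length - 1):
        octets.append(0x80 | (cp & 0x3F))  # trail octet: 10xxxxxx
        cp >>= 6
    octets.append(LEAD[length] | cp)       # lead octet gets what is left
    return bytes(reversed(octets))


def base4(octet):
    """Show an octet as four base-4 digits, as on the card's work lines."""
    return "".join(str((octet >> shift) & 3) for shift in (6, 4, 2, 0))


print(encode_utf8(0x20AC).hex())                # e282ac
print([base4(b) for b in encode_utf8(0x20AC)])  # ['3202', '2002', '2230']
```

For U+20AC the base-4 view matches the third row of the card: the lead octet is "32" followed by the two base-4 digits of W (02), and each trail octet starts with "2".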
Re: UTF-8 ill-formed question
Hello,

2012/12/16 Otto Stolz

> The reason I excluded the surrogates from my UTF-8 MPE
> was really that I needed additional space for the user’s
> guide on the reverse side.

Sorry, typo; I meant: “my UTF-16 MPE”. I added that extra row (with the branch excluding the surrogates) to gain extra space on the reverse side.

On 2012-12-16, Philippe Verdy wrote:

> Add this missing row. Everything on the reverse side can remain the same
> (or can use a less "cryptic" compact description of how it works).

I will certainly not change Marco Cimarosti’s original design of his UTF-8 MPE.

Best wishes,
Otto Stolz
Re: UTF-8 ill-formed question
2012/12/16 Otto Stolz

> The reason I excluded the surrogates from my UTF-8 MPE
> was really that I needed additional space for the user’s
> guide on the reverse side.

Why would adding a row on the front side not have preserved the space for the reverse side?

If this is regarded as a didactic tool, adding this row would have put more focus on the validity constraint of UTF-8, enforced in TUS and now also in the IETF RFC and the ISO standard, which were made fully compatible with TUS.

I think that the row was missing only because your MPE was initially designed for the old UTF-8 definition in the now obsolete ISO definition, where the validity constraint was not clear. (It was not clear either in past variations of UTF-8 that still exist in Java: not really for plain-text interchange, but for the native JNI API compatible with 8-bit C strings, and as part of the serialization format of compiled Java classes.)

Add this missing row. Everything on the reverse side can remain the same (or can use a less "cryptic" compact description of how it works).
Re: UTF-8 ill-formed question
Hello,

On 2012-12-15, Philippe Verdy wrote:

> But there's still a bug (or request for enhancement) for your Pocket converters:
> - For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates) from the sets of convertible codepoints.
> - But you don't exclude this range in the case of your UTF-8 and UTF-32 "magic encoders" which could forget this case. Of course your encoder would create distinct sequences for these code points, but they are not valid UTF-8 or valid UTF-32 encodings.

Only the UTF-16 variant is really *my* “magic pocket encoder” (MPE); the author is named on each of the three.

I would not demand more from those MPEs than converting a valid UCS character to a valid, and equivalent, UTF sequence, and illustrating the underlying algorithm. I guess they were originally meant as jokes, at least partially; I have used them as a didactic device in my beginner's lecture on Unicode.

Clearly, Mike Ayers made the point that the UTF-32 encoding is nothing but a simple shortcut (in the terms of its two predecessors). His one-row-only MPE expresses this quite aptly, and any additional branch would spoil the impression.

The reason I excluded the surrogates from my UTF-8 MPE was really that I needed additional space for the user’s guide on the reverse side.

Cheers,
Otto Stolz
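
The point about surrogates in the quoted message can be checked with Python's built-in codecs. This is a small sketch under the assumption of Python 3.4 or later; `is_scalar_value` and `strict_encoder_rejects` are hypothetical helper names introduced here. The standard `surrogatepass` error handler is used only to show the distinct but ill-formed octet sequence that a mechanical conversion would produce.

```python
def is_scalar_value(cp):
    """True for code points that have a valid UTF-8/16/32 encoding."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def strict_encoder_rejects(text, codec):
    """True if the strict built-in encoder refuses the text."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


lone = "\ud800"  # a lone (unpaired) surrogate code point

# The strict encoders refuse it in all three encoding forms.
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(codec, strict_encoder_rejects(lone, codec))

# Forcing the mechanical conversion yields a distinct octet sequence ...
forced = lone.encode("utf-8", "surrogatepass")
print(forced.hex())

# ... which a strict UTF-8 decoder then rejects as ill-formed.
try:
    forced.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)
```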