Re: UTF-8 ill-formed question

2012-12-16 Thread Doug Ewell

Philippe Verdy wrote:

> If the purpose of this pocket conversion card is to be used for
> tutorial purposes,


It never was. It was a quick reference guide for experienced users who 
already understood the caveats.


Not worth arguing further.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: UTF-8 ill-formed question

2012-12-16 Thread Philippe Verdy
But Marco's old design at that time (2002) still ignored the Unicode
UTF-8 conformance constraints, as shown by its use of the obsolete
"U-00n" notation (matching the obsolete ISO/IETF definition). If the
purpose of this pocket conversion card is tutorial, omitting the validity
constraint is not very didactic: if these rules are not exposed and learnt,
they will be ignored in applications and will continue to cause
compatibility trouble.

Note that in my previous post, I dropped the extra leading zeroes in
Marco's use of the obsolete "U-00n" notation of supplementary
codepoints, but I forgot to change the "U-" prefix into "U+" for these
supplementary code points. Sorry about that.

Of course there are better ways to present a card that will be printed
(then placed under a reusable plastic cover, like an identity card or
driver's licence, at the size of a credit card to fit your jacket pocket):
use HTML or PDF instead of this basic plain-text format. The usage
instructions on the back would be clearer, and additional visual hints
would make the card more obvious to use. The diagram would no longer be
restricted to the ugly box-drawing characters (only usable with monospaced
fonts, which are ugly for presenting the instructions). The pocket card
could also use background colors to distinguish the white frames where you
need to write something (better than using a dot) from what is fixed in
the layout.

There are also other possible presentations if a similar tool is printed on
cardboard: use rotating wheels (one for VW, one for X, one for Y; you may
omit the Z wheel, which would display the same value in the input and
output windows) and a front cardboard mask with windows showing the input
and the result of the conversion. You don't need a pen, and it's reusable,
simpler and faster to use.

2012/12/16 Doug Ewell 

> I remember Marco's original post in 2002. His intent was to give people
> with an actual U+ code point that needed converting—like James Lin ten
> years later—a quick way to do so without getting immersed in all the
> bit-shifting math.
>
> If this were a routine being run by a computer, or a tutorial on UTF-8, I
> would agree that it should have taken loose surrogates into account. But
> it's not. It's just a quick manual reference guide, and loose surrogates
> are 0.0001% of the real-world problem for users like James.
>
> While I note that Philippe's amended version seems straightforward and in
> keeping with Marco's original intent (short and simple), I'd like to
> suggest that neither Marco for creating the original guide, nor anyone else
> for doing up UTF-16 and UTF-32 versions, nor Otto for reposting them on the
> list this week, need to be beaten up any further over this edge case.
>
>
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell
>


Re: wrongly identified geometric shape

2012-12-16 Thread philip chastney
On 2012/Dec/08 02:34, Michel Suignard wrote: 
> From:philip chastney
>> anybody converting a document currently using Wingding fonts to one using 
>> Unicode values and Unicode fonts instead, using the transliteration proposed 
>> in N 4384, will find their squares somewhat diminished in size (in this 
>> case, by one third)
>>
>> this is because the terminology used for "size" in N 4384 is at variance
>> with the terminology used heretofore in UTR 25
>
>
>No such thing as a Unicode font. We produce the charts using complicated 
>size adjustment and hundreds of fonts provided by various providers, and then 
>anyone is free to create their own. 

I meant the term "Unicode Fonts" as used here:
      http://www.unicode.org/resources/fonts.html 

>There is nothing normative about relative size. TR25 does some work at 
>classifying these relative sizes and this is in fact explored in detail in 
>section 5 of N4384 (that I wrote). N4384 aims at expanding the size set exposed 
>in TR25 while staying compatible with its principle.

TUS does not list relative sizes among the normative behaviours, true, but 
anyone who draws U+2295 CIRCLED PLUS bigger than U+2A01 N-ARY CIRCLED PLUS 
OPERATOR is an idiot, and the font is not compliant with TUS, because the 
character identities have not been preserved

TUS does not dictate actual sizes, provided the specified relationship between 
glyph sizes is maintained, and that may perhaps be what you meant


 
>Some reality check with common Math fonts shows that they tend to use larger 
>sizes for their geometric shapes than what is presented in the current chart 
>(and in TR25). In fact I am now working on harmonizing the rest of the chart's 
>geometric shapes with the Wingdings set, and that may result in some size 
>adjustment in future charts. I have been looking at the STIX fonts, for 
>example. This would in fact solve the concern expressed here by making 25FC 
>and 25A0 a tad bigger. 

size adjustment of one or two glyphs in an actual font is not an encoding issue

the original msg gave just one example of the sort of anomaly that
results from the introduction, in N 4115, of two entirely
unnecessary distinctions 

the story so far is given in
  www.chastney.com/~philip/shapes/slightly_small_%28revised%29.pdf
  www.chastney.com/~philip/shapes/size_9_centered.pdf
  www.chastney.com/~philip/shapes/N4115_an_alternative_encoding.pdf

the arithmetic involved shouldn't challenge the average 12-year-old but, 
because it's unlikely anybody will bother working through it all, check out the 
last page of "N4115_an_alternative_encoding", which shows how Wingdings
shapes can and do, already, fit harmoniously with Table 2.5 from UTR 25

and (assuming "extra large" is not intended to be a graduated size) does so 
without needing to expand the size set exposed in UTR 25

this is because the graduation of sizes has a number of implicit
constraints: 
(i) the "small" size needs to be big enough to be visible at small
point sizes; 
(ii) the "large" size must be less than the font's body height;
(iii) the difference between adjacent sizes needs to be discernible
at, say, 12pt.

this leaves the font designer with just 3 degrees of freedom:
-- the size of the start point
-- the size of the end point
-- the transition from one size to another,
other sizes being obtained by interpolation or extrapolation

if (iv) the "very small" size is somewhere round about the width of
a vertical stem, 
and (v) the "regular" size is somewhere about caps height, 
there's just the transition function to be decided

the transition function might consist only of a
number of different sized steps, but add in the observations that
(vi) the transition function might as well be smooth, and
(vii) given the preponderance of small sizes, a geometric
progression works well,
there isn't a lot left to do, in the way of design

a font like STIX, which uses a number of different sized steps, will
necessarily (because of the implicit constraints) be within a few
percentage points of a geometric progression
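
The geometric progression described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the number of graduated sizes, and the font-unit values for stem width and caps height are all hypothetical and are not taken from UTR 25 or N4384.

```python
def geometric_sizes(smallest: float, largest: float, count: int) -> list[float]:
    """Return `count` sizes from `smallest` to `largest` with a constant ratio.

    Hypothetical helper: the two end points and the number of steps are the
    designer's free choices; every other size follows by interpolation.
    """
    if count < 2 or smallest <= 0 or largest <= smallest:
        raise ValueError("need count >= 2 and 0 < smallest < largest")
    # Constant ratio between adjacent sizes, so small sizes sit close together.
    ratio = (largest / smallest) ** (1 / (count - 1))
    return [smallest * ratio ** i for i in range(count)]


# Assumed example values, in font units: a stem width of 80 for the smallest
# size and a caps height of 700 for the largest, with six graduated sizes.
sizes = geometric_sizes(80, 700, 6)
for size in sizes:
    print(round(size))
```

Because the ratio between neighbours is constant, the relative difference between adjacent sizes is the same at every step, which is one way to check constraint (iii) once for the whole scale.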

/phil chastney

Re: UTF-8 ill-formed question

2012-12-16 Thread Doug Ewell
I remember Marco's original post in 2002. His intent was to give people 
with an actual U+ code point that needed converting—like James Lin ten 
years later—a quick way to do so without getting immersed in all the 
bit-shifting math.


If this were a routine being run by a computer, or a tutorial on UTF-8, 
I would agree that it should have taken loose surrogates into account. 
But it's not. It's just a quick manual reference guide, and loose 
surrogates are 0.0001% of the real-world problem for users like James.


While I note that Philippe's amended version seems straightforward and 
in keeping with Marco's original intent (short and simple), I'd like to 
suggest that neither Marco for creating the original guide, nor anyone 
else for doing up UTF-16 and UTF-32 versions, nor Otto for reposting 
them on the list this week, need to be beaten up any further over this 
edge case.


--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell





Re: UTF-8 ill-formed question

2012-12-16 Thread Philippe Verdy
OK, then here is the minor change to the UTF-8 MPE, including the extra row
for strict conformance. It also strips the non-standard leading zeroes from
the U+n notation for code points.

(Yes, this is a derived work. I still credit him, but don't want to claim
any additional copyright; the card indicates that it is a modified version
and not the original, and I assume that he published it under a compatible
licence that allowed you to republish its version 1.1 on this list.) The
extra row is very similar in its conversion mechanism, except that it
explicitly states the valid code points that can be safely converted. Some
abbreviations in the description and in one column header are now fully
expanded for clarity; apart from this, the text is identical:

  Side 1 (print and cut out):

  ╔════════╦═══════╤══════════════════════════════╗
  ║U+0000  ║ yy zz │Cima's                        ║
  ║U+007F  ║ ▼  ▼  │  UTF-8 Magic Pocket Encoder  ║
  ║    YZ  ║ .  .  │  Vers. 1.1.1, 16 Dec. 2012   ║
  ╠════════╫───────┼───────┐               ╔══════╣
  ║U+0080  ║ 3x xy │ 2y zz │  Derived from ║ Hex= ║
  ║U+07FF  ║ 3. .. │ 2. ▼  │     Vers. 1.1 ║ Base ║
  ║   XYZ  ║ .  .  │ .  .  │  30 June 2004 ║   -4 ║
  ╠════════╫───────┼───────┼───────┐       ║ 0=00 ║
  ║U+0800  ║ 32 ww │ 2x xy │ 2y zz │  M.C. ║ 1=01 ║
  ║U+D7FF  ║ ▼  ▼  │ 2. .. │ 2. ▼  │       ║ 2=02 ║
  ║  WXYZ  ║ E  .  │ .  .  │ .  .  │       ║ 3=03 ║
  ╠════════╫───────┼───────┼───────┤       ║ 4=10 ║
  ║U+E000  ║ 32 3w │ 2x xy │ 2y zz │       ║ 5=11 ║
  ║U+FFFF  ║ ▼  ▼  │ 2. .. │ 2. ▼  │       ║ 6=12 ║
  ║  WXYZ  ║ E  .  │ .  .  │ .  .  │       ║ 7=13 ║
  ╠════════╫───────┼───────┼───────┼───────╢ 8=20 ║
  ║U-10000 ║ 33 0v │ 2v ww │ 2x xy │ 2y zz ║ 9=21 ║
  ║U-FFFFF ║ ▼  0. │ 2. ▼  │ 2. .. │ 2. ▼  ║ A=22 ║
  ║ VWXYZ  ║ F  .  │ .  .  │ .  .  │ .  .  ║ B=23 ║
  ╠════════╫───────┼───────┼───────┼───────╢ C=30 ║
  ║U-100000║ 33 10 │ 20 ww │ 2x xy │ 2y zz ║ D=31 ║
  ║U-10FFFF║ ▼  1. │ 2. ▼  │ 2. .. │ 2. ▼  ║ E=32 ║
  ║  WXYZ  ║ F  4  │ 8  .  │ .  .  │ .  .  ║ F=33 ║
  ╚════════╩═══════╧═══════╧═══════╧═══════╩══════╝

  Side 2 (print, cut out, and glue on back of side 1):

  ╔═══════════════════════════════════════════════════╗
  ║ Cima's UTF-8 Magic Pocket Encoder - User's Manual ║
  ║ (version 1.1.1, 16 Dec. 2012 - modified from the  ║
  ║ original version 1.1, 2004, by Marco Cimarosti)   ║
  ║                                                   ║
  ║ - Left column: min and max Unicode scalar values: ║
  ║   pick the row that applies to the code point you ║
  ║   want to convert to UTF-8. Letters V..Z mark the ║
  ║   hexadecimal digits that have to be processed.   ║
  ║ - Right column: hexadecimal to base-4 table.      ║
  ║ - Central columns: work area to compute each of   ║
  ║   the 1 to 4 octets that constitute valid UTF-8   ║
  ║   octet sequences.                                ║
  ║                                                   ║
  ║ Convert each digit marked by V..Z from hexadecimal║
  ║ to base-4. Write base-4 digits on the dots placed ║
  ║ under letters v..z (two base-4 digits per hex.    ║
  ║ digit). Convert 2-digit base-4 number to hex.     ║
  ║ digits and write them on the dots on the line.    ║
  ║ That is your UTF-8 sequence in hexadecimal.       ║
  ║ ▼ Triangular arrow heads show passages that may   ║
  ║ be skipped, either because the digit is           ║
  ║ hard-coded, or because it may be copied directly  ║
  ║ from the scalar value.                            ║
  ╚═══════════════════════════════════════════════════╝
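
For readers who prefer code to paper, the card's rows can be mirrored in a short routine. This is a minimal sketch, not part of the card: the function names are hypothetical, and it uses the usual bit-shifting arithmetic rather than the card's base-4 worksheet. The only piece taken directly from the card is the hex-to-base-4 column, reproduced by hex_to_base4.

```python
def hex_to_base4(digit: str) -> str:
    """Right column of the card: one hex digit as two base-4 digits."""
    value = int(digit, 16)
    return f"{value >> 2}{value & 3}"


def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8, one branch per card row."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"not a code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # The gap between the U+D7FF and U+E000 rows: surrogates are not
        # scalar values and have no well-formed UTF-8 encoding.
        raise ValueError(f"surrogate code point: U+{cp:04X}")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(hex_to_base4("E"))               # 32, as in the card's "E=32" entry
print(utf8_encode(0x20AC).hex(" "))    # e2 82 ac
```

Splitting the three-octet range into two branches is unnecessary in code, since a single range check rejects the surrogates; on the card the split is what makes the excluded range visible.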



2012/12/16 Otto Stolz 

> Hello,
>
>
> 2012/12/16 Otto Stolz 
>
>> The reason I excluded the surrogates from my UTF-8 MPE
>> was really that I needed additional space for the user’s
>> guide on the reverse side.
>>
>
> Sorry, typo; I meant: “my UTF-16 MPE”. I added that
> extra row (with the branch excluding the surrogates)
> to gain extra space on the reverse side.
>
> Am 2012-12-16 schrieb Philippe Verdy:
>
>> Add this missing row; everything on the reverse side can remain the same
>> (or can use a less "cryptic" compact description of how it works).
>>
>
> I will certainly not change Marco Cimarosti’s original design
> of his UTF-8 MPE.
>
> Best wishes,
>   Otto Stolz
>
>
>
>


Re: UTF-8 ill-formed question

2012-12-16 Thread Otto Stolz

Hello,

2012/12/16 Otto Stolz 

> The reason I excluded the surrogates from my UTF-8 MPE
> was really that I needed additional space for the user’s
> guide on the reverse side.

Sorry, typo; I meant: “my UTF-16 MPE”. I added that
extra row (with the branch excluding the surrogates)
to gain extra space on the reverse side.

Am 2012-12-16 schrieb Philippe Verdy:

> Add this missing row; everything on the reverse side can remain the same
> (or can use a less "cryptic" compact description of how it works).


I will certainly not change Marco Cimarosti’s original design
of his UTF-8 MPE.

Best wishes,
  Otto Stolz





Re: UTF-8 ill-formed question

2012-12-16 Thread Philippe Verdy
2012/12/16 Otto Stolz 

>
> The reason I excluded the surrogates from my UTF-8 MPE
> was really that I needed additional space for the user’s
> guide on the reverse side.
>

Why would adding a row on the front side not have preserved the space on
the reverse side?
If this is regarded as a didactic tool, adding this row would have put more
focus on the validity constraint of UTF-8, enforced in TUS and now also in
the IETF RFC made by ISO to be fully compatible with TUS.

I think that the row was missing only because your MPE was initially
designed for the old UTF-8 definition in the now obsolete ISO definition,
where the validity constraint was not clear. (It was not clear either in
past variations of UTF-8 that still exist in Java, not really for
plain-text interchange but for the native JNI API compatible with 8-bit C
strings, and as part of the serialization format of compiled Java classes.)

Add this missing row; everything on the reverse side can remain the same
(or can use a less "cryptic" compact description of how it works).


Re: UTF-8 ill-formed question

2012-12-16 Thread Otto Stolz

Hello,

am 2012-12-15 schrieb Philippe Verdy:

But there's still a bug (or request for enhancement) for your Pocket
converters :

- For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates)
from the sets of convertible codepoints.

- But you don't exclude this range in the case of your UTF-8 and UTF-32
"magic encoders" which could forget this case. Of course your encoder would
create distinct sequences for these code points, but they are not valid
UTF-8 or valid UTF-32 encodings.
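
The point about distinct but invalid sequences can be checked against an existing implementation. The sketch below assumes Python 3.4 or later and uses only the built-in codecs; the helper name strict_encodes is hypothetical.

```python
def strict_encodes(text: str, codec: str) -> bool:
    """Return True if the built-in codec accepts `text` in strict mode."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


lone = "\ud800"  # a lone high surrogate, U+D800

# The strict encoders refuse the surrogate in all three encoding forms.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, strict_encodes(lone, codec))

# Forcing the conversion yields a distinct three-octet pattern (ED A0 80),
# which is what the mechanical procedure would produce...
forced = lone.encode("utf-8", "surrogatepass")
print(forced.hex(" "))

# ...but a strict UTF-8 decoder rejects it as ill-formed.
try:
    forced.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```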


Only the UTF-16 variant is really *my* “magic pocket encoder” (MPE);
the author is named on each of the three.

I would not demand more from those MPEs than converting
a valid UCS character to a valid, and equivalent, UTF
sequence – and illustrating the underlying algorithm.
I guess, originally, they were meant as jokes – partially,
at least; I have used them as a didactic device, in my
beginner's lecture on Unicode.

Clearly, Mike Ayers made the point that the UTF-32 encoding
is nothing but a simple shortcut (in the terms of its two
predecessors). His one-row-only MPE expresses this quite
aptly, and any additional branch would spoil the impression.

The reason I excluded the surrogates from my UTF-8 MPE
was really that I needed additional space for the user’s
guide on the reverse side.

Cheers,
  Otto Stolz