Re: Question on U+33D7

2012-02-23 Thread António Martins-Tuválkin
On 2012/2/23 Matt Ma wrote:

> It is defined as
> "33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH"
> in UnicodeData.txt, but it is shown as "pH" in code chart. Should it be
> "0070 0048" or "PH"?

It should certainly be "pH", i.e., "0070 0048",
because that's the peculiar casing in widespread (universal, really)
use for this basic Chemistry concept (AFAIK it means "power of
Hydrogen"). See <http://en.wikipedia.org/wiki/pH#History>.

While there's no surprise at "PH" Unicode names being all caps, I’m
surprised that the decomposition mapping is wrongly set to 0050 0048
instead of to 0070 0048.





Re: Question on U+33D7

2012-02-23 Thread Asmus Freytag

On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote:

> On 2012/2/23 Matt Ma wrote:
>
>> It is defined as
>> "33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH"
>> in UnicodeData.txt, but it is shown as "pH" in code chart. Should it be
>> "0070 0048" or "PH"?
>
> It should certainly be "pH", i.e., "0070 0048",
> because that's the peculiar casing in widespread (universal, really)
> use for this basic Chemistry concept (AFAIK it means "power of
> Hydrogen"). See <http://en.wikipedia.org/wiki/pH#History>.
>
> While there's no surprise at "PH" Unicode names being all caps, I’m
> surprised that the decomposition mapping is wrongly set to 0050 0048
> instead of to 0070 0048.


The early fonts and code tables showed this in all caps.

Unfortunately, mappings are frozen - including mistakes.

One of the many reasons not to use NF"K"D or NF"K"C for transforming 
data - these transformations should be limited to dealing with 
identifiers, where practically all of the problematic characters are 
already disallowed.
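
As a minimal illustration of the effect, assuming Python's standard
unicodedata module as the tool: the canonical forms leave U+33D7 alone,
while the compatibility forms rewrite it to capital "PH", exactly as the
frozen mapping dictates.

```python
import unicodedata

SQUARE_PH = "\u33d7"  # U+33D7 SQUARE PH

# The decomposition field as carried in the Unicode Character Database.
decomposition = unicodedata.decomposition(SQUARE_PH)

# Canonical normalization does not touch compatibility characters.
nfc = unicodedata.normalize("NFC", SQUARE_PH)

# Compatibility normalization applies the <square> mapping.
nfkc = unicodedata.normalize("NFKC", SQUARE_PH)
nfkd = unicodedata.normalize("NFKD", SQUARE_PH)

print(decomposition)     # <square> 0050 0048
print(nfc == SQUARE_PH)  # True
print(nfkc, nfkd)        # PH PH
```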


If your intent is to sort or search a document using "fuzzy" 
equivalences, then you are not required to limit yourself to the NF"K" 
C/D transformations in any way, because you would not be claiming to be 
"normalizing" the text in the sense of a Unicode Normalization Form.

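One way such a "fuzzy" folding might look is sketched below in Python. The
SEARCH_OVERRIDES table and the search_key function are hypothetical names
invented for this example, and the choice to map U+33D7 to "pH" is an
application decision, not anything defined by Unicode. The result is a
private search key, not a Unicode Normalization Form.

```python
import unicodedata

# Hypothetical, application-defined overrides: characters whose
# compatibility mapping we choose not to follow when building a key.
SEARCH_OVERRIDES = {
    "\u33d7": "pH",  # U+33D7 SQUARE PH, drawn as "pH" in current charts
}

_OVERRIDE_TABLE = str.maketrans(SEARCH_OVERRIDES)


def search_key(text: str) -> str:
    """Return a case-sensitive key for fuzzy matching.

    Overrides are applied first, then NFKD folds the remaining
    compatibility characters. This is NOT a normalization form.
    """
    return unicodedata.normalize("NFKD", text.translate(_OVERRIDE_TABLE))


sample = "\u33d7 7.4"
print(search_key(sample))                     # pH 7.4
print(unicodedata.normalize("NFKC", sample))  # PH 7.4
```

The key stays case-sensitive on purpose: adding a case fold would erase the
very "pH" versus "PH" distinction the override is meant to preserve.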

A./




Re: Question on U+33D7

2012-02-23 Thread Ken Whistler

On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote:

>> It is defined as
>> "33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH"
>> in UnicodeData.txt, but it is shown as "pH" in code chart. Should it be
>> "0070 0048" or "PH"?
>
> It should certainly be "pH", i.e., "0070 0048",
> because that's the peculiar casing in widespread (universal, really)
> use for this basic Chemistry concept (AFAIK it means "power of
> Hydrogen"). See <http://en.wikipedia.org/wiki/pH#History>.
>
> While there's no surprise at "PH" Unicode names being all caps, I’m
> surprised that the decomposition mapping is wrongly set to 0050 0048
> instead of to 0070 0048.


O.k., folks, I guess it's time for everybody to gather around the fire
for another episode of "Every Character Has a Story".

First, to answer Matt Ma's original question, no, the decomposition should
*not* be "<square> 0070 0048". The reason for that is simple: no matter what
the glyph looks like, or what people think the character might mean, the
decomposition mapping is immutable -- constrained by the stability
guarantees for Unicode normalization.

U+33D7 had that decomposition mapping as of Unicode 3.1, which defines the
base for normalization stability, so right or wrong, come hell or high
water, it stays that way forever.
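
For readers who want to see where that frozen mapping lives, here is a
minimal Python sketch that splits the UnicodeData.txt record for U+33D7
into its fields. The RECORD string is the full line as I understand the
published file to carry it, and the field positions (5 for the
decomposition, 10 for the Unicode 1.0 name) follow the documented
UnicodeData.txt layout; treat both as assumptions to check against your
own copy of the file.

```python
# Assumed full record for U+33D7 from UnicodeData.txt (15 fields).
RECORD = "33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH;;;;"

fields = RECORD.split(";")

code_point = int(fields[0], 16)   # field 0: code point
name = fields[1]                  # field 1: current character name
unicode_1_name = fields[10]       # field 10: Unicode 1.0 name

# Field 5: optional <tag> followed by the mapped code points.
tag, *mapped = fields[5].split()
expansion = "".join(chr(int(cp, 16)) for cp in mapped)

print(f"U+{code_point:04X} {name}")  # U+33D7 SQUARE PH
print(tag, expansion)                # <square> PH
print(unicode_1_name)                # SQUARED PH
```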

But that raises the question of how it got to be that way in the first
place. To answer that, we have to dig deeper into the history of the
encoding.

If you will now pull down your copies of Unicode 1.0 off the shelf and
turn to p. 362,
you will see that U+33D7 was included in Unicode 1.0. Lo and behold, the
glyph shown in the charts for U+33D7 is "PH", with a capital "P", rather
than a lowercase "p". (The character was also named "SQUARED PH", rather
than the current "SQUARE PH", but the explanation for that will have to wait
for another evening.)

Unicode 1.0 didn't have any formal decompositions, but Unicode 1.*1* did.
In Unicode 1.1, on p. 75, the decomposition for U+33D7 is given as
"[0050] & [0048]", reflecting the glyph shown for the character in
Unicode 1.0.


It was Unicode 2.0 which changed the glyph for U+33D7 to "pH", on the
assumption that the character must have been intended as an East Asian
square symbol representation of the chemical symbol "pH". The decomposition
for U+33D7 was not adjusted, however, although its format was shifted to
"<square> + 0050 P + 0048 H" in the charts. Now, tracking down the details
of the decision process that was involved in changing the glyph for U+33D7
for Unicode 2.0 is pretty difficult. The development of the suite of fonts
for printing Unicode 2.0 was a pretty wild and woolly process, as that was
the first attempt to print the entire set of charts with outline fonts.
Unicode 1.0 had been printed with a bitmap font developed at Xerox in the
early early days. Some of the glyph changes between Unicode 1.0 and 2.0
"just happened", despite the care which was taken to try to check
everything.


I'm pretty sure that the glyph change for U+33D7 was discussed by the
editors at some point (in either late 1995 or very early 1996), but at that
stage in the development of the standard that kind of thing was usually not
recorded on an item-by-item basis. Remember, there was a *lot* going on then
which was much more important to the UTC than the glyph for some East Asian
compatibility character that nobody used: the design of UTF-8 for example!

Speaking of use of the character, where *did* it come from exactly, and what
was it intended for? Well, that is also problematical. *Most* of the
characters in the CJK Compatibility block in the range U+3380..U+33DD can
easily be traced to KS X 1001:1992 (then known as KS C 5601) or CNS 11643.
But U+33D7, U+33DA, and U+33DB are anomalous. They didn't have any
mappings (that I knew about) as of Unicode 1.0. They may have come from
some early draft of a Korean standard, or from some Asian company private
registry of character extensions, or maybe just from a paper copy of
"character stuff" sitting around at Xerox circa 1989. Nobody really seemed
to be sure what they were -- they were just more ill-advised East Asian
squared abbreviation "dreck" that was added to the pile and not examined
very carefully, because everybody knew that such symbols for SI units (and
other scientific and math symbols of their ilk, such as "ln" for natural
logarithm) should just be spelled out with regular characters.

We can presume, in hindsight, that U+33D7 *may* have been originally
intended
as an East Asian character set abbreviation symbol for the chemical concept
of "pH". U+33D9 was presumably intended for "parts per million", although
I don't recall that anybody has actually bothered to think about that,
and if
they had, they might have suggested that the glyph for *that* symbol also
be changed, to the more usual lowercase "ppm". And U+33DA "PR"?
Who knows? My guess would be an abbreviation for "per radian", as
in 57.2957 degrees per radian, but your guess is as good as mine.

Re: Question on U+33D7

2012-02-24 Thread Shriramana Sharma
Grandpa grandpa I wanna hear the story about the turtles *now*! :-)

Sent from my Android phone


Re: Question on U+33D7

2012-02-24 Thread Matt Ma
On Fri, Feb 24, 2012 at 5:18 AM, Shriramana Sharma wrote:
> Grandpa grandpa I wanna hear the story about the turtles *now*! :-)
>
> Sent from my Android phone

Thanks all for the enlightening replies.

My intent was sorting using UCA but it really did not matter much
because U+33D7 was sorted after "PH" in either case ("0050 0048" or
"0070 0048"). I was curious why U+33D7 was defined and stayed that way
in Unicode, and it was answered more than comprehensively.

Regards,
Matt