RE: Ways to show Unicode contents on Windows?

2013-07-09 Thread Murray Sargent
A bulk approach works. The hyperlink gives full instructions on how to set up 
the fonts. You can customize it by changing the fonts listed in default.cfl.

Murray

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Ilya Zakharevich
Sent: Tuesday, July 9, 2013 8:37 PM
To: Unicode Discussion
Subject: Re: Ways to show Unicode contents on Windows?

On Wed, Jul 10, 2013 at 04:24:36AM +, Murray Sargent wrote:
> Ilya asked, "Are there any other ways to show Unicode on Windows?"
> 
> You can download Unibook (http://www.unicode.org/unibook/) and set up your 
> fonts for the ranges. That's the way The Unicode Standard code charts are 
> displayed and printed.

For this, do I need to know which fonts support what? Or can I just use a bulk 
approach: for all characters, try these fonts in the given order (as the 
solution for Firefox provided in the OP allows one to do)?

Thanks,
Ilya





RE: Ways to show Unicode contents on Windows?

2013-07-09 Thread Murray Sargent
Ilya asked, "Are there any other ways to show Unicode on Windows?"

You can download Unibook (http://www.unicode.org/unibook/) and set up your 
fonts for the ranges. That's the way The Unicode Standard code charts are 
displayed and printed.

Murray





RE: Word reversal from Adobe to Word

2013-02-08 Thread Murray Sargent
Albrecht notes that

The complete RTF clipboard  content is this, created by "Adobe Acrobat 9 Pro, 
Version 9.5.1":

0000:  7B 5C 72 74 66 31 5C 61 6E 73 69 5C 61 6E 73 69  {\rtf1\ansi\ansi
0010:  63 70 67 31 32 35 32 5C 75 63 31 20 7B 5C 66 6F  cpg1252\uc1 {\fo
0020:  6E 74 74 62 6C 5C 66 30 5C 66 6E 69 6C 5C 66 63  nttbl\f0\fnil\fc
0030:  68 61 72 73 65 74 31 37 37 20 5C 27 34 31 5C 27  harset177 \'41\'
0040:  37 32 5C 27 36 39 5C 27 36 31 5C 27 36 43 3B 7D  72\'69\'61\'6C;}
0050:  5C 70 61 72 64 5C 70 6C 61 69 6E 5C 71 6C 5C 66  \pard\plain\ql\f
0060:  30 5C 66 73 32 30 20 7B 5C 66 73 35 36 20 5C 75  0\fs20 {\fs56 \u
0070:  31 35 31 31 20 5C 27 46 37 5C 75 31 34 39 33 20  1511 \'F7\u1493
0080:  5C 27 45 35 5C 75 31 34 39 31 20 5C 27 45 33 5C  \'E5\u1491 \'E3\
0090:  75 31 35 30 32 20 5C 27 45 45 7D 7D 00 00        u1502 \'EE}}

Collecting the RTF, we have

{\rtf1\ansi\ansicpg1252\uc1 {\fonttbl\f0\fnil\fcharset177 \'41\'72\'69\'61\'6C;}
\pard\plain\ql\f0\fs20 {\fs56 \u1511 \'F7\u1493\'E5\u1491 \'E3\u1502 \'EE}}

This displays correctly (קודמ) in Word 2013 anyhow. \fcharset177 is correct for 
Hebrew. 
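As a sanity check, the \uN-with-fallback convention in this RTF can be decoded mechanically. The following Python sketch is an illustration, not a full RTF parser; it assumes \uc1 (one ANSI fallback byte after each \uN) and decodes any lone \'hh bytes as cp1252:

```python
import re

def decode_rtf_unicode(rtf, uc=1):
    """Decode RTF \\uN escapes, skipping `uc` fallback \\'hh bytes after each."""
    token = re.compile(r"\\u(-?\d+) ?|\\'([0-9A-Fa-f]{2})")
    out = []
    i = 0
    while i < len(rtf):
        m = token.match(rtf, i)
        if not m:
            out.append(rtf[i])
            i += 1
            continue
        i = m.end()
        if m.group(1) is not None:
            n = int(m.group(1))
            out.append(chr(n + 65536 if n < 0 else n))  # negative \uN wraps at 2^16
            for _ in range(uc):  # skip the ANSI fallback byte(s), e.g. \'F7
                fb = token.match(rtf, i)
                if fb and fb.group(2) is not None:
                    i = fb.end()
        else:
            out.append(bytes.fromhex(m.group(2)).decode("cp1252"))  # assumed charset
    return "".join(out)

print(decode_rtf_unicode(r"\u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE"))  # קודמ
```

Applied to the run above, this yields U+05E7 U+05D5 U+05D3 U+05DE in logical order, confirming that the clipboard RTF itself is correct.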

In contrast to what I said below, it also works for \fcharset0, which is 
ANSI_CHARSET and LTR. So I can't get it to display incorrectly. Word treats 
paste a little differently from opening a file, so conceivably there's a 
problem there. Can you save the RTF above as a file and open the file in your 
Word to see how it displays?

Murray




RE: Word reversal from Adobe to Word

2013-02-07 Thread Murray Sargent
In this simple RTF, Word takes the \fN pretty seriously. You need to specify a 
charset with the desired directionality. Word has more sophisticated RTF to 
handle directionality, but without it, you need to define the \fN correctly. 
The idea is that you can overrule the directionality by claiming the script has 
the reverse directionality. This enables Word to write RTF that represents an 
LRO...PDF embedding.

Murray

-Original Message-
From: Asmus Freytag [mailto:asm...@ix.netcom.com] 
Sent: Thursday, February 7, 2013 9:28 PM
To: Murray Sargent
Cc: Dreiheller, Albrecht; Raymond Mercier; unicode@unicode.org
Subject: Re: Word reversal from Adobe to Word

How come I'm not surprised to see the problem traced to an RTF format 
incompatibility. Trying to figure out which parts of the RTF spec to support 
when is nearly impossible...

A./


On 2/7/2013 8:08 AM, Murray Sargent wrote:
> If you include a {\fonttbl...} entry that defines \f0 as an Arabic 
> font, Word displays it correctly. For example, include 
> {\fonttbl{\f0\fswiss\fcharset177 Arial;}}
>
> as in
>
> {\rtf1{\fonttbl{\f0\fswiss\fcharset177 Arial;}}
> \pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 
> \'EE} }
>
> This displays as קודמ
>
> Murray
>
> -Original Message-
> From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] 
> On Behalf Of Dreiheller, Albrecht
> Sent: Thursday, February 7, 2013 7:33 AM
> To: Raymond Mercier; unicode@unicode.org
> Subject: RE: Word reversal from Adobe to Word
>
>
> Raymond,
>
>> If I have a Hebrew text displayed in Adobe Acrobat I can select part 
>> of it and can paste it into Word. The trouble is that while 
>> individual characters are correctly displayed the order is reversed.
>> Thus if I have
>> in Acrobat
>> קודמ (meaning 'prior')
>> when pasted into Word I get
>> םדוק
> The Windows clipboard is a "multi-channel" medium, i.e. several different 
> data formats may be supplied at the same time by the sending application.
> The receiving application may choose one of these formats.
>
> Using a clipboard debugging tool, I see that Word fills up to 18 formats, like
> 000D  Unicode Text  (10 Bytes)
> C090  Rich Text Format  (5815 Bytes)
> C10E  HTML Format   (3641 Bytes),
> whereas Adobe fills only 6 formats, e.g.
> 000D  Unicode Text   (11 Bytes)
> C090  Rich Text Format (178 Bytes)
>
> In both cases, the Unicode Text format contains the sequence
> U+05E7, U+05D5, U+05D3, U+05DE in logical order.
>
> When "paste" is used in Word, a high level format is preferred by default, so 
> I suppose the RTF format is the problem here.
>
> Word creates an RTF sequence like
> {\ltrch\fcs1 \af220\afs40\alang1033 \rtlch\fcs0   \f220\fs40\lang1037
> \langnp1033\langfenp2052\insrsid13502069\charrsid6162033\'f7\'e5\'e3\'ee}}
>
> N.B. \'f7\'e5\'e3\'ee  is the CP1255 byte sequence for the Hebrew word above.
>
> Adobe produces this RTF sequence:
> \pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE} 
> which is the right character sequence, but seems to be misunderstood by Word.
>
> A solution is to use the Word command "Paste contents ..." (might be 
> necessary to add it with "Customize"), and then choose "unformatted Unicode 
> text" from the format list.
>
> Albrecht.
>
>
>
>
>
>





RE: Word reversal from Adobe to Word

2013-02-07 Thread Murray Sargent
"Bing" or "google" the clipboard format string. You'll get the answer in the 
first few hits.

Murray

Sent from my Windows Phone

From: Stephan Stiller
Sent: 2/7/2013 8:51 PM
To: Dreiheller, Albrecht
Cc: Raymond Mercier; unicode@unicode.org
Subject: Re: Word reversal from Adobe to Word


The Windows clipboard is a "multi-channel" medium, i.e. several different data 
formats
may be supplied at the same time by the sending application.
The receiving application may choose one of these formats.

Using a clipboard debugging tool, I see that Word fills up to 18 formats

That's valuable information. Where can one find good documentation, and what's 
a good tool that you can recommend?

Stephan



RE: Word reversal from Adobe to Word

2013-02-07 Thread Murray Sargent
If you include a {\fonttbl...} entry that defines \f0 as an Arabic font, Word 
displays it correctly. For example, include {\fonttbl{\f0\fswiss\fcharset177 
Arial;}} 

as in 

{\rtf1{\fonttbl{\f0\fswiss\fcharset177 Arial;}}
\pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE}
}

This displays as קודמ

Murray

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Dreiheller, Albrecht
Sent: Thursday, February 7, 2013 7:33 AM
To: Raymond Mercier; unicode@unicode.org
Subject: RE: Word reversal from Adobe to Word


Raymond,

> If I have a Hebrew text displayed in Adobe Acrobat I can select part 
> of it and can paste it into Word. The trouble is that while individual 
> characters are correctly displayed the order is reversed.

> Thus if I have
> in Acrobat
> קודמ (meaning 'prior')
> when pasted into Word I get
> םדוק

The Windows clipboard is a "multi-channel" medium, i.e. several different data 
formats may be supplied at the same time by the sending application.
The receiving application may choose one of these formats.

Using a clipboard debugging tool, I see that Word fills up to 18 formats, like 
000D  Unicode Text  (10 Bytes)
C090  Rich Text Format  (5815 Bytes)
C10E  HTML Format   (3641 Bytes),
whereas Adobe fills only 6 formats, e.g.
000D  Unicode Text   (11 Bytes)
C090  Rich Text Format (178 Bytes)

In both cases, the Unicode Text format contains the sequence 
U+05E7, U+05D5, U+05D3, U+05DE in logical order.
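Word's 10-byte figure is consistent with the plain-Unicode clipboard format being NUL-terminated UTF-16: four characters plus a terminator is five 16-bit code units. A quick check in Python (illustrative only):

```python
text = "\u05e7\u05d5\u05d3\u05de"               # קודמ in logical order
payload = (text + "\x00").encode("utf-16-le")   # CF_UNICODETEXT: UTF-16, NUL-terminated
assert len(payload) == 10                       # matches the 10 bytes Word reports
```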

When "paste" is used in Word, a high-level format is preferred by default, so I 
suppose the RTF format is the problem here.

Word creates an RTF sequence like
{\ltrch\fcs1 \af220\afs40\alang1033 \rtlch\fcs0   \f220\fs40\lang1037
\langnp1033\langfenp2052\insrsid13502069\charrsid6162033\'f7\'e5\'e3\'ee}}

N.B. \'f7\'e5\'e3\'ee  is the CP1255 byte sequence for the Hebrew word above.

Adobe produces this RTF sequence:
\pard\plain\ql\f0\fs20 {\fs40 \u1511 \'F7\u1493 \'E5\u1491 \'E3\u1502 \'EE} 
which is the right character sequence, but seems to be misunderstood by Word.

A solution is to use the Word command "Paste contents ..." (might be necessary 
to add it with "Customize"), and then choose "unformatted Unicode text" from 
the format list.

Albrecht.







RE: cp1252 decoder implementation

2012-11-20 Thread Murray Sargent
Philippe commented: "(even if later Microsoft decides to map some other 
characters in its own "windows-1252" charset, like it did several times and 
notably when the Euro symbol was mapped)".

Personal opinion, but I'd be very surprised if Microsoft ever changed the 1252 
charset. The euro was added back in 1999 when code pages were still used a lot. 
Code pages in general are pretty much irrelevant today except for reading 
legacy documents. They are virtually never used internally in modern software. 
UTF-8, UTF-16, and UTF-32 are what are used these days.
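The euro addition is visible in any cp1252 decoder: byte 0x80, a C1 control in ISO 8859-1, maps to U+20AC in windows-1252. For example, in Python:

```python
# windows-1252 diverges from ISO 8859-1 in the 0x80..0x9F range
assert b"\x80".decode("cp1252") == "\u20ac"   # EURO SIGN, added in 1999
assert b"\x80".decode("latin-1") == "\x80"    # ISO 8859-1 keeps the C1 control
```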

(But code pages do have the advantage that they are associated with specific 
character repertoires, which amounts to a great hint for font binding...)

Murray


RE: Missing geometric shapes

2012-11-08 Thread Murray Sargent
Mark E. Shoulson wrote: "Mirroring tends to be done for glyphs that are used in 
*pairs*, open/close things and such."

Not invariably; consider the integral and summation. They don't have mirrored 
counterparts and many other mathematical symbols don't either.

Murray




RE: User-Hostile Text Editing (was: Unicode String Models)

2012-07-21 Thread Murray Sargent
For math accents, it's easy since the base is the argument of the accent 
operator. But for clusters the standard practice is for the Delete key to 
delete the whole cluster as you note. Also you can't select just part of a 
cluster to save it from deletion. 

I'd think deleting the first character of a cluster would make a nice 
context-menu option. For example, when you right-click on a cluster, the 
resulting context menu could have an entry like "delete first character". Maybe 
other such options could be added as well.
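At the code-point level the operation itself is trivial; the difficulty lies in the editor's user model, not the data. A hypothetical sketch in Python, treating the cluster as a base code point followed by its combining marks:

```python
cluster = "a\u0301\u0323"   # 'a' + COMBINING ACUTE ACCENT + COMBINING DOT BELOW

# "delete first character": drop the base, keep the marks...
marks = cluster[1:]
# ...so the user can retype the correct base without re-entering the marks
fixed = "e" + marks
assert fixed == "e\u0301\u0323"
```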

Murray

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Richard Wordingham
Sent: Saturday, July 21, 2012 4:52 PM
To: Unicode
Subject: User-Hostile Text Editing (was: Unicode String Models)

On Fri, 20 Jul 2012 23:16:17 +
Murray Sargent  wrote:

> My latest blog post “Ligatures, Clusters, Combining Marks and 
> Variation 
> Sequences<http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx>”
> discusses some of these complications.

Are there any widely available ways of enabling the deleting of the first 
character in a default grapheme cluster?  Having carefully added two or more 
marks to a base character, I find it extremely irritating to find I have 
entered the wrong base character and have to type the whole thing again. As one 
can delete the last character in a cluster, why not the first? It's not as 
though the default grapheme cluster is usually thought of as a single character.

Richard.







RE: Unicode String Models

2012-07-20 Thread Murray Sargent
Mark wrote: “I put together some notes on different ways for programming 
languages to handle Unicode at a low level. Comments welcome.”

Nice article as far as it goes and additions are forthcoming. In addition to 
multiple code units per character in UTF-8 and UTF-16, there are variation 
selectors, combining marks, ligatures, and clusters, all of which imply 
handling variable-length sequences even for UTF-32. Handling the variable 
length code points in UTF-8 and UTF-16 is actually considerably easier than 
dealing with these other sources of variable length. For all cases, you need to 
be able to find "character entity" boundaries for an arbitrary code-unit index.
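The multiple layers of "length" are easy to demonstrate. In Python, whose strings are sequences of code points, a sample string of one base-plus-mark pair and one non-BMP character gives a different count at every level:

```python
s = "e\u0301\U0001d44e"   # 'e' + combining acute, then MATHEMATICAL ITALIC SMALL A

assert len(s) == 3                            # code points
assert len(s.encode("utf-16-le")) // 2 == 4   # UTF-16 code units (surrogate pair)
assert len(s.encode("utf-8")) == 7            # UTF-8 code units: 1 + 2 + 4 bytes
# ...while a grapheme-cluster segmenter would report only 2 "characters"
```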

My latest blog post “Ligatures, Clusters, Combining Marks and Variation 
Sequences” discusses some of these complications.

One amusing thing is that where I work it’s common to use cp to mean “character 
position”, which more precisely is “UTF-16 code-unit index”, whereas in Mark’s 
post, cp is used for codepoint.

Murray




RE: combining: half, double, triple et cetera ad infinitum

2011-11-14 Thread Murray Sargent
QSJN 4 UKR asks, "Why did the Unicode Consortium think that combination of one 
base character and few combining is possible, and combination of few base 
characters with one combining character is not?
E.g. U+0483 "tilda" has to cover a number. Whole number!"

For mathematical constructs in general, you need to use mathematical 
typography. You can see one way to support this in nearly plain text in Section 
3.10 of Unicode Technical Note #28 
(http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.pdf). To place a 
tilde over a couple of base characters, you can use the wide accents encoded in 
Unicode as Karl Pentzlin pointed out.

And මේ අකුරු ලියා ඇත්තේ යුනිකෝඩ් අකුරෙනි looks like Sinhala text to me.

Murray





RE: Solidus variations

2011-10-07 Thread Murray Sargent
In the linear format of UTN #28, 1/2/3/4 builds up as ((1/2)/3)/4 as in 
computer languages like C. The notation actually started with C semantics and 
then added a larger set of operators, and finally adopted the full Unicode set 
of mathematical operators. You can try it out in Microsoft Office applications. 
Different groupings can be obtained by using parentheses, which may be 
discarded after build up as explained in UTN #28. As Asmus points out, I 
started working on this notation back in the late 1970's and the latest version 
is built into a number of popular products. So it's pretty thoroughly tested.
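The left-to-right grouping can be sketched in a few lines of Python (illustrative only; the actual linear-format parser handles precedence, brackets, and the full operator set):

```python
def group_left(expr):
    """Group a chain of '/' left-associatively, as in C: 1/2/3/4 -> (((1/2)/3)/4)."""
    parts = expr.split("/")
    tree = parts[0]
    for p in parts[1:]:
        tree = "(" + tree + "/" + p + ")"
    return tree

assert group_left("1/2/3/4") == "(((1/2)/3)/4)"
```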

Murray




RE: Solidus variations

2011-10-07 Thread Murray Sargent
One set of examples of the use of these solidus variations occurs in the 
mathematics linear format described in Unicode Technical Note #28 
(http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.pdf). The ASCII 
solidus (U+002F) described in Section 2.1 is used to represent normal stacked 
fractions. So a/b automatically builds up to a "over" b separated by a 
horizontal fraction bar. The fraction slash (U+2044) is used to input skewed 
fractions as described later in Section 2.1 along with the division slash 
(U+2215), which is used to enter large linear fractions. In this approach, the 
full-width solidus (U+FF0F) is treated as an alias for the ASCII solidus to 
expedite equation entry with East-Asian IMEs. 
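The mapping just described can be summarized as a small lookup; this is an informal sketch of the input conventions, not the actual implementation:

```python
SOLIDUS_ROLE = {
    "\u002f": "stacked fraction (built up with a horizontal bar)",  # /
    "\u2044": "skewed fraction",                                    # ⁄
    "\u2215": "large linear fraction",                              # ∕
}

def solidus_role(ch):
    # U+FF0F is folded onto U+002F to ease entry with East-Asian IMEs
    if ch == "\uff0f":
        ch = "\u002f"
    return SOLIDUS_ROLE[ch]

assert solidus_role("\uff0f") == solidus_role("/")
```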

U+2215 is a mathematical operator, but the other three appear outside "math 
zones" in ordinary text. U+FF0F is used in contexts where other full-width 
Latin letters are present, e.g., in vertical East-Asian layouts. The fraction 
slash is used to display arbitrary skewed fractions such as ½ when they aren't 
encoded in Unicode. This is a mathematical context, albeit a simple one.

The ASCII solidus is used in various nonmathematical contexts (dates, 
alternatives) and reminds one of the ASCII hyphen-minus (U+002D) which also has 
multiple uses. Unicode has other "slashes" such as the U+27CB RISING DIAGONAL. 
I have a UTC action item to update Unicode Technical Report #25 with some 
discussion about U+27CB, so I'll generalize Section 2.15 "Fraction Slash" of 
that report to compare the usages of the various solidi.

Murray

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Hans Aberg
Sent: Friday, October 07, 2011 7:08 AM
To: Unicode Mailing List
Subject: Solidus variations

There are several solidus (slash) variations. What is the intent of those, in 
as much there been expressed, in a mathematical context?

For example, is U+2044 intended for rational numbers, and U+2215 a long 
variation of U+002F, which can be used to disambiguate a/b/c/d as in a/b∕c/d = 
(a/b)/(c/d)? And is U+FF0F intended for non-math use?

Hans


/ U+002F SOLIDUS
⁄ U+2044 FRACTION SLASH
∕ U+2215 DIVISION SLASH
/ U+FF0F FULLWIDTH SOLIDUS









RE: RTL PUA?

2011-08-22 Thread Murray Sargent
It's actually quite easy to convince Uniscribe to treat specific characters as 
RTL, others as LTR, and, in general, with whatever classifications you desire. 
Pass a preprocessed string to Uniscribe's ScriptItemize(). RichEdit has used 
that approach to some degree starting with RichEdit 3.0 (Windows/Office 2000). 
It's also a handy way to force all operators to be treated as LTR in an LTR 
math zone and as RTL in an RTL math zone (aside from numeric contexts for '.' 
and ','). And you can force IRIs to display LTR or RTL that way by classifying 
the delimiters such as the dots in the domain name accordingly. Some of my blog 
posts on http://blogs.msdn.com/b/murrays/ discuss this in greater detail.
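The preprocessing idea can be illustrated abstractly: before itemization, substitute a strong-RTL stand-in for each character you want classified as RTL, then apply the resulting run boundaries to the original text. A hypothetical Python sketch (the specific PUA code points here are invented for illustration):

```python
# PUA code points the application wants treated as right-to-left (hypothetical)
RTL_PUA = {"\ue000", "\ue001"}
STANDIN = "\u05d0"   # any strong-RTL character, e.g. HEBREW LETTER ALEF

def preprocess_for_itemizer(text):
    """Build the string handed to the itemizer (e.g. ScriptItemize).

    Runs of stand-ins come back classified as RTL; the run boundaries are
    then applied to the original, unmodified text for shaping and display.
    """
    return "".join(STANDIN if c in RTL_PUA else c for c in text)

assert preprocess_for_itemizer("ab\ue000\ue001") == "ab\u05d0\u05d0"
```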

So there's no need to change the properties of the PUA to establish PUA RTL 
conventions. They won't be generally interchangeable, but that's the nature of 
the PUA. You also have to implement such choices using rich/structured text. 
Plain text doesn't have a place to store the necessary properties. Most text is 
rich text anyway.

Murray




RE: Combining Triple Diacritics (N3915) not accepted by UTC #125

2010-11-10 Thread Murray Sargent
You can put diacritics over an arbitrarily large base by using an accent object 
in a math zone. For example, in my email editor (Outlook), I type alt+= to 
insert a math zone and then (a+b)\tilde to get



(wide tilde over a+b). Evidently linguistic analysis is yet another field in 
which mathematical typography is useful.



Murray



RE: number padless?

2010-08-06 Thread Murray Sargent
Type F1 alt+x, where F1 means the letter F key followed by the 1 key, not 
Function key 1. U+00F1 is the Unicode value of ñ. In general to type in a 
character by its Unicode value, type in the hex value and then alt+x. E.g., to 
type in math italic a, type 1D44E alt+x, which gives 𝑎.
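The conversion itself is just hex-to-code-point. In Python terms:

```python
# What alt+x does conceptually: hex digits -> code point -> character
assert chr(int("F1", 16)) == "\u00f1"          # ñ
assert chr(int("1D44E", 16)) == "\U0001d44e"   # MATHEMATICAL ITALIC SMALL A (𝑎)
```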

Murray


RE: number padless?

2010-08-06 Thread Murray Sargent
In some Microsoft products, e.g., Word, WordPad, OneNote and Outlook, you can 
type ctrl+~ followed by n to get ñ. Or you can type F1 alt+x to get ñ. The 
alt+x conversion of hex Unicode values is easier than the alt+numpad approach, 
since the Unicode Standard is in hex.

Murray

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of ChiGuy
Sent: Friday, August 06, 2010 2:55 PM
To: unicode@unicode.org
Subject: number padless?

Hey all,

Quickie question-
I got a new laptop, but there is no number pad.  Not even one integrated with 
the function keys.
Any idea how I can make special characters for which the number pad is required?
Example:  In Spanish, tomorrow is mañana.  How can I make the enye (code was 
alt-0241, made now with the charmap) with the keyboard?

thanks!


RE: Plain text (was: Re: High dot/dot above punctuation?)

2010-07-28 Thread Murray Sargent
Doug comments:

> Murray Sargent  wrote:

>> It's worth remembering that plain text is a format that was introduced 
>> due to the limitations of early computers. Books have always been 
>> rendered with at least some degree of rich text. And due to the 
>> complexity of Unicode, even Unicode plain text often needs to be 
>> rendered with more than one font.

> I disagree with this assessment of plain text.  When you consider the basic 
> equivalence of the "same" text
> written in longhand by different people, typed on a typewriter, 
> finger-painted by a child, spray-painted
> through a stencil, etc., it's clear that the "sameness" is an attribute of 
> the underlying plain text.  None of
> these examples has anything to do with computers, old or new.

> I do agree that rich text has existed for a long time, possibly as long as 
> plain text (though I doubt that, when
> you consider really early writing technologies like palm leaves), but I don't 
> think that refutes the independent
> existence of plain text.  And I don't think the need to use more than one 
> font to render some Unicode text
> implies it isn't plain text.  I think that has more to do with aesthetics (a 
> rich-text concept) and technical limits
> on font size.

My comments were to some degree hyperbole, in the hope that people fixated on 
plain text would be encouraged to think a little more broadly. Plain text 
underlies all rich text and in that capacity, it's been around since mankind 
started scribing. And plain text can have exotic formatting, e.g., gradient 
color; it's just that the formatting has to be uniform for all the text, rather 
than for parts (runs) of the text. One can regard the need for more than one 
font to render Unicode text as an implementation detail. But as a practical 
matter, it means that rendering/editing engines need to be able to handle a 
fair amount of richness. The RichEdit library used in Windows and Office takes 
advantage of that fact in providing plain-text controls as well as rich-text 
controls.

Murray




RE: High dot/dot above punctuation?

2010-07-28 Thread Murray Sargent
> Michael asks, "Are or will be OT features supported in, say, filenames?" The 
> answer depends on the
> renderer. For example, if you display filenames in NotePad using the Calibri 
> font, default English
> ligatures are used automatically using OpenType table info.

> I meant on the desktop or in the Finder or Explorer.

I don't see them used in the Windows 7 Explorer, but that's no guarantee they 
won't be used in the next version :-) Here I'm assuming you mean OT features 
for English text. OpenType features are used extensively in shaping complex 
script text in general and in complex-script filenames in particular.

Murray




RE: High dot/dot above punctuation?

2010-07-28 Thread Murray Sargent

Michael asks, "Are or will be OT features supported in, say, filenames?" The 
answer depends on the renderer. For example, if you display filenames in 
NotePad using the Calibri font, default English ligatures are used 
automatically using OpenType table info.

Murray







RE: High dot/dot above punctuation?

2010-07-28 Thread Murray Sargent
Asmus asks, "Which implementation makes the required context analysis to 
determine whether 002E is part of a number during layout? If it does make this 
determination, which OpenType feature does it invoke? Which font supports this 
particular OpenType feature?"

I haven't looked to see if our various OpenType engines analyze the context of 
002E to treat numerical contexts in a special way. But both 002E and 002C 
(COMMA) are handled contextually in the build up of UTN #28 linear format 
mathematical expressions and in the rendering thereof. In particular, note that 
in nonnumerical contexts, period and comma are followed by some extra spacing 
in math zones, but as parts of numbers, that extra spacing is omitted. Also in 
RtL math, the period and comma are displayed LtR when part of a number, but RtL 
otherwise. So contextual analysis of these characters is quite important.
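A minimal version of that context test might look as follows (a sketch only; the real build-up engine considers far more context than adjacent digits):

```python
def in_numeric_context(text, i):
    """True if the '.' or ',' at index i is flanked by digits, as in 3.14."""
    return (text[i] in ".," and
            0 < i < len(text) - 1 and
            text[i - 1].isdigit() and text[i + 1].isdigit())

assert in_numeric_context("3.14", 1)        # part of a number: no extra spacing
assert not in_numeric_context("f(x).", 4)   # punctuation: gets operator spacing
```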

Murray








RE: Pashto yeh characters

2010-07-28 Thread Murray Sargent
Andreas Prilop commented "A native speaker of English does not /automatically/ 
know better about English grammar, English punctuation than an informed 
Frenchman." So true, so true. Most native speakers of English have only limited 
understanding of English grammar. At least in my country. They regularly 
confuse she and her, he and him, adverbs and adjectives, etc. Sigh.

Murray






RE: High dot/dot above punctuation?

2010-07-28 Thread Murray Sargent
Contextual rendering is getting to be more common thanks to adoption of 
OpenType features. For example, both MS Publisher 2010 and MS Word 2010 support 
various contextually dependent OpenType features at the user's discretion. The 
choice of glyph for U+002E could be chosen according to an OpenType style. 

It's worth remembering that plain text is a format that was introduced due to 
the limitations of early computers. Books have always been rendered with at 
least some degree of rich text. And due to the complexity of Unicode, even 
Unicode plain text often needs to be rendered with more than one font.

Murray





RE: Why does EULER CONSTANT not have math property and PLANCK CONSTANT does?

2010-07-28 Thread Murray Sargent
Alex notes "Operands are not operators, e.g. in a+b, a and b are operands, + is 
an operator." I'm sure Karl Williamson knows that, but the mathematical 
alphanumerics also aren't operators and they nevertheless have the math 
property. We need to change the description of the math property to include all 
characters that are used primarily for math and the EULER CONSTANT is such a 
character.

Murray







RE: Generic Base Letter

2010-06-29 Thread Murray Sargent
Vincent asks, "So how does one go about getting buy-in? Are the interested 
parties on this mailing list, or do you have contact information for decision 
makers in the various voting organizations?"

I think you, Khaled, Michael and others have made a very good case for having 
some way to render multiple combining marks on a base character that doesn't 
belong to any particular script. A special character for the purpose may be the 
right way, NBSP may be okay (if people would only implement multiple combining 
marks with it) or generalizing shaping engines to allow combining marks to be 
used with arbitrary bases (Khaled's approach) may be better. We need to 
brainstorm a bunch to figure out how hard things are. I can't promise anything 
other than to say I'll discuss it with my colleagues at Microsoft and the UTC. 
Unfortunately I don't see any easy workaround for you while a solution is being 
pursued. If displaying no base character is adequate, you could use a currently 
valid base character and change its color to white. Then at least you wouldn't 
see a base character, unless it's on a darker background.

Murray






RE: Generic Base Letter

2010-06-28 Thread Murray Sargent
Khaled notes: "There are so many issues with MS implementation(s), for example 
you can not combine any arbitrary Arabic diacritical marks on any given base 
character. I don't think Unicode need to invent workaround broken vendor 
implementations, interested parties should instead pressure on that vendor to 
fix its implementation(s)."

The MS Office math facility allows combining marks in the range U+0300..U+036F 
and most in the range U+20D0..U+20F0 to be applied to any base character(s) 
including complicated mathematical expressions. Such generality is needed in 
mathematics, since tildes, hats, bars, etc., are displayed over multiple base 
characters such as the expression a+b. Hebrew and Arabic combining marks aren't 
currently treated as valid mathematical combining marks, so the sequence U+25CC 
U+05BC U+05B8 doesn't render as Vincent desires in a math zone. It seems 
reasonable to allow all Unicode combining marks as accents in math zones.
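The accepted ranges quoted above amount to a simple predicate, and the proposed generalization would widen it to every combining mark. An illustrative check:

```python
import unicodedata

def is_math_accent(ch):
    """Combining marks accepted as math accents per the ranges cited above."""
    cp = ord(ch)
    return 0x0300 <= cp <= 0x036F or 0x20D0 <= cp <= 0x20F0

assert is_math_accent("\u0303")           # COMBINING TILDE
assert not is_math_accent("\u05bc")       # HEBREW POINT DAGESH: currently rejected,
assert unicodedata.combining("\u05bc")    # even though it is a combining mark
```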

Murray




RE: Unicode math examples

2010-06-09 Thread Murray Sargent
Doug asks, " Can anyone point me to some *real-world* examples of mathematics 
text encoded in Unicode, including (especially) the Mathematical Alphanumeric 
Symbols starting at U+1D400?"

Here are two documents with such text:

Unicode Technical Report #25 "Unicode Support for Mathematics" 
(http://www.unicode.org/reports/tr25/)
Unicode Technical Note #28 "Unicode Nearly Plain-Text Encoding of Mathematics" 
(http://www.unicode.org/notes/tn28/).

Murray




RE: Unicode Ruby

2004-12-19 Thread Murray Sargent
A couple of notes on Word's support. Word has been based on Unicode since
Word '97, although it certainly didn't support all of Unicode at that
time. Word has displayed ruby in built-up form for several versions now
(the name for it is under Asian formatting and called "phonetic guide").

Murray 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Dean Snyder
Sent: Saturday, December 18, 2004 11:53 AM
To: Unicode List
Subject: Unicode Ruby

Can anyone recommend common and/or cross-platform technologies that
render Unicode ruby text in ways other than simply enclosing it within
trailing parentheses (in other words, technologies that would place it
above the annotated text and in a smaller font size, as is typically
done traditionally)? By technologies I'm thinking of things like
internet browsers, email clients, word processors, desktop publishing
programs, computer operating systems, and cross-platform programming
platforms (like Java).

-

So far I've only checked on Mac OS X 10.3.6 and found the following:

Browsers
Safari, Firefox, Internet Explorer, and OmniWeb all display Unicode ruby
in parentheses. (Internet Explorer does, however, display *HTML* ruby
above and smaller.)

Word Processors
Nisus Writer Express, Mellel, and TextEdit all display Unicode ruby in
parentheses. I don't know about the latest Microsoft Word, which I
understand does some Unicode, but the previous one, of course, didn't
even do Unicode at all.

Email Clients
PowerMail and Apple Mail use parentheses.

Desktop Publishing/Graphics
Adobe InDesign, Photoshop, and Illustrator (all CS) use parentheses.

Computer Operating Systems
I can only assume, based on application behavior, that the Mac OS X
default rendering of Unicode ruby is with parentheses, but I haven't
explicitly checked the API documentation for this yet. I am not
qualified to comment on Windows XP or Linux.

Java
I haven't checked the various Java virtual machines yet, but plan to do
so.

I would be very interested if anyone could provide similar information
for Windows and Linux.

---

Frankly I am disappointed with the results so far. It seems like
everyone has taken the easy and ugly way out. I'm particularly surprised
and disappointed by InDesign, a great page layout and desktop publishing
application. But I would also have thought that the browsers would have
been motivated to do better; even Internet Explorer, which shows that it
is both doable and desirable by implementing it for html ruby, punted
when it came to Unicode ruby.

Isn't this basically just unacceptable for Japanese readers? Do we
really put out computer operating systems localized for Japanese users
without OS support for super-posed ruby?

Anyway, my interest is in applying the ruby mechanism to cuneiform text,
where, similar to Japanese, there is a one-to-many relationship between
any given single (ideographic) character and its many possible context-
free realizations. It would be important not to clutter the visual
cuneiform text with roman-transliterations in parentheses after every
character.

I know custom software can handle ruby any way it wants to, and I am
working on such software, but at the same time it is very important that
operating systems and major software do the right thing here - users do
not want to keep their text isolated in custom applications. And,
anyway, shouldn't this already be in place and ubiquitous given the
importance of properly supporting the Japanese script?

---

An interesting aside: it is particularly felicitous to note that the
typical practice of rendering ruby text in smaller font sizes than the
text it annotates happens to be a PERFECT match for the needs of
rendering annotated cuneiform plain text. All one needs to do is to look
at the visual complexity of cuneiform glyphs to realize that, in order
to be distinguishable on foreseeable display technologies, cuneiform
glyphs need to be rendered in relatively larger font sizes than, say,
Roman text. And exactly analogous to the Japanese situation, the
secondary glyphs used for annotation of cuneiform happen to be
glyphically simpler than the primary glyphs, thereby permitting the
reduction in size that emphasizes their secondary nature. A nice
coincidence the benefits of which cuneiformists will simply inherit - no
work request will be added to anybody's agenda (any implementor that
does the right thing for Japanese will, by definition, be doing the
right thing for cuneiform).
It's always nice when such unforeseen things happen.


Respectfully,

Dean A. Snyder

Assistant Research Scholar
Manager, Digital Hammurabi Project
Computer Science Department
Whiting School of Engineering
218C New Engineering Building
3400 North Charles Street
Johns Hopkins University
Baltimore, Maryland, USA 21218

office: 410 516-6850
cell: 717 817-4897
www.jhu.edu/digital

RE: Wide Characters in Windows and UTF16

2004-08-11 Thread Murray Sargent
Wide characters in Windows 2K and XP are used for UTF-16 for most
programs that I know of including the Microsoft Office suite and OS
programs such as NotePad and WordPad. Windows 9x has limited Unicode
support, but many programs do use wide characters for UTF-16 on Windows
9x as well.

Murray 
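[The relationship between Windows wide characters and a 4-byte Linux wchar_t comes down to UTF-16 code units versus code points; a minimal Python sketch, used here only for illustration:]

```python
def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units for a string, endianness factored out."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

# BMP characters: one 16-bit unit, numerically equal to the code point,
# so wide chars and UTF-32 agree for them.
print(utf16_units("A"))            # [65]

# Supplementary characters: two units (a surrogate pair), which is where
# 16-bit wide characters and a 32-bit wchar_t differ.
print(utf16_units("\U0001D456"))   # [55349, 56406] = [0xD835, 0xDC56]
```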

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Abhishek Agrawal
Sent: Wednesday, August 11, 2004 2:08 AM
To: [EMAIL PROTECTED]
Subject: Wide Characters in Windows and UTF16

Hi,

I am a new member of this mailing list. I have browsed the web extensively
to find out whether "wide characters in Windows (9x, 2K, XP) are a subset of
UTF-16 in Linux, without a difference in endianness".

I have tried almost 100 sites so far on this topic, of which the best one
is the following:
http://www.google.co.in/search?q=cache:9C4Hm-SUytAJ:developer.r-project.
org/Encodings_and_R.html+windows+wide+characters+ucs2&hl=en

Thanking you in advance for your help.

regards,
Abhishek





RE: Surrogates in WordPad

2004-02-01 Thread Murray Sargent

Type the UTF-32 code for the character instead of the surrogate pair. For
example, to get a math italic i, type 1D456 and then Alt+x. Lone surrogate
codes aren't desirable. RichEdit does allow the high code to come in alone
via the WM_CHAR message, since some IMEs can only work with 16 bits at a
time. WM_UNICHAR takes the UTF-32 code as well.
 
Murray
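[The conversion between a surrogate pair and the single UTF-32 code that Alt+x accepts is simple arithmetic; a minimal Python sketch, illustrative rather than RichEdit's actual code:]

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_pair(hi: int, lo: int) -> int:
    """Recombine a high/low surrogate pair into the UTF-32 code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

hi, lo = surrogate_pair(0x1D456)   # math italic i
print(hex(hi), hex(lo))            # 0xd835 0xdc56
assert from_pair(hi, lo) == 0x1D456
```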


From: [EMAIL PROTECTED] on behalf of 
Laurentiu IancuSent: Sat 1/31/2004 5:47 PMTo: 
[EMAIL PROTECTED]Subject: Surrogates in 
WordPad

Hello,In Windows 2000 WordPad I can type a high 
surrogate code, press Alt+X, thentype a low surrogate code, press Alt+X 
again and obtain the correspondingcharacter.  This does not seem to 
work in Windows XP WordPad.  Is there a wayto enable this functionality 
in XP WordPad?My guess is that the new RichEdit control might not accept 
surrogates in orderto prevent isolated surrogates from being entered.  
I would appreciate anyqualified comments.Thank 
you,Laurentiu




Does Java 1.5 support Unicode math alphanumerics as variable names?

2004-01-23 Thread Murray Sargent

E.g., math italic i (U+1D456)? With such usage, Java mathematical programs could look more like the original math.


Thanks

Murray
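[Whether a language accepts U+1D456 in identifiers depends on its identifier rules; a quick Python check of the relevant Unicode properties, with Python shown only as a stand-in for whatever the Java 1.5 rules turn out to be:]

```python
import unicodedata

ch = "\U0001D456"  # MATHEMATICAL ITALIC SMALL I
print(unicodedata.category(ch))  # Ll (lowercase letter), so it qualifies as
                                 # identifier-start material in languages whose
                                 # rules follow Unicode general categories
print(ch.isidentifier())         # True in Python, for comparison
```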





RE: Code points on Windows

2004-01-14 Thread Murray Sargent
Raymond Mercier wrote: "In MS Word if you type the Unicode code point,
followed by Alt-X, you get the character (if you have the font). This
works in reverse.  Sometimes in a RichEdit control window it will work
in the first direction, but not in reverse.  
It does not work in Wordpad, in spite of its use of RichEdit. I don't
know why not."

WordPad on Windows 2000 and XP support Alt+x. Win95 and Win98 WordPads
don't, since they used earlier RichEdit's than version 3.0. Version 3.0
doesn't have the toggle: Alt+x converts a hex code string to the Unicode
character; Alt+X does the reverse. Word 2002 added the Alt+x facility
with the nice wrinkle of making it a toggle. Accordingly RichEdit 4.0
(used in Office 2002) and RichEdit 4.1 (used in Windows XP SP1 WordPad
and later) also have the toggle, as does RichEdit 5.0 shipped with
Office 2003.

Hope this helps.
Thanks
Murray
 
 



RE: Code points on Windows

2004-01-14 Thread Murray Sargent
Mike Ayers asked: "On Windows, it is well known that you can generate a
character from its code point by holding down the alt key and typing the
code point in decimal, with a leading 0, on the numeric keypad.  I
recall that there is also a method to do this in reverse - given a
character on, say, Wordpad, one can get the Unicode codepoint for that
character (copied to the clipboard, I believe).  However, I have
forgotten how to do this.  Can anyone help me out here?"


It's true that in WordPad you can type in the decimal value of a Unicode
(UTF-32) character value and insert the character. This is valid for
programs that use RichEdit 3.0 or later for editing. But you can often
use a better method with RichEdit controls, the Alt+x method, which uses
hexadecimal characters and is editable on the fly. This works as
follows:
 
You type a character's hexadecimal code (in ASCII), making corrections
as need be, and then type Alt+x. Presto! The hexadecimal code is
replaced by the corresponding Unicode character. The Alt+x can be a
toggle (as in Microsoft Word 2002-2003).  That is, type it once to
convert the hex code to a character and type it again to convert the
character back to a hex code. If the hex code is preceded by one or more
hexadecimal digits, you need to "select" the code so that the preceding
hexadecimal characters aren't included in the code. The code can range
up to the value 0x10FFFF, which is the highest character value in the 17
planes of Unicode. The only problem with this approach is that some
programs use Alt+x for something else (like quit) or the keyboard
doesn't have direct access to ASCII alphabetics. For such programs you
can use the "secret" toggle Ctrl+Shift+Alt+F12 instead of Alt+x (only
works with RichEdit, i.e., not with Word).

Murray 
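[The editing behavior described above can be modeled as follows; a minimal Python sketch in which the function name and the naive "scan back over hex digits" rule are illustrative, not RichEdit's actual implementation:]

```python
def alt_x(text: str) -> str:
    """Apply an Alt+x-style toggle at the end of `text`."""
    hexdigits = "0123456789abcdefABCDEF"
    # Scan back from the cursor (end of string) over hex digits.
    i = len(text)
    while i > 0 and text[i - 1] in hexdigits:
        i -= 1
    code = text[i:]
    if code:
        cp = int(code, 16)
        if cp <= 0x10FFFF:
            return text[:i] + chr(cp)   # hex code -> character
    # No hex code before the cursor: convert the last character back to hex.
    if text:
        return text[:-1] + format(ord(text[-1]), "x").upper()
    return text

s = alt_x("the char 1d456")
print(s)           # the char 𝑖
print(alt_x(s))    # the char 1D456 (the toggle direction)
```

Note the same caveat the text mentions: if ordinary hex-digit letters precede the code, this naive scan swallows them too, which is exactly why a real editor lets you select the code first.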





RE: character map in Microsoft Word

2003-12-11 Thread Murray Sargent

WordPad uses RichEdit 4.1 on Windows XP, and both RichEdit 4.1 and 3.0
support the Alt+NumPad numbers greater than 255 as Unicode values. But
other editors on XP, e.g., NotePad, do not (sigh). The preferred way with
RichEdit is to use the hex code followed by the hot key Alt+x, which
translates the hex code to the actual character value. For example, 1D400
Alt+x inserts a math bold A (although you probably won't see this character
unless you use the Code2001 font or some other font supporting Plane 1).
This also works for Word 2002 and later.
 
Murray
 




RE: How can I input any Unicode character if I know its hexadecimal code?

2003-11-14 Thread Murray Sargent



Patrick asks: «Q. How can I input any Unicode character if I know its
hexadecimal code?»

You could use an app that supports the Alt+x input method (like Word or
WordPad) and then copy the result into an app that doesn't.

For reference, the Alt+x input method works as follows:
 


A handy hex-to-Unicode entry method works with WordPad 2000/XP, Office
2000/XP edit boxes, RichEdit controls in general, and in Microsoft Word
2002. Basically you type a character’s hexadecimal code (in ASCII), making
corrections as need be, and then type Alt+x. Presto! The hexadecimal code
is replaced by the corresponding Unicode character. The Alt+x is a toggle;
that is, type it once to convert the hex code to a character and type it
again to convert the character back to a hex code. If the hex code is
preceded by one or more hexadecimal digits, you need to “select” the code
so that the preceding hexadecimal characters aren’t included in the code.
The code can range up to the value 0x10FFFF, which is the highest character
in the 17 planes of Unicode.

The only problem with this approach is that some programs use Alt+x for
something else (like quit) or the keyboard doesn’t have direct access to
ASCII alphabetics.

It's not patented, so anyone can use it :-)
 
Thanks
Murray


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Patrick Andries
Sent: Friday, November 14, 2003 4:06 PM
To: [EMAIL PROTECTED]
Subject: How can I input any Unicode character if I know its hexadecimal code?


http://www.unicode.org/faq/font_keyboard.html states 
: 
«Q. How can I input any Unicode character if I know its hexadecimal 
code?
A. Some platforms have methods of hexadecimal entry; others have only 
decimal entry.On Windows, there is a decimal input method: hold down the 
alt key while typing decimal digits on the numeric keypad. The ALT+decimal 
method requires the code from the encoding of the command prompt. To enter 
Unicode decimal values, you have to prefix the number with a 0 (zero). E.g. 
ALT+0163 is the pound sign ("£"), in decimal. 
There is a hex-to-Unicode entry method that works with WordPad 2000, 
Office 2000 edit boxes, RichEdit controls in general, and in Microsoft Word 
2002.»
I would like to input arbitrary hexadecimal Unicode 
values in an application (XMetal) which does not 
seem to use the RichEdit control. Unfortunately, I don't seem to be able to 
key in a large decimal value (outside of win 1252) using the ALT+0xxx convention 
in XMetal (I'm on a US Windows XP). Is this normal ?
Is it possible — I suspect not — to use the 
Keyboard Layout Creator to specify a similar behaviour to the RichEdit control 
or the standard ALT+? Something like ALT+X+ would correspond the Unicode character associated to 
that hex value. Would be useful, I think.
 
P. Andries
- o - 0 - o - 
Unicode et ISO 10646 en français
http://pages.infinit.net/hapax
 
 
 


RE: Hexadecimal digits?

2003-11-10 Thread Murray Sargent
 
An important part of Ricardo Niemietz's hex digit proposal
(http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2677) is to have columns of
hexadecimal numbers line up properly as columns of decimal numbers do.
This could be achieved using a font with a set of glyph variants for A-F
with a hexadecimal property. Such a glyph variant table could be added
to, for example, an OpenType font. Then a display engine that encounters
a text run with the hexadecimal property could request the hexadecimal
glyph variants. This mechanism is similar to requesting old-style
numerals when a document so specifies. Seems like a good idea. 

As Mark and Ken have pointed out numerous times, changing other aspects
of hex digits is fraught with compatibility problems.

Thanks
Murray



RE: question about Windows-1252 and Unicode mapping

2003-02-27 Thread Murray Sargent
As KenW pointed out, I meant May 1998, not 1988!

Thanks
Murray

-Original Message-
From: Murray Sargent 
Sent: Thursday, February 27, 2003 3:44 PM
To: 'Yung-Fong Tang'
Cc: John Myers; Takayuki Tei; kat momoi; Naoki Hotta; Cathy Wissink;
[EMAIL PROTECTED]
Subject: RE: question about Windows-1252 and Unicode mapping


I think the Euro at 0x80 for 1252 (and several other 125x code pages)
was added in May 1988. Cathy Wissink can confirm this. It certainly
happened before 1999, since we added support for it in RichEdit 3.0
which shipped with Windows 2000 and Office 2000.

Murray

-Original Message-
From: Yung-Fong Tang [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 27, 2003 3:03 PM
To: [EMAIL PROTECTED]
Cc: John Myers; Takayuki Tei; kat momoi; Naoki Hotta
Subject: question about Windows-1252 and Unicode mapping


Hi Folks:

Does anyone know for sure when MS started to introduce the EURO sign
(0x80 -> U+20AC), LATIN SMALL LETTER Z WITH CARON (0x9e -> U+017E) and
LATIN CAPITAL LETTER Z WITH CARON (0x8e -> U+017D) into Windows-1252?
(Date and which version of Windows? Win98? Win98SE or WinMe?)

It looks like the table in [1] does not include them, but [2] includes
them.

[1] 1252 Windows Latin 1 (ANSI), Appendix H Code Pages, page 464, 
Developing International Software For Windows 95 and Windows NT, Nadine 
Kano, Microsoft Press, ISBN 1-55615-840-8
[2] Microsoft Windows Code Page: 1252 (Latin 1), Appendix I, page 743, 
Developing International Software Second Edition, Dr. International, 
Microsoft, ISBN 0-7356-1583-7

I recently saw some software pass through other (0x80-0x9f) Windows-1252
characters correctly but turn these 3 characters into '?' -- quite an
interesting result.





RE: question about Windows-1252 and Unicode mapping

2003-02-27 Thread Murray Sargent
I think the Euro at 0x80 for 1252 (and several other 125x code pages)
was added in May 1988. Cathy Wissink can confirm this. It certainly
happened before 1999, since we added support for it in RichEdit 3.0
which shipped with Windows 2000 and Office 2000.

Murray
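[The mappings in question can be checked against a current Windows-1252 table; a minimal Python sketch (Python's cp1252 codec implements the post-Euro version of the code page):]

```python
# The three 0x80-0x9F slots under discussion and their Unicode targets.
for byte, expected in [(0x80, "\u20AC"),   # EURO SIGN
                       (0x8E, "\u017D"),   # LATIN CAPITAL LETTER Z WITH CARON
                       (0x9E, "\u017E")]:  # LATIN SMALL LETTER Z WITH CARON
    ch = bytes([byte]).decode("cp1252")
    assert ch == expected
    print(f"0x{byte:02X} -> U+{ord(ch):04X} {ch}")
```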

-Original Message-
From: Yung-Fong Tang [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 27, 2003 3:03 PM
To: [EMAIL PROTECTED]
Cc: John Myers; Takayuki Tei; kat momoi; Naoki Hotta
Subject: question about Windows-1252 and Unicode mapping


Hi Folks:

Does anyone know for sure when MS started to introduce the EURO sign
(0x80 -> U+20AC), LATIN SMALL LETTER Z WITH CARON (0x9e -> U+017E) and
LATIN CAPITAL LETTER Z WITH CARON (0x8e -> U+017D) into Windows-1252?
(Date and which version of Windows? Win98? Win98SE or WinMe?)

It looks like the table in [1] does not include them, but [2] includes
them.

[1] 1252 Windows Latin 1 (ANSI), Appendix H Code Pages, page 464, 
Developing International Software For Windows 95 and Windows NT, Nadine 
Kano, Microsoft Press, ISBN 1-55615-840-8
[2] Microsoft Windows Code Page: 1252 (Latin 1), Appendix I, page 743, 
Developing International Software Second Edition, Dr. International, 
Microsoft, ISBN 0-7356-1583-7

I recently saw some software pass through other (0x80-0x9f) Windows-1252
characters correctly but turn these 3 characters into '?' -- quite an
interesting result.





RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Murray Sargent
I think Doug asked for lightweight. HTML and XML markup aren't
lightweight by any means, although a special purpose plain-text oriented
XML (LTML for language-tagged markup language) might not be that much
more involved than plane 14 tags. It would also have the advantage that
standard XSLT tools could be used to translate between LTML and XHTML,
etc.

Murray

Michael Everson wrote:

>At 21:50 -0800 2002-11-12, Doug Ewell wrote:

>>3.  Is there any method of tagging, anywhere, that is lighter-weight 
>>than Plane 14?  (Corollary: Is "lightweight" important?)

>HTML and XML markup?





RE: Names for UTF-8 with and without BOM

2002-11-01 Thread Murray Sargent
Joseph Boyle says: "It would be useful to have official names to
distinguish UTF-8 with and without BOM."

To see if a UTF-8 file has no BOM, you can just look at the first three
bytes. Is this a problem? Typically when you care about a file's
encoding form, you plan to read the file.

Thanks
Murray
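[The check really is just three bytes, EF BB BF at the start of the file; a minimal Python sketch:]

```python
def has_utf8_bom(data: bytes) -> bool:
    """True if the byte stream starts with the UTF-8 byte order mark."""
    return data[:3] == b"\xEF\xBB\xBF"

print(has_utf8_bom("hello".encode("utf-8-sig")))  # True: codec that writes a BOM
print(has_utf8_bom("hello".encode("utf-8")))      # False: plain UTF-8, no BOM
```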





RE: script or block detection needed for Unicode fonts

2002-09-29 Thread Murray Sargent

John Jenkins wrote:

  "This just seems wildly inefficient to me, but then I'm coming from an
  OS where this isn't done. The app doesn't keep track of whether or not
  a particular font can draw a particular character; that's handled at
  display time. If a particular font doesn't handle a particular
  character, then a fallback mechanism is invoked by the system, which
  caches the necessary data. I really don't see why an application needs
  to check every character as it reads in a file to make sure it can be
  drawn with the set font."

Sigh. If I only had an OS like that to work with! There is the ransom-note
effect, though. Do you try to match the desired font characteristics? I
should note that Windows XP does have limited "font linking" support as
well, which works with system fonts. Unfortunately, system fonts have
limited typographical appeal, so the missing-character/glyph problem is
usually the responsibility of the application.
Murray




RE: script or block detection needed for Unicode fonts

2002-09-28 Thread Murray Sargent

Michael Everson said:
> I don't understand why a particular bit has to be set in 
> some table. Why can't the OS just accept what's in the font?

The main reason is performance. If an application has to check the font
cmap for every character in a file, it slows down reading the file.
Accordingly programs typically check the script bits of a font to see if
the font claims to support a script. If so, the font is accepted. Else
another font that has the appropriate bit is accepted. This info is
cached, so it's very fast.

A problem occurs more often with fonts that claim to support say Greek
or Cyrillic, but only support the most common characters in these
scripts. In RichEdit we now check the cmap for the less common Greek,
Cyrillic, Arabic, etc., characters to ensure that they are in fact in
the font. If not, we switch to some other font that has them.

The problem with a font setting a script bit when the font only has a
single glyph is that that font may then be used for other common
characters in the script, thereby resulting in a missing-character glyph
at display time.

I suppose one could have it both ways by instructing a program to always
check the cmap for a given font, thereby bypassing the more streamlined
algorithms. This would be a handy option for specialized fonts. We'd
need some font convention to turn on this behavior.

Thanks
Murray
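[The two-stage policy described, trust the font's script bit first but verify less-common characters against the cmap, can be sketched as follows. The font records and probe characters below are made-up stand-ins, not any real font API:]

```python
def pick_font(fonts, script, probe_chars):
    """Return the first font claiming `script` whose cmap really covers
    the probe characters; otherwise fall back to a cmap-only match."""
    for font in fonts:  # fast path: script bit plus spot-check of the cmap
        if script in font["scripts"] and all(c in font["cmap"] for c in probe_chars):
            return font["name"]
    for font in fonts:  # slow path: ignore the claimed bits, trust only the cmap
        if all(c in font["cmap"] for c in probe_chars):
            return font["name"]
    return None

fonts = [
    {"name": "ClaimsGreek", "scripts": {"greek"}, "cmap": set("αβγ")},
    {"name": "FullGreek",   "scripts": {"greek"}, "cmap": set("αβγϝϙ")},
]
# ϝ (digamma) is a less-common Greek letter: the first font's claim fails it.
print(pick_font(fonts, "greek", "αϝ"))  # FullGreek
```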




RE: glyph selection for Unicode in browsers

2002-09-26 Thread Murray Sargent

I don't think the idea is that codepage equals language. Rather codepage
equals a writing system, which consists of one or more scripts (e.g., 6
scripts for ShiftJIS). As such the codepage is a useful cue in choosing
an appropriate font for rendering text. In the RichEdit edit engine, we
use a codepage generalization called a CharRep and break Unicode plain
text into runs of text each characterized by a particular CharRep. We
then bind these runs to appropriate fonts for rendering. There are many
additional considerations, so unfortunately this isn't an easy task. But
with enough refinements it works quite well. 

The bottom line is that if text was generated using a particular
codepage it's likely that the creator of that text intended the text to
be rendered with a font that supports that codepage. For text tagged
with no codepage, we do our best to translate the keyboard language to a
CharRep and proceed as above. When neither the keyboard nor codepage
info is available, we use a set of heuristics to break the text into
CharRep runs. Among the many heuristics used are 1) a string containing
Kana is likely to have a Japanese CharRep, and 2) a CJK string that
round trips through CHT, CHS, or ShiftJIS may well belong to those
CharReps. In particular if a CJK string doesn't round trip through CHT,
it's probably not Traditional Chinese.

Murray
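[The round-trip heuristic can be sketched with standard codecs; a minimal Python illustration in which Python codec names stand in for the CHT/CHS/ShiftJIS code pages (big5 for CHT):]

```python
def round_trips(s: str, codec: str) -> bool:
    """True if the string survives encoding to the code page and back."""
    try:
        return s.encode(codec).decode(codec) == s
    except UnicodeEncodeError:
        return False

# A common Han string survives Big5, so Traditional Chinese is plausible.
print(round_trips("漢字", "big5"))   # True
# Hangul isn't in Big5 at all, so the string is probably not CHT text.
print(round_trips("한글", "big5"))   # False
```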




RE: Furigana

2002-08-13 Thread Murray Sargent

I agree. The current thinking is that U+FFF9 - U+FFFB have no
external meaning and shouldn't appear externally, i.e., they are
noncharacters in every way except in the spec (sigh). They can be used
for whatever an implementer wants internally. I mentioned earlier that
the RichEdit edit engine uses them for table-row delimiters, which have
nothing to do with Furigana. Instead, RichEdit 5.0 uses codes from the
U+FDD0 - U+FDEF block for Furigana and various 2D math objects.

Thanks
Murray

-Original Message-
From: Tex Texin [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, August 13, 2002 6:11 PM
To: Murray Sargent
Cc: Michael Everson; [EMAIL PROTECTED]
Subject: Re: Furigana


Murray,

It's true implementers need  some place to attach higher level
protocols, but they don't need specific points for specific
implementations of internal protocols. If they weren't good enough to be
used for exchange, then simply having some unpurposed code points
available for internal use accomplishes the same thing and is available
for other purposes as well. But at the time the annotation characters
were introduced, we were unclear about this.

tex





RE: Furigana

2002-08-13 Thread Murray Sargent

Michael Everson said "Well then they [interlinear annotation characters]
oughtn't to have been encoded."

Michael, you aren't an implementer. When you implement things
unambiguously, you may need internal code points in your plain-text
stream to attach higher-level protocols (such as formatting properties)
to. Such internal code points should not be exported or imported. From
your point of view perhaps, they shouldn't have been encoded. But from
an implementation point of view, they're very handy. Unicode needs to
serve both purposes. For what use would Unicode be if you couldn't
implement it effectively? 

Murray




RE: Furigana

2002-08-13 Thread Murray Sargent

As Ken says the Unicode interlinear annotation characters are for
internal use only. Specifically, their meanings can be different for
different programs. If you have your nice marked up text in memory and
want to export it for use by some program, you need to use a
higher-level protocol that translates the interlinear annotation
characters to a standardized external format, such as HTML. In addition
to U+FFF9 - U+FFFB, there are other characters for internal use only,
namely U+FDD0 - U+FDEF. The meanings of these characters also can (and
do) differ for different programs. Originally it was hoped that the
interlinear annotation characters might be able to describe ruby
adequately, but it became clear that additional information is necessary
to express ruby unambiguously. Hence the UTC adopted them for internal
use only, with associated information presumably stored elsewhere to
resolve the ambiguities.

Frankly IMHO the best thing for a program to do with reading such
characters is to delete them. This isn't quite what one might think from
the Standard since they unfortunately aren't labeled as noncharacters.
But if a program uses them internally with a well defined meaning,
getting them in from an external source can violate the internal usage.
To actually roundtrip these "rogue" characters would require some extra
internal protocol to ignore them when they've been read in. So my edit
engine (RichEdit), which uses them for table row delimiters, simply
deletes them on input and only exports them for RichEdit-specific
contexts.

Murray
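[The input policy described, deleting the annotation anchors and the U+FDD0..U+FDEF sentinels when reading external text so they can't collide with internal uses, can be sketched as follows (illustrative Python, not RichEdit's code):]

```python
# Code points reserved here for internal use: the interlinear annotation
# anchors U+FFF9..U+FFFB and the U+FDD0..U+FDEF noncharacter block.
INTERNAL = set(range(0xFFF9, 0xFFFC)) | set(range(0xFDD0, 0xFDF0))

def scrub(text: str) -> str:
    """Drop internal-use code points from externally supplied text."""
    return "".join(c for c in text if ord(c) not in INTERNAL)

print(scrub("ruby\uFFF9base\uFFFAanno\uFFFBtext"))  # rubybaseannotext
```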

-Original Message-
From: Michael Everson [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, August 13, 2002 7:52 AM
To: [EMAIL PROTECTED]
Cc: Ken Whistler
Subject: Re: Furigana


At 12:11 -0700 2002-08-08, Kenneth Whistler wrote:

>Ah, but read the caveats carefully. The Unicode interlinear annotation 
>characters are *not* intended for interchange, unlike the HTML4  
>tag. See TUS 3.0, p. 326. They are, essentially, internal-use anchor 
>points.

What does this mean? That if I have a text all nice and marked up 
with furigana in Quark I can't export it to Word and reimport it in 
InDesign and expect my nice marked up text to still be marked up?

Surely all Unicode/10646 characters are expected to be preserved in 
interchange. What have I got wrong, Ken?
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com





RE: Typing Unicode via Alt+NumPad

2002-08-10 Thread Murray Sargent

Actually any application using RichEdit 3.0 or later (e.g., WordPad and
often Outlook) treats any value higher than 255 as a Unicode value. Values
less than 255 are also Unicode, except for 0128 - 0159. Note that for
values less than 255, you need to include the leading 0, since otherwise
they are interpreted as DOS codes for backward compatibility.
 
Murray
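[These interpretation rules can be modeled as follows; a Python sketch in which the assumption that the ANSI code page is cp1252 (and the DOS code page cp437) holds only on Western-European systems:]

```python
def alt_numpad(digits: str) -> str:
    """Interpret an Alt+NumPad digit sequence per the rules above."""
    n = int(digits)
    if n > 255:
        return chr(n)  # treated directly as a Unicode value
    # Leading 0: ANSI code page; no leading 0: DOS code page (legacy).
    codec = "cp1252" if digits.startswith("0") else "cp437"
    return bytes([n]).decode(codec)

print(alt_numpad("0223"))  # ß (cp1252 0xDF)
print(alt_numpad("0128"))  # € (one of the 0128-0159 non-Unicode slots)
print(alt_numpad("8211"))  # – (EN DASH, U+2013, straight Unicode)
```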

  -Original Message-
  From: Anto'nio Martins-Tuva'lkin [mailto:[EMAIL PROTECTED]]
  Sent: Sat 8/10/2002 8:37 AM
  To: [EMAIL PROTECTED]
  Subject: Typing Unicode via Alt+NumPad

  I just posted to another list: On Thursday, August 08, 2002, Jorge
  Candeias wrote:

  > all characters are accessible by keying down their numeric code while
  > pressing the ALT key. The codes can be obtained from a number of
  > sources (and vary from language pack to language pack, though all
  > western european languages are included in just one pack), <...> The
  > character ß, for instance, is ALT-0223... so I can write straße
  > without problems. :)

  Yes, but this works only inside the currently selected code page, which
  is 1-byte (256 chars max). Jorge, using CP1252 (aka ANSI, similar to
  Latin1 aka ISO 8859-1), can type a German ess-zet or an Icelandic thorn,
  but not a Cyrillic hard sign, nor an Arabic meem, nor a Devanagari
  virama.

  Which is a pity, since all those chars are usable for other
  manipulations in every version of Windows ever since Win98. Hopefully
  the Alt+NumPad trick will be implemented soon in Windows and other
  systems, allowing general and keyboard- and application-independent
  typing of any given Unicode char.

  --
  António MARTINS-Tuválkin <[EMAIL PROTECTED]>
  R. Laureano de Oliveira, 64 r/c esq., PT-1885-050 MOSCAVIDE (LRS)
  +351 917 511 549
  http://www.tuvalkin.web.pt/bandeira/ | http://pagina.de/bandeiras/
  [I don't envy those who have carriages, teams, and estates; I only envy
  those who drink the water from every fountain.]




RE: Inappropriate Proposals FAQ

2002-07-03 Thread Murray Sargent

Timothy Partridge included the restriction

- No archaic styles of existing characters. E.g. dotless j.

as something inappropriate. Question: how does one code up (presumably
with markup) a caret over a jk pair in a math expression? The dot on the
j should be missing for this case, but how does one communicate that to
a font if there's no code for a dotless j? It seems that dotless j is
needed for some mathematical purposes.

Thanks
Murray




RE: Can browsers show text? I don't think so!

2002-07-02 Thread Murray Sargent

Michael Jansson says:

> There are no technical reasons for why css/html4/xhtml can not produce
every bit as high quality
> as any other page layout format.


Sadly this is currently far from the case. HTML/CSS even including CSS3
is far from a professional document publishing format. It doesn't even
have center/right/decimal tabs and tab leaders, which virtually all WP
systems have. The list of DTP omissions goes on and on. Defining their
own XMLs is the direction that WP systems are going in for interchange.
XSLT can be used to translate between these XMLs to the extent that the
features are translatable. XHTML/CSS is only used as a fallback for
browsers.

Which isn't to say that XHTML/CSS isn't cool. It is. But currently it's
a weak DTP format at best.

Murray




RE: terminology

2002-05-02 Thread Murray Sargent

"Sentinel" is fairly commonly used in computer science and program code for data 
delimiters. "Delimiter" is also a good word for this (I use it in RichEdit code), but 
one may well use "delimiter" to describe a quote character (like U+0022), whereas I've 
never seen "sentinel" used for a quote. As such "sentinel" seems less ambiguous for 
Unicode code points like U+FDD0 - U+FDEF. It would be interesting to know if anyone is 
using these Unicode "noncharacters" for purposes other than sentinels.
 
Thanks
Murray

-Original Message- 
From: Michael Everson [mailto:[EMAIL PROTECTED]] 
Sent: Thu 2002/05/02 09:25 
To: [EMAIL PROTECTED] 
Cc: 
Subject: Re: terminology



At 15:15 -0400 2002-05-02, Tex Texin wrote:
>Sentinel does have a meaning in software, an extension of "guard" to
>mean a delimiting value.
>
>For instance of usage, see:
>http://www.unicode.org/unicode/standard/versions/Unicode3.0.1.html

Try finding another software meaning using this word, please, not one
from Unicode.

>Besides, we are creating terms and definitions here. Like Humpty Dumpty
>says "words mean exactly what I want them to mean." ;-)

And in the world of internationalization this stuff has to be
translated. It has to make sense. Quick-and-dirty Californian
"definitions" cause problems for other people in the world because
the images or idioms may not be universal. Sentinel does not seem to
me to be equivalent to "literal". "Delimiter" seems better.
--
Michael Everson *** Everson Typography *** http://www.evertype.com







RE: Concerning mathematics

2002-03-08 Thread Murray Sargent

Stefan Persson [mailto:[EMAIL PROTECTED]] asks how, in the formula

m_fågel = 1 kg

the italic å would be encoded.
 
 
Mathematics has a set of standard letters for mathematical symbols. They
can include diacritics, which can be expressed using the appropriate
combining marks. In your formula

m_fågel = 1 kg

the m is a mathematical symbol, while fågel is a natural-language
subscript. Italic shouldn't be used for such a subscript, since italic is
used for symbols in mathematical notation (and consequently mathematical
journals will change an italic fågel to an upright fågel in this case).
Else one might construe fågel to be a subscript consisting of the product
of the five variables. Such natural-language text is conveniently done with
characters from the BMP, although you need some kind of markup to turn it
into a subscript. If you insist on using italic for this kind of text and
for characters like the italic ø that aren't used in standard mathematical
notation, you can fall back to markup. Since such usage is extremely rare
and not recommended for mathematical text, it wasn't perceived as important
to represent unambiguously in plain text.
 
Murray
 




RE: How to make "oo" with combining breve/macron over pair?

2002-03-05 Thread Murray Sargent

MathML does have markup to extend diacritics across arbitrary numbers of
characters and it's not likely that MathML would use the CGJ for this
purpose. But it would be handy for representing such expressions in
plain-text Unicode.

Murray 




RE: CRLF vs. LF (was Re: Unicode and end users)

2002-02-21 Thread Murray Sargent

I agree that NotePad ought to be able to display a pure LF file
correctly. Word and WordPad do. However they do translate the LFs to
CRLFs on saving, which limits their interoperability with Unix. It would
be fairly easy to have an option to write LF files, if there's
sufficient interest. 

Murray
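[The interop point, accepting either line-ending convention on read and choosing one on save, can be sketched as follows (illustrative Python, not what Word/WordPad actually do internally):]

```python
def read_lines(data: bytes) -> list[str]:
    """Split text into lines, accepting LF, CRLF, or lone CR alike."""
    return data.decode("utf-8").splitlines()

def save_unix(lines: list[str]) -> bytes:
    """Write LF line endings (Unix convention)."""
    return ("\n".join(lines) + "\n").encode("utf-8")

def save_windows(lines: list[str]) -> bytes:
    """Write CRLF line endings (what Word/WordPad emit on save)."""
    return ("\r\n".join(lines) + "\r\n").encode("utf-8")

lines = read_lines(b"one\ntwo\r\nthree\n")   # mixed input reads fine
print(lines)                                 # ['one', 'two', 'three']
print(save_windows(lines))                   # b'one\r\ntwo\r\nthree\r\n'
```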

-Original Message-
From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, February 21, 2002 8:40 AM
To: Lars Kristan; 'David Hopwood'; [EMAIL PROTECTED]
Subject: Re: CRLF vs. LF (was Re: Unicode and end users)
Importance: Low


From: "Lars Kristan" <[EMAIL PROTECTED]>

> A - When writing, no CR characters will be written (unless read from a

> file). Many programs (like notepad) will not display such files 
> correctly. It is a good question whether this is my problem or 
> notepad's.

Yours -- since you are feeding it files that it does not accept the
format of? Obviously notepad is not a tool for you -- try IE. :-)

> B - When reading, I will get CR characters which are not handled 
> anywhere
in
> my current code. Of course this only happens when reading files that 
> were written 'the old way'.

Well, again that would be you.

The problem here is that you are looking at UNIX defaults and seeing how
poorly Windows handles them -- but that is why Windows is not UNIX. Each
has its own defaults, and only the people who wish to straddle the two
worlds will hit problems here


MichKa

Michael Kaplan
Trigeminal Software, Inc.  -- http://www.trigeminal.com/






RE: Proposing Fraktur

2002-01-29 Thread Murray Sargent

David Starner said:
 
"Fraktur is not a different script from the Latin script, and therefore is
not encoded separately."
 
True, but Fraktur math characters are encoded in plane 1 for use in mathematics. These 
characters are not intended to be used for natural language purposes (unless you 
think of mathematics as a natural language :-), in which case it's probably the only 
truly international natural language.
 
Thanks
Murray




RE: The benefit of a symbol for 2 pi

2002-01-21 Thread Murray Sargent

Capital pi is to product as capital sigma is to summation.

-Original Message- 
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Sun 2002/01/20 02:19 
To: [EMAIL PROTECTED] 
Cc: [EMAIL PROTECTED] 
Subject: Re: The benefit of a symbol for 2 pi



In a message dated 2002-01-19 17:07:34 Pacific Standard Time,
[EMAIL PROTECTED] writes:

> In fact Cajori mentions that
> the capital pi Was used at some point for 6.28... so someone had
> the same idea long before I did.

That is a VERY intriguing thought, one that should be especially worthy of
mention to the AMS.  I thought capital pi already had an established meaning,
but perhaps that is in physics or some other branch of science rather than
mathematics.

-Doug Ewell
 Fullerton, California







RE: ISCII-Unicode Conversion

2001-11-06 Thread Murray Sargent

Marco Cimarosti writes:
> Tom Emerson wrote:
> > One gotcha, that I run into every six months or so, is forgetting that
> > the punctuation characters in the Basic Latin block are classified as
> > Latin script. This trips me up because most of my text processing work
> > involves CJK, so I'll write something to filter latin characters with
> > (in Rosette notation):
> 
> That must be a Rosette-specific behavior: in UTR#24 (and in its database
> ), the only ASCII-range code-points classified as "Latin" are
> the upper- and lower-case letters.

Indeed. It turns out that the Rosette script assignments (in the
version I'm using) predate UTR#24 by three or four years and are based
on the information in  with some hand editing by engineers
long past.

The next major Rosette release, which includes Unicode 3.1 support,
will use the data from UTR#24, and my problem will mostly go away.

-tree

-- 
Tom Emerson  Basis Technology Corp.
Sr. Computational Linguist http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"




RE: Character encoding at the prompt

2001-10-24 Thread Murray Sargent

At the MS_DOS prompt, type "chcp"  to see the active code page. You can also 
use this command to change the active code page.

Murray

-Original Message-
From: Tay, William [mailto:[EMAIL PROTECTED]] 
Sent: Wednesday, October 24, 2001 2:08 PM
To: [EMAIL PROTECTED]
Subject: Character encoding at the prompt


Hi,

Do you have any idea what is the default code page and encoding scheme for MS DOS box 
in WinNT 4? Is there any command that can give me the info? I am trying to input a 
string say "fráç" at the prompt, wondering how the characters are encoded.

How about at the Unix (Solaris 2.6) prompt, what's the default and how to change? 

Thanks. 

Will





RE: GB18030

2001-09-21 Thread Murray Sargent

I think I've figured out a way to find the beginning of a GB18030 character starting 
anywhere in a document. The algorithm is similar to finding the beginning of a DBCS 
character in that you scan backward until you find a byte that can only come at the 
start of a character. The main difference is that you check for being in four-byte 
characters first (those of the form HdHd, where H is a byte in the range 0x81 - 0xFE 
and d is an ASCII digit). If a four-byte character isn't involved (ordinary two-byte 
GB characters don't use d as a trail byte), you revert to the DBCS approach for 
handling the rest of GB18030. 
 
This algorithm is handy when you want to stream in a file in chunks and need to know 
if a chunk ends in the middle of a character. One can also solve this particular 
problem by keeping track of character boundaries from the start of stream, but 
typically more processing is involved.
 
Murray
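The backward-scan idea can be sketched in C. This is a hedged illustration, not Murray's actual code, and it uses a slightly different resynchronization step: bytes 0x00-0x2F, 0x3A-0x3F and 0x7F can never be GB18030 trail bytes, so the sketch backs up to the nearest such anchor (or the buffer start) and then walks forward over whole characters, classifying the four-byte HdHd form before the two-byte form. The function names are invented for this example.

```c
#include <stddef.h>

/* Byte classes used by GB18030 four-byte sequences (HdHd). */
static int gb_lead(unsigned char b)  { return b >= 0x81 && b <= 0xFE; }
static int gb_digit(unsigned char b) { return b >= 0x30 && b <= 0x39; }

/* Length of the character starting at offset i in a buffer of n bytes:
   check the four-byte HdHd form first, then fall back to DBCS logic. */
static size_t gb_char_len(const unsigned char *s, size_t i, size_t n)
{
    if (!gb_lead(s[i]))
        return 1;                      /* single-byte (ASCII) */
    if (i + 3 < n && gb_digit(s[i+1]) && gb_lead(s[i+2]) && gb_digit(s[i+3]))
        return 4;                      /* four-byte form HdHd */
    return 2;                          /* ordinary two-byte GB character */
}

/* Return the offset of the first byte of the character containing pos.
   Bytes 0x00-0x2F, 0x3A-0x3F and 0x7F can never be trail bytes, so a
   byte before one of them (or the buffer start) is a safe boundary. */
size_t gb18030_char_start(const unsigned char *s, size_t n, size_t pos)
{
    size_t i = pos;
    while (i > 0) {                    /* back up to a safe anchor */
        unsigned char b = s[i - 1];
        if (b <= 0x2F || (b >= 0x3A && b <= 0x3F) || b == 0x7F)
            break;
        i--;
    }
    while (i < pos) {                  /* walk forward over whole characters */
        size_t len = gb_char_len(s, i, n);
        if (i + len > pos)
            break;                     /* pos is inside this character */
        i += len;
    }
    return i;
}
```

This is exactly the chunked-streaming use case: given the offset where a chunk ends, the function reports whether that offset splits a character.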

-Original Message- 
From: Carl W. Brown [mailto:[EMAIL PROTECTED]] 
Sent: Fri 2001/09/21 04:56 
To: Charlie Jolly; [EMAIL PROTECTED] 
Cc: 
Subject: RE: GB18030



Charlie,

GB18030 is designed to support all Unicode characters.  It has the capacity
to also encode additional characters.  I know of no plans to do so.

I don't think it will have much effect on Unicode.  Most systems that handle
GB18030 will want to convert it to Unicode first to reduce processing
overhead.  With most of the common MBCS code pages you can determine the
length of the character from the first byte.  With GB18030 you sometimes
have to check the first two bytes.  UTF-8, for example, is an MBCS
character set, but I can still walk backwards through a string.
With GB18030 I must start over from the beginning of the string to find the
start of the previous character.

It is smaller than UTF-8 for Chinese and larger for everyone else.

Carl

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Charlie Jolly
> Sent: Friday, September 21, 2001 1:42 AM
> To: [EMAIL PROTECTED]
> Subject: GB18030
>
>
> GB18030
>
> In what ways will this affect Unicode?
>
> Does it contain anything that Unicode doesn't?
>
>
>
>








RE: Unicode/font questions.

2001-08-01 Thread Murray Sargent

Actually fonts on Windows are normally Unicode based (including MS
Mincho and MS Gothic) and most have in addition some codepage access. So
there is neither a perf hit nor a codepage problem in using such fonts
on NT, Win2000 and WinXP. These considerations are orthogonal to
OpenType.
 
Murray

-Original Message- 
From: Richard, Francois M 
Sent: Wed 2001/08/01 05:40 
To: [EMAIL PROTECTED]; [EMAIL PROTECTED] 
Cc: 
Subject: Unicode/font questions.



Since Win2000 and NT are native Unicode, is it true to say that any use of a
non-Unicode font (in fact most of the fonts on Windows, and in particular
Asian fonts like MS Mincho, MS Gothic) in a Unicode application will
generate a conversion WideCharToMultibyte (to convert the Unicode text to
the specific font codepage)? Is this a big performance hit? Can this create
mapping issues (e.g. Unicode <-> Chinese character encoding)? Are we sure
that if a font is installed on a machine, then the appropriate codepage is
going to be available too (for the conversion)?

What about "extending" a current non-Unicode font to support Unicode? Like a
"MS Mincho Unicode"... It would still be specialized/dedicated to Asian
glyphs, but by using Unicode character encoding, it would not require the
WideCharToMultibyte conversion...

Is Open TrueType related to this?

François








RE: UTF-17

2001-06-22 Thread Murray Sargent

Hey guys, Ken is just kidding. He's evidently tired of the current
plethora of ways to represent Unicode let alone all those new ones being
proposed. Sigh, I am too. Carl, you understand the problem of adding yet
another UTF: you too will probably have to support it.

Murray

Carl Brown asked Ken:

> Can you give us a hint as to what this [UTF-17] would be used for?
> 
> 




RE: converting ISO 8859-1 character set text to ASCII (128)charactet set

2001-06-20 Thread Murray Sargent

If you need to roundtrip 8859-1 through ASCII, you need to use some kind
of escape mechanism inside the ASCII to represent characters that have
their high bit equal to one. A common simple escape is to use the
backslash. So you could represent the codes as \'xx, where xx is the
hexadecimal code. For this to work, you need to represent backslash
itself in some distinctive way like \'5C or maybe \\. Similarly you
could use \a to represent 'a' with the high bit set, that is á. Etc.

Murray
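A minimal sketch of this escape scheme in C (the function name and buffer interface are invented for illustration): bytes with the high bit set become \'xx, and backslash escapes itself, so the output is pure 7-bit ASCII and the original Latin-1 bytes can be recovered unambiguously.

```c
#include <stdio.h>

/* Escape a Latin-1 (ISO 8859-1) byte string into 7-bit ASCII: any byte
   with the high bit set becomes \'xx (two hex digits), and a literal
   backslash becomes \\ so the escape character stays unambiguous. */
void escape_latin1(const unsigned char *in, char *out)
{
    while (*in) {
        if (*in >= 0x80) {
            out += sprintf(out, "\\'%02X", *in); /* e.g. 0xE1 -> \'E1 */
            in++;
        } else if (*in == '\\') {
            *out++ = '\\';
            *out++ = '\\';
            in++;
        } else {
            *out++ = (char)*in++;      /* plain ASCII passes through */
        }
    }
    *out = '\0';
}
```

The inverse transform (scan for backslash, read either a second backslash or 'xx hex digits) recovers the original bytes exactly, which is what makes the round trip lossless.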




RE: More trivia: Misc. Math. Symbols-B and decomposition

2001-06-08 Thread Murray Sargent

It's intriguing to think of an encoding for math symbols that breaks
them down into sequences of pieces. For example, NOT EQUAL could be
EQUAL followed by a slash combining mark.

Maybe some day a "cleanicode" will be developed that handles this and
related characters in a consistent, uniform way. Until then (if that
ever happens), we live in a world where computer systems have evolved
differently and compatibility with existing math character repertoires
has led to encoding composed symbols. While compatibility decompositions
might appear to be a good idea to add at this point, they'd introduce
complexity that doesn't seem warranted for the relatively few symbols
involved.

Thanks
Murray 

-Original Message-
From: Marco Cimarosti [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 08, 2001 3:33 AM
To: '[EMAIL PROTECTED]'
Subject: More trivia: Misc. Math. Symbols-B and decomposition


I was peeping in the "Miscellaneous Mathematical Symbols-B" block
proposed
for Unicode 3.2
(http://www.unicode.org/charts/draftunicode32/U32-2980.pdf),
when I noticed that many of those character could have been composed
using
an existing base character and an existing non spacing mark.

For instance:

- 29B1..29B4 (empty sets) could be composed using various diacritics
(0305,
030A, 20D6, 20D7);

- 29B5..29C3 (circle symbols) could be composed using 20DD (COMBINING
ENCLOSING CIRCLE)

- 29C4..29C8 (square symbols) could be composed using 20DE (COMBINING
ENCLOSING SQUARE)

- 29CC (triangle symbols) could be composed using 20E4 (a new combining
encoding triangle, also in 3.2:
http://www.unicode.org/charts/draftunicode32/U32-20D0.pdf)

But they don't have any compatibility decomposition. And this is also
true
for all existing symbols that could be composed with diacritics. What is
the
rationale for this choice?

_ Marco





RE: Math operators

2001-06-05 Thread Murray Sargent
Unicode has many multiplication signs, e.g., U+00D7, U+00B7, U+2022,
U+2219, U+2299, U+22A0, U+22C6, etc. In this spirit, you can probably
include U+2605 (★)

Murray 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, June 05, 2001 11:59 AM
To: [EMAIL PROTECTED]
Subject: Math operators


Is the multiplication sign in my sig the same thing as the Unicode
multiplication sign? I mean, what I get when I type "Kakeru" into my
word processor?


らんま　★じゅういっちゃん★
　×あかね

To: Gaute B Strokkenes <[EMAIL PROTECTED]>;
Cc: Marco Cimarosti <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 'Raghvendra Sharma' <[EMAIL PROTECTED]>;
Date: 01/06/05 15:43
Subject: Re: Query, please help

>Well, I was assuming he meant things like Greek letters and the 
>mathematical operators at U+2200.
>
>I realize that there are different types of mathematical notation, but 
>I think that's an additional, and more complicated,  issue.  Especially

>given that the Unicode mathematical operators represent shapes and not 
>meaning.
>
>Nonetheless, my basic point is that using images is valid approach, 
>especially given the nature of the problem that the original poster is 
>trying to solve.
>
>@D
>
>- Original Message -
>From: "Gaute B Strokkenes" <[EMAIL PROTECTED]>
>To: "David Gallardo" <[EMAIL PROTECTED]>
>Cc: "Marco Cimarosti" <[EMAIL PROTECTED]>; 
><[EMAIL PROTECTED]>; "'Raghvendra Sharma'" <[EMAIL PROTECTED]>
>Sent: Monday, June 04, 2001 9:10 PM
>Subject: Re: Query, please help
>
>
>> On Mon, 4 Jun 2001, [EMAIL PROTECTED] wrote:
>>
>> > Since he wants math symbols on his buttons, which presumably would 
>> > not require localization, using images is not really blasphemous.
>>
>> Sorry.  Mathematical notation can vary quite widely.
>>
>> --
>> Gaute Strokkenes
http://www.srcf.ucam.org/~gs234/
>> HELLO KITTY gang terrorizes town, family STICKERED to death!
>>
>
>
>


RE: Script use by Mathematicians (Was Re: Single Unicode Font)

2001-05-22 Thread Murray Sargent

>Has anyone ever made a character collection for mathematics?
 
Please check out Unicode 3.1 and 3.2 (coming up). Characters from the
STIX collaboration of a variety of mathematical sources such as AmSTeX and
MathML have been collected into a math character set that seems to have
the vast majority of math symbols currently used in technical
publications.
 
Thanks
Murray




RE: Property error for U+2118?

2001-02-01 Thread Murray Sargent

The Weierstrass symbol U+2118 isn't a capital letter in spite of its name,
nor is it really an alphabet character. It's sort of a stylized mixture of a
rho and a lower-case script p. However in view of the principle that
character names never change, even if incorrect, this symbol remains the
SCRIPT CAPITAL P.  Fortunately at least its category could be corrected.

Murray

-Original Message-
From: John O'Conner [mailto:[EMAIL PROTECTED]]
Sent: Wed, January 31, 2001 8:46 AM
To: Unicode List
Subject: Property error for U+2118?


Is this an error or intentional change? I noticed that all other "SCRIPT
CAPITAL *" character values are in the "Lu" category. However, this
particular character has changed to "So" in the 3.0, 3.0.1, and 3.1 db.
Why? Why not the other SCRIPT CAPITAL * char values too?

2118;SCRIPT CAPITAL P;So;0;ON;N;SCRIPT P

Regards,
John O'Conner



RE: Benefits of Unicode

2001-01-29 Thread Murray Sargent

In some of my talks at the Unicode conferences (see "Tips and Tricks..."), I
have addressed problems with Unicode, notably trying to figure out whether
to use a Chinese Simplified/Traditional, Japanese, or Korean font to render
a Chinese character inserted in a plain-text scenario.  This is a real
scenario that happens millions of times a day in Windows dialog text
controls.  Sometimes things can be ambiguous.  Fortunately in the vast
majority of cases you have auxiliary information with which to resolve such
ambiguities.  

Another problem area is the complexity of Unicode due to, for example, the
multiple representations of the same end glyph combination.  This particular
problem is well known and is discussed by the UTC normalization TR.
Backward compatibility with a myriad code pages has made Unicode viable, but
it has added lots of complexity.  Sometimes I dream of 25 years hence when
all the issues are well understood. At that time one might define a
"cleanicode" that does the job magnificently.  Except that you'd have to
translate to/from, so it'd probably never catch on.

Murray

-Original Message-
From: Richard Cook [mailto:[EMAIL PROTECTED]]
Sent: Sat, January 27, 2001 1:37 PM
To: Unicode List
Cc: Unicode List
Subject: Re: Benefits of Unicode


Has anybody played devil's advocate to this, with a list of "Failings of
Unicode"? Are there any? :-) This question might in fact result in a
longer Benefits list 

> Tex Texin wrote:
> 
> I was asked to produce a list of the benefits of Unicode to be used
> as a sidebar with an article referencing Unicode.
> 
> Ideally, it would be a brief set of bullets for an audience that
> doesn't know a lot about internationalization. The bullets shouldnt
> be too detailed or technical.
> 
> I came up with the attached table. I think with some minor amendments
> I can drop the left column now and just use the benefits with
> examples.
> 
> I have until Monday morning to turn it in, so I thought I'd ask
> for some review.
> 
> Anything missing? Any constructive suggestions, gratefully
> appreciated.
> 
> tex
> 
> ---
> 
> 
> 
>   Benefits of Unicode
>
>   Unicode Property: All the characters of all the languages you might
>   ever need
>   Benefit: Multi-lingual documents: use any or all the languages you want
>   Example: Invoice or ticketing applications can print native language
>   names
>
>   Unicode Property: Defines one set of algorithms for processing text
>   Benefit: Reduced development and support costs and reduced
>   time-to-market, with one version of source code that works world-wide
>   Example: Sales to multiple countries the day of initial release
>
>   Unicode Property: An ISO standard
>   Benefit: Standards ensure interoperability
>   Example: Any applications reading the same text file will interpret it
>   correctly
>
>   Unicode Property: Accepted globally
>   Benefit: Worldwide deployment capability
>   Example: Text sent from any part of the world to any other part
>
>   Unicode Property: Supported by most, if not all, modern technologies
>   Benefit: Ease of integration
>   Example: Applications can exchange text without conversion loss or
>   errors
>
>   Unicode Property: Web standards are based on it
>   Benefit: Internet-readiness
>   Example: XML, the format for structured documents and data on the Web,
>   is Unicode-based
>
>   Unicode Property: Undergoes continuous development
>   Benefit: Evolution extends application lifetime and expands
>   capabilities to meet future needs
>   Example: Unicode Version 3.0 added 25,000+ characters and new
>   technical specifications that improved, for example, Middle

RE: Transcriptions of "Unicode"

2000-12-06 Thread Murray Sargent

[EMAIL PROTECTED] said: "I don't think it's reasonable to expect a browser
to apply various heuristics to determine the language."

Since people really want Japanese to be displayed with a Japanese font and
Chinese with a traditional or simplified Chinese font, the RichEdit facilty
and some other Microsoft software apply heuristics to figure out which font
to use for plain-text Chinese characters.  We usually get it right, but as
with any heuristics, errors can occur.  CJK unification is a great idea, but
unfortunately it isn't perfect in this respect.  I've talked about this
problem and our heuristics at a couple of recent Unicode conferences.  

With rich text, language attributes can be used which gets around the
problem.  In principle language attributes can also be inserted into plain
text, namely as language mnemonics comprised of plane-14 annotation
characters.  But such characters can confuse text clients, so we haven't
used them.

Thanks
Murray



RE: lag time in Unicode implementations in OS, etc?

2000-10-13 Thread Murray Sargent

It would be great if things were that easy.  But users typically don't want
to worry about fonts.  They enter a character, maybe by pasting plain text,
and want it magically to appear as something other than the
"missing-character" glyph.  They probably don't even know if it's a
supplementary-plane character.  So the underlying software has to figure out
an appropriate font to use.  It really wasn't possible to finalize such
software until the codepoints for specific characters were officially
defined, i.e., until the Athens WG2 meeting last month. It's still not a
trivial matter to figure out which fonts to use for plane 2, partly because
different locales may prefer different glyphs (the usual CJK unification
problem, which is particularly tricky in multilingual East Asian contexts).

With Windows 2000 and WordPad, say, you can enter a Plane-2 character (type
20000 then Alt+x) and select a font to display it. But you have to select an
appropriate font, it's not automatic.  It'll get better now that we know
where the codepoints are assigned.

Murray

> -Original Message-
> From: Markus Scherer [SMTP:[EMAIL PROTECTED]]
> Sent: Thursday, October 12, 2000 3:03 PM
> To:   Unicode List
> Subject:  Re: lag time in Unicode implementations in OS, etc?
> 
> so, what is there to be turned on and off in win2k if surrogate pairs are
> already handled as single units?
> if fonts just don't contain mappings and glyphs for pairs, then the layout
> engine will ignore them anyway until fonts provide that data.
> 
> markus
> 
> > John McConnell wrote:
> > 
> > Windows 2000 does support surrogates as defined in Unicode 2.0 e.g. it
> recognizes them when
> > converting to/from UTF-8 & OpenType recognizes new cmap types for
> surrogates.
> 
> that's great!
> 
> > The remaining steps e.g. fonts that display Ext B and sorting methods
> that integrate surrogate
> > pairs in culturally correct ways, depend on the final assignments of the
> new ranges. That isn't
> > in Unicode 2.0 (or 3.0).
> 
> of course.
> 
> > Chris Pratley wrote on 2000-oct-03:
> > > Surrogate support was not turned on by default in Win2000 because the
> > > Windows team was waiting for the standard to be finalized. It was also
> added
> > > late, so to reduce the potential impact they had it off - a safe bet
> since
> > > the standard was still 1+ years from completion.
> > 
> > which standard? unicode 2.0 introduced surrogates in 1996. iso 10646-1
> got amended with utf-16 in 1996, too.
> > there was nothing new in the technical issues of how to deal with utf-16
> since then.
> > 
> > > Chris



RE: surrogate terminology

2000-09-12 Thread Murray Sargent

For what it's worth, I've been referring to characters between 0x10000 and
0x10FFFF as "higher-plane" characters as distinguished from BMP characters.
Seems to work well in a general way. For plane 1, I use "plane-1"
characters, etc.

Murray



RE: the Ethnologue

2000-09-12 Thread Murray Sargent

Rick asks, 
>>Can anyone point me to an existing list of languages that is more
comprehensive and better
>> researched than the Ethnologue?  If there is no such list, then we don't
need to consider any
>>alternatives, right?

I've heard that the Ethnologue deals only with currently spoken languages
and doesn't provide codes that distinguish between dialects. It would be
nice to have a more general list of language codes.  It's important for
spell checking to distinguish between, say, British and American English.
The Ethnologue describes some such differences in text, but doesn't appear
to provide a corresponding list of secondary language codes (please correct
me if I'm wrong).

Thanks
Murray



RE: russian character string

2000-09-11 Thread Murray Sargent

Try converting the Russian character string to Unicode using
MultiByteToWideChar(1251,...) and then pass the Unicode string to the edit
control.

Murray

> I have problems with passing russian character string (in MS FoxPro 6.0)
> to
> DHTM Edit Control (MS). Probable reason is that string encoding is ANSI
> and
> default in_stream encoding for DHTM Edit Control is Unicode. What can I
> do?
> 
> Thank you
> 
> Best regards,
> Dr. Grigoryants



RE: Identifying a Unicode character

2000-08-18 Thread Murray Sargent

If you can get the text into a Win32 RichEdit control version 3.0 or later
(Office 2000 and/or Windows 2000 in WordPad), type Shift+Alt+x after the
character and the character will be replaced by its Unicode hexadecimal
value. If you type Alt+x, that code gets converted back into the Unicode
character. 

In the next version of Office, Word also supports Alt+x and makes it into a
toggle, that is, Alt+x will toggle a character back and forth between the
character and the character's Unicode hex value.  RichEdit 4.0 does the
same. Having used this facility for a couple of years now, I can't imagine
living without it.  The method is quite portable and could be used readily
on nonWindows OSs.

Murray

-Original Message-
From: David J. Perry [mailto:[EMAIL PROTECTED]]
Sent: Thu, August 17, 2000 4:09 AM
To: Unicode List
Subject: Identifying a Unicode character


Listmembers,

If I receive a Word document created with a font I don't have, and my 
Unicode fonts (even Lucida Sans Unicode or Arial Unicode) don't have that 
character, is there any way to find out what Unicode value underlies the 
little rectangle that is displayed?  Then I could look up the value and 
find out what the character is supposed to be.  I know how to get Word to 
convert a hex number into a real Unicode character--but can one do the
reverse?

Thanks -- David



RE: APL letters

2000-07-17 Thread Murray Sargent

One interesting possibility for representing the APL italic characters would
be to use the math italic alphabet in plane 1. The motivation for their use
in APL is similar to that for the math case: the characters are separate
symbols, e.g., they don't get grouped into natural language words.  In fact,
they typically represent math variables, so using the same notation is
natural as well as helpful if you want to write an APL program to study a
mathematical expression.

Murray 

> -Original Message-
> From: Frank da Cruz [SMTP:[EMAIL PROTECTED]]
> Sent: Monday, July 17, 2000 8:55 AM
> To:   Unicode List
> Subject:  APL letters
> 
> Sorry for not remembering the outcome of previous discussions on this...
> 
> The character set used by APL programming language includes special forms
> of the uppercase Latin letters A-Z, usually italized and/or underlined.
> 
> In an APL program, one might also need to include regular uppercase Latin
> letters A-Z, e.g. in character strings.
> 
> I don't remember APL well enough to recall whether this is an important
> distinction or simply a matter of style.  If it's a significant
> distinction,
> what the Unicode position on how to maintain it in a APL program written
> in Unicode 3.0?  If the answer is that there is no distinction, will this
> change because of STIX?
> 
> Thanks!
> 
> - Frank



RE: Names of planes, and request for sneak preview

2000-07-11 Thread Murray Sargent

None of the code values ending in 0xFFFE and 0xFFFF refer to characters,
i.e., 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, etc.  These values exist for
internal use only.

Murray
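The rule is easy to check in code. The following sketch (an invented helper, covering only the plane-ending noncharacters discussed here, not the later U+FDD0..U+FDEF block) relies on the fact that these code points are exactly those whose low 16 bits are 0xFFFE or 0xFFFF.

```c
/* True for U+nFFFE and U+nFFFF in every plane 0..16: masking off the
   low bit, (cp & 0xFFFE) == 0xFFFE matches both values at once. */
int is_plane_end_noncharacter(unsigned long cp)
{
    return cp <= 0x10FFFFUL && (cp & 0xFFFEUL) == 0xFFFEUL;
}
```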

-Original Message-
From: john [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, July 11, 2000 5:17 PM
To: Unicode List
Subject: Re: Names of planes, and request for sneak preview

> Asmus Freytag wrote:
> There are 0x110000 - 34 possible characters!

> All code values ending in 0xFFFE and 0xFFFF do *not* refer to characters. 
> They are not just temporarily unassigned, but permanently reserved as 
> non-characters.

Clarification request: Does that mean
None of the code values ending in 0xFFFE and 0xFFFF refer to characters?

or

Not all of the code values ending in 0xFFFE and 0xFFFF refer to characters
(i.e., some do and some do not)?



RE: Euro character in ISO

2000-07-11 Thread Murray Sargent

The two statements are correct. ISO has addressed the problem by adding more
ISO-8859-x standards, since changing 8859-1 would cause problems.  The best
thing to do is to use Unicode and avoid the codepage confusion :-)

Murray

> -Original Message-
> From: Leon Spencer [SMTP:[EMAIL PROTECTED]]
> Sent: Tuesday, July 11, 2000 2:26 PM
> To:   Unicode List
> Subject:  Euro character in ISO
> 
> The Euro does not exist in iso-8859-1. It
> is in Cp1252 (WinLatin1) - Microsoft's code page
> superset of iso-8859-1. 
> 
> Is this correct? Has ISO addressed the Euro character?
> If so, is the issue more one of vendors implementing it?
> 
> Leon



RE: Plane 14 language tags

2000-06-29 Thread Murray Sargent

Please note that the language tags in plane 14 are pure ASCII in nature.
The Turkish I problem doesn't arise, nor do ß and accented Latin characters.

Thanks
Murray

> -Original Message-
> From: Antoine Leca [SMTP:[EMAIL PROTECTED]]
> Sent: Thursday, June 29, 2000 7:56 AM
> To:   Unicode List
> Subject:  Re: Plane 14 language tags
> 
> Brendan Murray wrote:
> > 
> > Murray Sargent <[EMAIL PROTECTED]> wrote:
> > > Note that in C, it's essentially just as fast to make character
> > > comparisons with (ch | 0x20) as with ch alone, i.e., if you know
> > > ch is in an ASCII range (0 - 0x7F or 0xE0000 - 0xE007F), you can
> > > do a case insensitive compare as quickly as a case sensitive one.
> >
> > Except, of course, in Turkey where the lowercase of 'I' is not 'i' and
> the
> > uppercase of 'i' is not 'I'.
> 
> Unless I missed a very recent draft (that ought to be refused, IMHO),
> Turkey (or Azerbaijani) was not used for the plane 14 language tags,
> was it?
> 
> And of course, the lowercase of "SS" in German is sometimes ß, the lower
> case of an initial "E" in French followed by a consonant is more often
> "é" than "e", except if followed by "x"/"X" or a doubled one (like
> "ff") or two consonants, the first a nasal (like "mb", "nc", "MP", ...),
> the lowercase of Italian (or Corsican) "A'", "E'", ... at the end of a
> word is likely to be "à", "é/è", ... (Marco, is it really true? and how
> are é and è handled?)
> Et cætera.
> 
> 
> Antoine



RE: Plane 14 language tags

2000-06-28 Thread Murray Sargent

Note that in C, it's essentially just as fast to make character comparisons
with (ch | 0x20) as with ch alone, i.e., if you know ch is in an ASCII range
(0 - 0x7F or 0xE - 0xE007F), you can do a case insensitive compare as
quickly as a case sensitive one.  The problem with assuming lower case is
that the input might not all be in lower case.  I remember all too well
having to accept RTF control words with upper-case letters even though the
RTF spec and Word both specifically use all lower case for these words.

Murray
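The trick can be sketched as follows (an illustrative helper with an invented name, not code from RichEdit or Word): OR-ing an ASCII letter with 0x20 sets the lowercase bit, so 'A'..'Z' fold to 'a'..'z' while 'a'..'z' are unchanged. It is only safe when the pattern is known to be lowercase ASCII letters, which is exactly the RTF control-word situation described above.

```c
/* Match the start of s against a keyword that is known to consist of
   lowercase ASCII letters, ignoring the case of s.  (ch | 0x20) folds
   'A'..'Z' to 'a'..'z' at the cost of a single OR per character. */
int matches_keyword_ci(const char *s, const char *lower_keyword)
{
    while (*lower_keyword) {
        if ((*s | 0x20) != *lower_keyword)
            return 0;              /* mismatch (also catches end of s) */
        s++;
        lower_keyword++;
    }
    return 1;                      /* s begins with the keyword */
}
```

A call like matches_keyword_ci(p, "ansi") accepts "ansi", "ANSI", or "Ansi" in an RTF stream with no per-character branch beyond the comparison itself.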

> -Original Message-
> From: Kenneth Whistler [SMTP:[EMAIL PROTECTED]]
> Sent: Wednesday, June 28, 2000 12:03 PM
> To:   Unicode List
> Cc:   [EMAIL PROTECTED]
> Subject:  Re: Plane 14 language tags
> 
> Doug Ewell asked:
> 
> > 2.  (Ken and Glenn) Can you explain in a little more detail the
> rationale
> > for lowercasing the entire language tag?  It seems that if RFC 1766
> > is the model to be followed, then the RFC 1766 casing convention
> > (lowercase for language, uppercase for country) might be preferred.
> 
> John Cowan's non-authoritative response was fine by me -- and was
> better-expressed than this author would probably have done. ;-)
> 
> > I guess I don't see how lowercasing the entire tag simplifies or
> > speeds up anything, since the hyphen which separates language from
> > country is outside the range of lowercase letters anyway and
> > processes that want to ignore LT's must ignore the entire range from
> > U+E0000 through U+E007F.
> 
> It is not a matter of range-checking. For ignoring tags, you would always
> check the entire range. Rather, it is just a suggestion that since
> case is not significant in the language tags, it is slightly preferable
> to do the early "normalization" (i.e. case folding to lowercase, in
> this instance), rather than emitting arbitrarily mixed case tags
> and distributing the case-folding burden to all the interpreters of
> the tags.
> 
> --Ken Whistler



RE: Twinbridge & Word 2000

2000-06-27 Thread Murray Sargent

> [EMAIL PROTECTED] asked: The question is: Is there any way of making TrueType
> fonts and Unicode compatible? 
> 
The answer to this question is: Microsoft's implementation of TrueType has
always been based on Unicode, right from the first version in 1992.  The
answer to the original question, namely why Word 2000 does not recognise
Chinese text captured from Twinbridge files, isn't so simple.

If you know the charset used, you can convince Word 2000 to use that
encoding as follows: Start Word 2000, select the menu item
Tools/Options/General and turn on the "Confirm conversion at Open" option.
Then open the file with a .txt extension and you should see a "Convert File"
dialog box with a list of conversion alternatives. Select "Encoded Text",
whereupon you should see another dialog box entitled "File Conversion"
followed by the name of your file.  Choose "Other encoding" instead of Plain
text.  There are 7 Chinese codepages to choose from.  Hopefully one of these
will work. 

Murray
>