Re: Nicest UTF

2004-12-02 Thread Doug Ewell
Philippe Verdy  wrote:

> All UTF encodings (including the SCSU compressed encoding, or BOCU-8
> which is a variant of UTF-8, or also now the GB18030 Chinese standard
> which is now a valid representation of Unicode) have their pros and
> cons.

UTFs by definition are stateless and have exactly one valid
representation for each code point.  So SCSU, much as I like it, is not
a UTF.

BOCU-1 is also not a UTF, and in particular there is no conceivable way
it can be regarded as "a variant of UTF-8."  I have no idea what
"BOCU-8" is.  Maybe that one really is a variant of UTF-8.

Though not promulgated by Unicode, GB18030 can be considered a UTF,
since it is really just a mapping from Unicode code points to sequences
of 1, 2, or 4 bytes.
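
For the supplementary planes this mapping is even a closed formula (the
BMP part is table-driven).  A minimal sketch in C, with a function name
of my own invention:

    #include <stdint.h>

    /* Sketch, not authoritative: the linear part of GB18030 maps
     * U+10000..U+10FFFF onto the 4-byte sequences 90 30 81 30 through
     * E3 32 9A 35.  The BMP part needs lookup tables and is omitted. */
    static void gb18030_put_supplementary(uint32_t cp, unsigned char out[4])
    {
        uint32_t n = cp - 0x10000;          /* caller ensures cp >= U+10000 */
        out[3] = (unsigned char)(0x30 + n % 10);   n /= 10;
        out[2] = (unsigned char)(0x81 + n % 126);  n /= 126;
        out[1] = (unsigned char)(0x30 + n % 10);   n /= 10;
        out[0] = (unsigned char)(0x90 + n);
    }

Feeding it U+10000 yields 90 30 81 30, and U+10FFFF yields E3 32 9A 35,
the top of the GB18030 range.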

Later:

> SCSU is excellent for immutable strings, and is a *very* tiny overhead
> above ISO-8859-1 (note that the conversion from ISO-8859-1 to SCSU is
> extremely trivial, maybe even simpler than to UTF-8!)

An ISO 8859-1 string that contains no controls except NUL, CR, LF, and
Tab is *already* in SCSU.  No conversion needed.
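
A minimal sketch of that check in C (the function name is mine); there
is nothing to convert, only a condition to verify:

    #include <stddef.h>

    /* Sketch: in SCSU's initial state, bytes 0x20..0xFF and the four
     * controls NUL, Tab, LF and CR mean the same thing as in
     * ISO 8859-1; the remaining C0 bytes are SCSU command tags. */
    static int latin1_is_already_scsu(const unsigned char *s, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (s[i] < 0x20 && s[i] != 0x00 && s[i] != 0x09
                            && s[i] != 0x0A && s[i] != 0x0D)
                return 0;    /* this control would need an SCSU quote tag */
        return 1;            /* the bytes are valid SCSU exactly as-is */
    }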

I appreciate Philippe's support of SCSU, but I don't think *even I*
would recommend it as an internal storage format.  The effort to encode
and decode it, while by no means Herculean as often perceived, is not
trivial once you step outside Latin-1.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Nicest UTF

2004-12-02 Thread Doug Ewell
This thread amuses me.

I feel like I know quite a bit about the various Unicode encoding forms
and schemes, and my personal opinion is that UTF-16 combines the worst
of UTF-8 (necessity to support multi-code unit characters, regardless of
how "rare") with the worst of UTF-32 (high overhead for many scripts).
Yet there is a Technical Note, UTN #12, that encourages users to use
UTF-16 for internal processing, for exactly the opposite reasons.

So I think the word "nice" is actually quite appropriate for this
thread.  It implies a personal aesthetic judgment, which is what is
really being discussed here.

I use UTF-8 for most interchange (such as this message; OE doesn't allow
me to send UTF-16) and UTF-32 for most internal processing that I write
myself.  Let people say UTF-32 is wasteful if they want; I don't tend to
store huge amounts of text in memory at once, so the overhead is much
less important than one code unit per character.

I do wish the following statements would stop coming up every time this
subject is debated:

(1)  UTF-32 doesn't really guarantee one code unit per character, since
you still have to worry about combining sequences.
(2)  Write functions that deal with strings, not characters, and the
difference becomes moot.

Both statements (which are really variations on the same theme) miss the
point somewhat.  Combining sequences and other interactions between
encoded characters don't change the fact that sometimes you have to deal
with strings, and sometimes you have to deal with individual characters.
That's just the fact.  Both types of processing are important.
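
To make the per-character case concrete, a rough sketch (names are
mine; the UTF-8 walk assumes valid input and an index within range):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: the i-th code point is one array access in UTF-32, but
     * a scan over continuation bytes (10xxxxxx) in UTF-8. */
    static uint32_t utf32_at(const uint32_t *s, size_t i)
    {
        return s[i];                          /* O(1) */
    }

    static const unsigned char *utf8_at(const unsigned char *s, size_t i)
    {
        while (i > 0)
            if ((*++s & 0xC0) != 0x80)        /* stepped onto a lead byte */
                i--;
        return s;                             /* O(i): address of lead byte */
    }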

I also think that as more and more Han characters are encoded in the
supplementary space, corresponding to the ever-growing repertoires of
Eastern standards, the story that UTF-16 is virtually a fixed-width
encoding because "supplementary code points are very rare in most text"
will gradually go away.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: current version of unicode-font

2004-12-02 Thread James Kass

John Cowan wrote,

> In what, breadth of coverage or aesthetics?  The GNU Unifont has very
> wide coverage though it is a bitmap font; James Kass's CODE 2000 and CODE
> 2001 probably have the widest coverage of any font, though it costs US$5
> to use them.  

Code2001 is freeware.

> Both of them IMHO are a tad on the ugly side.

There's always room for improvement.

Best regards,

James Kass




Re: current version of unicode-font

2004-12-02 Thread Richard Cook
On Thu, 2 Dec 2004, John Cowan xiele:

> Paul Hastings scripsit:
>
> > speaking of which, *are* there any open source fonts that come even
> > close to Arial Unicode MS?
>
> In what, breadth of coverage or aesthetics?  The GNU Unifont has very
> wide coverage though it is a bitmap font; James Kass's CODE 2000 and CODE
> 2001 probably have the widest coverage of any font, though it costs US$5
> to use them.  Both of them IMHO are a tad on the ugly side.

In all fairness, the CODE 2000 font from James Kass is quite beautiful,
conceptually speaking. If the current execution is a tad ungainly here and
there, I ask 3 questions: (0) "What do you want for nothing (if you have
not yet paid the shareware fee)?"; (1) "What do you want for $5?"; and (2)
"What do you want from a $5 shareware font that aspires to perfect
coverage of the *entire* BMP?"

Code2000 is not open source, but Kass is remarkably responsive to user
input.

I urge everyone to download a copy of Code2000, and provide the developer
with feedback, both in terms of suggestions to improve the TrueType font,
and in terms of money to fund development.

http://home.att.net/~jameskass/

James is doing some great work, using some relatively low-level
programming tools. In my experience (admittedly somewhat limited, since I
don't care about *everything* in the BMP) his font works where other
fonts, professional and amateur, completely fail. If a font has the glyph
you need in any form, that's far better than having a glyph of last
resort, or no glyph at all.

Disclaimer: I have no commercial relation to Kass, and have received no
compensation for this endorsement. This review should also not be taken as
expressing approval of the shape of any glyph in the Code2000 font,
especially the Capital Letter J, which I think even Kass himself has
called "quirky at best". Note however that the Code2000 "hexagram" block
characters do look quite nice, and better yet, they work in Adobe
Illustrator CS, though no one (neither Kass nor Adobe) seems to know why
yet :-)



Re: Nicest UTF

2004-12-02 Thread Philippe Verdy
If you need immutable strings that take as little space as possible in 
memory for your running app, then consider using SCSU for the internal 
storage of the string object, and have a method return an indexed array of 
code points, or a UTF-32 string, when you need to mutate the string object 
into another.

SCSU is excellent for immutable strings, and is a *very* tiny overhead above 
ISO-8859-1 (note that the conversion from ISO-8859-1 to SCSU is extremely 
trivial, maybe even simpler than to UTF-8!)
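
A bare sketch of that scheme in C, under a big simplifying assumption:
the stored SCSU stays in the initial state and the default Latin-1
window, so each byte is one code point.  A real implementation needs a
full UTS #6 decoder, and the names here are mine:

    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>

    typedef struct {
        const unsigned char *scsu;   /* immutable compressed form */
        size_t len;                  /* code points == bytes, per the
                                      * Latin-1-window assumption above */
    } ImmutableStr;

    /* Decode on demand: hand back UTF-32 the caller may mutate. */
    static uint32_t *to_utf32(const ImmutableStr *s)
    {
        uint32_t *out = malloc(s->len * sizeof *out);
        if (out)
            for (size_t i = 0; i < s->len; i++)
                out[i] = s->scsu[i]; /* default window: byte == code point */
        return out;                  /* NULL on allocation failure */
    }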

From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
For the internals of my language Kogut I've chosen a mixture of ISO-8859-1
and UTF-32. The representation is normalized, i.e. a string whose
characters all fit in narrow characters is always stored in the narrow form.
I've chosen representations with fixed size code points because
nothing beats the simplicity of accessing characters by index, and the
most natural thing to index by is a code point.
Strings are immutable, so there is no need to upgrade or downgrade a
string in place, so having two representations doesn't hurt that much.
Since the majority of strings are ASCII, using UTF-32 for everything
would be wasteful.
Mutable and resizable character arrays use UTF-32 only.
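
A rough sketch of such a two-representation string in C (the names are
mine, not Kogut's; allocation failures are not handled):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: store narrow ISO-8859-1 whenever every code point fits
     * in a byte, else UTF-32; indexing stays O(1) either way. */
    typedef struct {
        int wide;                    /* 0 = ISO-8859-1, 1 = UTF-32 */
        size_t len;                  /* length in code points */
        union { unsigned char *n; uint32_t *w; } u;
    } Str;

    static Str str_from_codepoints(const uint32_t *cp, size_t len)
    {
        Str s = { 0, len, { NULL } };
        for (size_t i = 0; i < len; i++)
            if (cp[i] > 0xFF) { s.wide = 1; break; }
        if (s.wide) {
            s.u.w = malloc(len * sizeof *s.u.w);
            memcpy(s.u.w, cp, len * sizeof *cp);
        } else {                     /* normalized: narrow when possible */
            s.u.n = malloc(len);
            for (size_t i = 0; i < len; i++)
                s.u.n[i] = (unsigned char)cp[i];
        }
        return s;
    }

    static uint32_t str_at(const Str *s, size_t i)
    {
        return s->wide ? s->u.w[i] : s->u.n[i];
    }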




RE: current version of unicode-font

2004-12-02 Thread Kevin Brown
On Thu, 2 Dec 2004 at 07:51:42 -0800, Peter Constable wrote:

>The most recently shipped version is 1.01, which ships with Office 2003.

... and Office 2004 doesn't ship with Arial Unicode MS at all!

Kevin




Re: Nicest UTF

2004-12-02 Thread Philippe Verdy
There's no *universal* best encoding.
UTF-8 however is certainly today the best encoding for portable 
communications and data storage (though it now competes with SCSU, which 
uses a compressed form where, on average, each Unicode character in most 
documents is represented by one byte; other schemes also exist that apply 
deflate compression to UTF-8).

The problem with UTF-16 and UTF-32 is byte ordering, where byte is meant in 
terms of portable networking and file storage, i.e. 8-bit units in almost 
all current technologies. With UTF-16 and UTF-32, you need a way to 
determine how bytes are ordered within a code unit as read from a 
byte-oriented stream. You don't with UTF-8.
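
A minimal sketch of the usual BOM check for UTF-16 (names are mine):

    #include <stddef.h>

    /* Sketch: U+FEFF is the byte order mark and U+FFFE is a
     * noncharacter, so the first two bytes identify the order. */
    typedef enum { ORDER_BE, ORDER_LE, ORDER_UNKNOWN } ByteOrder;

    static ByteOrder utf16_order(const unsigned char *buf, size_t len)
    {
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return ORDER_BE;
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return ORDER_LE;
        return ORDER_UNKNOWN;   /* no BOM: a higher protocol must decide */
    }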

The problem with UTF-8 is that it is most often inefficient or awkward to 
work with inside applications and libraries, which access strings and count 
characters more easily on fixed-width code units.
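
A small sketch of that cost (the function name is mine): counting the
code points of a UTF-8 buffer touches every byte, where a fixed-width
encoding reads the count off the buffer length:

    #include <stddef.h>

    static size_t utf8_codepoint_count(const unsigned char *s, size_t len)
    {
        size_t n = 0;
        for (size_t i = 0; i < len; i++)
            if ((s[i] & 0xC0) != 0x80)   /* not a continuation byte */
                n++;
        return n;
    }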

Although UTF-16 is not strictly fixed-width, it is quite easy to work with, 
and is often more efficient than UTF-32 because of smaller memory 
allocations.

UTF-32, however, is the easiest solution when an application really wants 
to handle each character, encoded as one Unicode code point, with a single 
code unit.

All UTF encodings (including the SCSU compressed encoding, or BOCU-8 which 
is a variant of UTF-8, or also now the GB18030 Chinese standard which is now 
a valid representation of Unicode) have their pros and cons.

Choose among them because they are widely documented, and offer good 
interoperability across the many libraries that handle them with the same 
semantics.

If none of these encodings satisfies your application, you may even create 
your own (as Sun did when it modified UTF-8 so that any Unicode string can 
be represented within a null-terminated C string, and so that any sequence 
of 16-bit code units, even invalid ones with unpaired surrogates, can be 
represented on 8-bit streams). If you do that, don't expect the encoding to 
be easily portable and recognized by other systems, unless you document it 
with a complete specification and make it freely available for alternate 
implementations by others.
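
For one 16-bit code unit, a sketch of those two tweaks as I understand
Sun's "modified UTF-8" (the function name is mine):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: U+0000 becomes the overlong pair C0 80, so the encoded
     * string contains no NUL byte; every other unit, including an
     * unpaired surrogate, is encoded on its own in 1..3 bytes. */
    static size_t mutf8_put(uint16_t u, unsigned char out[3])
    {
        if (u == 0x0000) { out[0] = 0xC0; out[1] = 0x80; return 2; }
        if (u < 0x80)    { out[0] = (unsigned char)u;    return 1; }
        if (u < 0x800) {
            out[0] = (unsigned char)(0xC0 | (u >> 6));
            out[1] = (unsigned char)(0x80 | (u & 0x3F));
            return 2;
        }
        out[0] = (unsigned char)(0xE0 | (u >> 12));         /* 0x800..0xFFFF, */
        out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F)); /* surrogates too */
        out[2] = (unsigned char)(0x80 | (u & 0x3F));
        return 3;
    }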

- Original Message - 
From: "Arcane Jill" <[EMAIL PROTECTED]>
To: "Unicode" <[EMAIL PROTECTED]>
Sent: Thursday, December 02, 2004 2:19 PM
Subject: RE: Nicest UTF


Oh for a chip with 21-bit wide registers!
:-)
Jill
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Antoine Leca
Sent: 02 December 2004 12:12
To: Unicode Mailing List
Subject: Re: Nicest UTF
There are other factors that might influence your choice.
For example, the relative cost of using 16-bit entities: on a Pentium it is
cheap, on more modern X86 processors the price is a bit higher, and on some
RISC chips it is prohibitive (that is, short may become 32 bits; obviously,
in such a case, UTF-16 is not really a good choice). On the other extreme,
you have processors where bytes are 16 bits; obviously again, then UTF-8 is
not optimum there. ;-)






Re: current version of unicode-font

2004-12-02 Thread Andrew C. West
On Fri, 03 Dec 2004 00:38:25 +0700, Paul Hastings wrote:
> 
> John Cowan wrote:
> 
> > Googling for "free Unicode fonts" (no quotes) is useful.
> 
> sort of, when i've googled for this in the past, language-specific 
> (chinese seemed to be the most frequent) fonts turn up more often than 
> not. hey if you guys don't know, who does?
> 

As someone once said, Google is your friend, but if you don't have time to
google for yourself, these (and many other similar pages) may give you some
useful pointers:

http://www.alanwood.net/unicode/fonts.html
http://www.babelstone.co.uk/Fonts/Fonts.html

Andrew



Re: current version of unicode-font

2004-12-02 Thread Paul Hastings
John Cowan wrote:
> In what, breadth of coverage or aesthetics?

breadth mainly. i'm more interested in fonts for testing i18n web app 
output than looking "nice".

> Googling for "free Unicode fonts" (no quotes) is useful.

sort of, when i've googled for this in the past, language-specific 
(chinese seemed to be the most frequent) fonts turn up more often than 
not. hey if you guys don't know, who does?





Re: Nicest UTF

2004-12-02 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes:

> Oh for a chip with 21-bit wide registers!

Not 21-bit but 20.087462841250343-bit :-)
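
(That is log2 of the 0x110000 = 1,114,112 code points U+0000..U+10FFFF:
log2(17 × 2^16) = 16 + log2 17 ≈ 20.087462841250343.)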

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: current version of unicode-font

2004-12-02 Thread John Cowan
Paul Hastings scripsit:

> speaking of which, *are* there any open source fonts that come even 
> close to Arial Unicode MS?

In what, breadth of coverage or aesthetics?  The GNU Unifont has very
wide coverage though it is a bitmap font; James Kass's CODE 2000 and CODE
2001 probably have the widest coverage of any font, though it costs US$5
to use them.  Both of them IMHO are a tad on the ugly side.

Googling for "free Unicode fonts" (no quotes) is useful.

-- 
One Word to write them all,         John Cowan <[EMAIL PROTECTED]>
  One Access to find them,          http://www.reutershealth.com
One Excel to count them all,        http://www.ccil.org/~cowan
  And thus to Windows bind them.    --Mike Champion



Re: current version of unicode-font

2004-12-02 Thread Paul Hastings
Peter Constable wrote:
> Microsoft has never used the label 'OpenFont' for this or any of the
> fonts that ship with their products.

speaking of which, *are* there any open source fonts that come even 
close to Arial Unicode MS?



RE: current version of unicode-font

2004-12-02 Thread Peter Constable
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Peter R. Mueller-Roemer

> I found some serious faults (with the implementation of short
> sequences of combining diacritical marks, Greek and Hebrew with their
> accents and points) with Arial Unicode MS version 1.00 (C) ...- 2000.
> I would like to test the newest version of this and other fonts, but
> am reading that MS is not allowing / providing download of this font
> any more.
> 
> 1. Which is the currently most up-to-date version, and where can I
> find it?

The most recently shipped version is 1.01, which ships with Office 2003.

 
> 2. If the newest version can only be had by buying new
> Office-products, then the label 'OpenFont' is not deserved.

Microsoft has never used the label 'OpenFont' for this or any of the
fonts that ship with their products.



Peter Constable




RE: Nicest UTF

2004-12-02 Thread Arcane Jill
Oh for a chip with 21-bit wide registers!
:-)
Jill
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Antoine Leca
Sent: 02 December 2004 12:12
To: Unicode Mailing List
Subject: Re: Nicest UTF
There are other factors that might influence your choice.
For example, the relative cost of using 16-bit entities: on a Pentium it is
cheap, on more modern X86 processors the price is a bit higher, and on some
RISC chips it is prohibitive (that is, short may become 32 bits; obviously,
in such a case, UTF-16 is not really a good choice). On the other extreme,
you have processors where bytes are 16 bits; obviously again, then UTF-8 is
not optimum there. ;-)


Re: Nicest UTF

2004-12-02 Thread Antoine Leca
On Wednesday, December 01, 2004 22:40Z Theodore H. Smith va escriure:

> Assuming you had no legacy code. And no "handy" libraries either,
> except for byte libraries in C (string.h, stdlib.h). Just a C++
> compiler, a "blank page" to draw on, and a requirement to do a lot of
> Unicode text processing.
<...>
> What would be the nicest UTF to use?

There are other factors that might influence your choice.
For example, the relative cost of using 16-bit entities: on a Pentium it is
cheap, on more modern X86 processors the price is a bit higher, and on some
RISC chips it is prohibitive (that is, short may become 32 bits; obviously,
in such a case, UTF-16 is not really a good choice). On the other extreme,
you have processors where bytes are 16 bits; obviously again, then UTF-8 is
not optimum there. ;-)

Also, it may matter whether you have write access to the sources of your
library: if so, it could be possible (at a minimal adaptation cost) to use
it to handle 16-bit or 32-bit characters. Even more interesting, this might
already exist, in the form of the wcs*() functions of the C95 Standard.
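
A minimal sketch of that C95 route (the output depends on the
platform's wchar_t width, which is exactly the point):

    #include <wchar.h>
    #include <stdio.h>

    int main(void)
    {
        const wchar_t *s = L"caf\xE9";   /* "café", é as one wide char */
        /* 4 code units here; wchar_t is 32 bits on many Unix
         * compilers (effectively UTF-32) but 16 bits on Windows. */
        printf("%lu code units\n", (unsigned long)wcslen(s));
        return 0;
    }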

It also depends, obviously, on the kind of processing you are doing. Some
programs mainly handle strings, so the transformation format is not the
most important thing. Others mainly handle characters, and then UTF-8 is
less adequate because of the cost of locating them. On the other hand,
texts are stored in external files, and if the external format is UTF-8 or
based on it, that might be a bias toward it.

And finally it may depend on how many different architectures you need to
deploy your programs on. C is great for its portability, yet portability is
a tool, not a necessary target. A single user usually does not care how
portable the program he is using is, provided it does the job and comes
cheap (or not too expensive). I agree portability is a good point for IT
managers (because it foments competition, which is good for cutting costs.)
But on the other hand, too much portability can be counter-productive for
everyone (for example, writing a text processor in C which allows
characters to be stored directly as 8-bit as well as UTF-16 bytes. Or using
long for everything, in order to be potentially portable to 16-bit ints,
even if the storage limitations will impede practical use.)


I believe the current availability of 3 competitive formats is a fact that
we have to accept. It is certainly not as optimal as the prevalence of
ASCII may have been. It is certainly a bad thing for some suppliers, such
as those writing those libraries, because it means ×3 work for them and an
increased price for their users (whether in sales price or in delay of
availability of features/bug corrections/etc.) Moreover, the present
existence of widely available yet incompatible installed bases for at least
two of the formats (namely UTF-16 on Windows NT and UTF-8 on Internet
protocols) means additional costs for nearly the whole industry. This may
mean more workload for those actually working in this area ;-), but also
more pressure on them from their managements, and it results in waste when
seen from the client side, so it is not a good thing for marketing.
Yet it is this way, and I assume we cannot do much to cure that.

Now let's proceed to read the rest...


> I think UTF8 would be the nicest UTF.

So that is your point of view.


> But does UTF32 offer simpler better faster cleaner code?

Perhaps you can actually try to measure it.


> A Unicode "character" can be decomposed. Meaning that a character
> could still be a few variables of UTF32 code points! You'll still
> need to carry around "strings" of characters, instead of characters.

This syllogism assumes that all text handling requires decomposition. I
disagree with this.


> The fact that it is totally bloat worthy, isn't so great. Bloat
> mongers aren't your friend.

Again, do you care to offer us any figures?


> The fact that it is incompatible with existing byte code doesn't help.

See above.

> UTF8 can be used with the existing byte libraries just fine.

It depends on what you want to do. For example, using strchr()/strspn() and
the like may be great if you are dealing with some sort of tagged format
such as SGML; but if your text uses U+2028 as its end-of-line indicator, it
suddenly becomes not so great...
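
A small sketch of the difference (the function name is mine):

    #include <string.h>

    /* Sketch: an ASCII delimiter still works byte-wise in UTF-8, but
     * U+2028 LINE SEPARATOR must be searched as its three-byte UTF-8
     * form; strchr() cannot find it. */
    static const char *find_u2028(const char *utf8)
    {
        return strstr(utf8, "\xE2\x80\xA8");   /* U+2028 in UTF-8 */
    }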

> An accented A in UTF-8, would be 3 bytes decomposed.

Or more.

> In UTF32, thats 8 bytes!

And so? Nobody is saying that UTF-32 is space efficient. In fact, UTF-32
specifically trades space for other advantages. If you are space-tight,
then obviously UTF-32 is not a choice. That is another constraint, which
you did not add to the list above.

On the other hand, nowadays the general-use workstation used for text
processing has several hundred megabytes of memory. That is, several score
megabytes of UTF-32 characters, decomposed and so on.
The biggest text I have at hand is below 15 M. And when I have to deal with
it, I am quite cl

current version of unicode-font

2004-12-02 Thread Peter R. Mueller-Roemer
I found some serious faults (with the implementation of short sequences 
of combining diacritical marks, Greek and Hebrew with their accents and 
points) with Arial Unicode MS version 1.00 (C) ...- 2000.
I would like to test the newest version of this and other fonts, but am 
reading that MS is not allowing / providing download of this font any 
more. 

1. Which is the currently most up-to-date version, and where can I find it?
2. If the newest version can only be had by buying new Office-products, 
then the label 'OpenFont' is not deserved.

3. Any experience with junicode? It claims compliance with Unicode 4.0.
Does it cover the range  -  of Unicode, or at least combining 
diacritical marks, Greek and Hebrew?

Peter