Does Java 1.5 support Unicode math alphanumerics as variable names?

2004-01-23 Thread Murray Sargent
For example, MATHEMATICAL ITALIC SMALL I (U+1D456)? With such usage, Java mathematical programs could look more like the original math.
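
A minimal sketch of how one might probe this with the supplementary-aware java.lang.Character methods planned for 1.5 (assuming the Character.isJavaIdentifierStart(int) and Character.toChars(int) overloads from the 1.5 API drafts):

    // Sketch: is U+1D456 (MATHEMATICAL ITALIC SMALL I) usable in identifiers?
    public class MathIdentifierCheck {
        public static void main(String[] args) {
            int cp = 0x1D456; // general category Ll (lowercase letter)
            // The int overload accepts supplementary code points directly.
            System.out.println(Character.isJavaIdentifierStart(cp)); // expected: true
            // In source text the character occupies a surrogate pair:
            String s = new String(Character.toChars(cp));
            System.out.println(s.length()); // 2 UTF-16 code units
        }
    }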


Thanks

Murray


Re: Three new Technical Notes posted - Ada UTF-16

2004-01-23 Thread Markus Scherer
D. Starner wrote:
> #12 UTF-16 for Processing
>
> This is incorrect in saying that Ada uses UTF-16. It supports
> UCS-2 only. The text of the standard says:
>
>     The predefined type Wide_Character is a character type
>     whose values correspond to the 65536 code positions of
>     the ISO 10646 Basic Multilingual Plane (BMP). [...]
>
> which doesn't include surrogate code points.
True, but that is not much different from (or worse than) the situation for Java, for example. Once you 
have 16-bit types and string literals, adding a few functions to deal with supplementary code points is 
not hard; we did this for Java in ICU4J, as sketched below.
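
As a rough illustration of what such a function looks like (a minimal sketch of the general technique, 
not the actual ICU4J code), a codePointAt-style helper over 16-bit code units might be:

    // Sketch: read one code point from UTF-16 code units, combining a
    // surrogate pair into a supplementary code point where one occurs.
    static int codePointAt(char[] text, int index) {
        char lead = text[index];
        if (lead >= 0xD800 && lead <= 0xDBFF && index + 1 < text.length) {
            char trail = text[index + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
            }
        }
        return lead; // BMP code point, or an unpaired surrogate passed through
    }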

There is little practical difference between a language supporting UCS-2 and one supporting UTF-16: 
where functions do not handle supplementary code points, they usually also do not handle Unicode 
versions above 3.0, so string case mappings etc. come out the same either way.

A language like that can be relatively easily upgraded to full UTF-16 handling by updating the 
character and string function implementations, and adding a few new APIs - that is what Java is 
doing. The upgrade is done naturally when the standard functions are extended to Unicode 3.1 or later.

As such, whether the strings contain UCS-2 or UTF-16 depends less on the language definition and 
more on the functions that are used, and the version of the standard libraries.

> The next version of Ada will have 32-bit characters to fully
> support Unicode - the text of the proposal is here:
>
> plus lengthy discussion on the issues.
Thank you very much for the link.

The proposal seems to be to continue to treat Wide strings as UCS-2, and to treat Wide_Wide strings 
(a new type) as UTF-32. This would give Ada a total of three different native string types on the 
language level. It would also mean that existing code, using 16-bit strings, would not benefit from 
an upgrade but would instead have to be rewritten for support of supplementary code points. This may 
in fact slow down such support.

There will be a presentation of the choices for Java (including UTF-32) at IUC 25.

Best regards,
markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.


Re: Three new Technical Notes posted

2004-01-23 Thread D. Starner
>  #12 UTF-16 for Processing
>   by Markus Scherer

This is incorrect in saying that Ada uses UTF-16. It supports
UCS-2 only. The text of the standard says:

The predefined type Wide_Character is a character type 
whose values correspond to the 65536 code positions of 
the ISO 10646 Basic Multilingual Plane (BMP). [...] As 
with the other language-defined names for nongraphic characters, the 
names FFFE and FFFF are usable only with 
the attributes (Wide_)Image and (Wide_)Value; they are 
not usable as enumeration literals. All other values of
Wide_Character are considered graphic characters, and 
have a corresponding character_literal. 

which doesn't include surrogate code points. The next 
version of Ada will have 32-bit characters to fully
support Unicode - the text of the proposal is here:



plus lengthy discussion on the issues. 




Three new Technical Notes posted

2004-01-23 Thread Rick McGowan
Three new Unicode Technical Notes are now available on the Unicode website.

The main Tech Notes page is here:
http://www.unicode.org/notes/

The new notes are:

  #11 Representing Myanmar in Unicode: Details and Examples
by Martin Hosken & Maung Tuntunlwin

  #12 UTF-16 for Processing
by Markus Scherer

  #13 GDP by Language
by Mark Davis

The notes are all accessible through the left-side "navigation bar" on the  
main Tech Notes page.

Regards,
Rick



Re: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Markus Scherer
Doug Ewell wrote:
> Markus Scherer wrote:
>> "claim"? That hurts...
>>
>> I did measure these things, and the numbers in the table are all from
>> my measurements. I also included the type of machine I used, etc.
>> (http://www.unicode.org/notes/tn6/#Performance)
>
> Certainly I would never accuse Markus of falsifying these statistics.
> The word "claim" was not meant in the sense of "unsubstantiated claim."

I might have overreacted a little here. I am not in _excruciating_ pain ;-)
Sorry for misunderstanding "claim". My only excuse is that I am not a native speaker.

> I'll have to see how my encoder and decoder perform when I finish them.
> They're currently written for simplicity, not speed.

My initial implementations were slower, too. I worked quite a bit on the performance of the 
converters that are in ICU4C.

>> UTF-8 is useful because it's simple, and supported just about
>> everywhere - but it's otherwise hardly optimal for anything.
>
> As John said, it's all about ASCII transparency, together with no false
> positives for "ASCII bytes" in non-Basic Latin characters.

I agree with this, of course - in my mind, it's part of the "supported just about everywhere".

A good part of what makes ASCII transparency useful for HTML and XML and other formats with internal 
encoding declarations is that one can parse those encoding declarations by initially assuming an 
ASCII-compatible encoding.
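
A minimal sketch of that bootstrapping step (a hypothetical helper, not taken from any real parser): 
provisionally treat the leading bytes as ASCII and look for the encoding token.

    // Sketch: sniff an XML-style encoding declaration from the first bytes
    // of a document by assuming an ASCII-compatible encoding.
    static String sniffEncoding(byte[] head) {
        String ascii = new String(head, java.nio.charset.StandardCharsets.US_ASCII);
        int i = ascii.indexOf("encoding=");
        if (i < 0 || i + 10 >= ascii.length()) return null;
        char quote = ascii.charAt(i + 9);          // expect ' or "
        int end = ascii.indexOf(quote, i + 10);    // matching closing quote
        return end < 0 ? null : ascii.substring(i + 10, end);
    }

Once the declared name is known, the stream can be re-decoded with the proper converter.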

It would be less important if Unicode signatures (BOMs) were used and recognized more often.

markus



Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Jon Hanna
Quoting Philippe Verdy <[EMAIL PROTECTED]>:

> From: "Jon Hanna" <[EMAIL PROTECTED]>
> > Quoting Marco Cimarosti <[EMAIL PROTECTED]>:
> >
> > > Jon Hanna wrote:
> > > > I refuse to rename my UTF-81920!
> > >
> > > Doug, Shlomi, there's a new one out there!
> > > Jon, would you mind describing it?
> >
> > There are two different UTF-81920s (the resultant ambiguity is very much
> in the
> > spirit of UTF-81920).
> 
> I can't find any reference document about "UTF-81920" in Google.

That's because there are no documents about UTF-81920. It barely qualifies as
the starting point of a gedankenexperiment, never mind as a spec. That's why
this thread is marked as OT. The closest thing to a spec is the email I just
sent to this list.

> All I can find are documents describing "UTF-8", which encodes 128 characters
> in 1 byte and 1920 characters in 2 bytes.

Excellent, the inclusion of "1920" in the name is then wonderfully
serendipitous.
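
(For the record, the arithmetic behind the coincidence: UTF-8's two-byte range U+0080..U+07FF holds
0x0800 - 0x0080 = 2048 - 128 = 1920 code points, while the 10KB figure from which the name was coined
works out to 10 x 1024 x 8 = 81920 bits.)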

> Does it mean that UTF-81920 is a restriction of UTF-8 to the range
> [U+0000..U+07FF], which can be encoded with at most 2 bytes in UTF-8?

No, it is as explained in the email.

> UTF-81920 would then effectively not be a Unicode-compatible encoding scheme
> as it would be restricted to only Latin, Greek, Coptic, Cyrillic, Armenian,
> Hebrew and Arabic with their diacritics, excluding all Asian scripts,
> surrogates, and compatibility characters, Arabic/Hebrew extension, common
> ligatures like "fi" and presentation forms, as well as currency signs (such
> as the Euro symbol coded at U+20AC), technical symbols, and even the BOM
> U+FEFF? This encoding does not seem suitable to even represent successfully
> the legacy DOS/OEM codepages, or the legacy PostScript and Mac charsets.

Yes, day-dream concepts mentioned in jest do often have technical
short-comings.

-- 
Jon Hanna

*Thought provoking quote goes here*



Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Philippe Verdy
From: "Jon Hanna" <[EMAIL PROTECTED]>
> Quoting Marco Cimarosti <[EMAIL PROTECTED]>:
>
> > Jon Hanna wrote:
> > > I refuse to rename my UTF-81920!
> >
> > Doug, Shlomi, there's a new one out there!
> > Jon, would you mind describing it?
>
> There are two different UTF-81920s (the resultant ambiguity is very much
in the
> spirit of UTF-81920).

I can't find any reference document about "UTF-81920" in Google.

All I can find are documents describing "UTF-8", which encodes 128 characters
in 1 byte and 1920 characters in 2 bytes.

Does it mean that UTF-81920 is a restriction of UTF-8 to the range
[U+0000..U+07FF], which can be encoded with at most 2 bytes in UTF-8?

UTF-81920 would then effectively not be a Unicode-compatible encoding scheme
as it would be restricted to only Latin, Greek, Coptic, Cyrillic, Armenian,
Hebrew and Arabic with their diacritics, excluding all Asian scripts,
surrogates, and compatibility characters, Arabic/Hebrew extension, common
ligatures like "fi" and presentation forms, as well as currency signs (such
as the Euro symbol coded at U+20AC), technical symbols, and even the BOM
U+FEFF? This encoding does not seem suitable to even represent successfully
the legacy DOS/OEM codepages, or the legacy PostScript and Mac charsets.




[OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Jon Hanna
Quoting Marco Cimarosti <[EMAIL PROTECTED]>:

> Jon Hanna wrote:
> > I refuse to rename my UTF-81920!
> 
> Doug, Shlomi, there's a new one out there!
> 
> Jon, would you mind describing it?

There are two different UTF-81920s (the resultant ambiguity is very much in the
spirit of UTF-81920).

The first is not only not a proper UTF, but it is not Unicode at all; rather
it's science fiction.
The expected lifetime of Unicode was mentioned a while back, which set me
thinking about what could go beyond Unicode and why. Hypothesising a massive
increase in computing power and bandwidth, and the removal of other
technological limitations, I imagined an expert system that could read a piece
of text much as an expert in linguistics, typography, calligraphy and
disciplines related to the text might. Two such systems would communicate with
each other not only in terms of characters, but also in terms of descriptions
of characters, and so "c with a downwards-pointing triangle found in some
Chumash text, possibly a fancified hacek, possibly something else" (from a
recent post to this list) could be "encoded", so to speak. Since a detailed
description of a character, especially one that could not be reliably compared
with a known character, could potentially be quite large, I picked the figure
of 10KB (81920 bits) out of the air, and hence UTF-81920!

The second idea is a possibly practical one inspired by the above flight of
fancy - of a Wiki or similar of information about the various characters
encoded in Unicode, or proposed characters, or even a note on why the reserved
code points U+2072 and U+2073 aren't superscript 2 and superscript 3, encoding
histories and so on. I wouldn't be able to contribute to such a project (and
most of those who would are very busy), but I'd certainly enjoy flicking
through it if it existed.

-- 
Jon Hanna

*Thought provoking quote goes here*



RE: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Marco Cimarosti
Jon Hanna wrote:
> I refuse to rename my UTF-81920!

Doug, Shlomi, there's a new one out there!

Jon, would you mind describing it?

_ Marco



Re: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Jon Hanna
> By the way, I don't think that there's an official reference that attributes
> the acronym "UTF-9" to any of these encoding forms. I think that if "UTF-9"
> is used it should be agreed by Unicode as being an official unique
> representation. 

I refuse to rename my UTF-81920!

-- 
Jon Hanna

*Thought provoking quote goes here*



Re: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Doug Ewell
Markus Scherer  wrote:

>> BOCU-1 might solve this problem, but multiplying and dividing by 243
>> doesn't sound faster than UTF-8 bit-shifting.  (I'm still amazed by
>> the claim in UTN #6 that converting Hindi text between UTF-16 and
>> BOCU-1 took only 45% as long as converting it between UTF-16 and
>> UTF-8.)
>
> "claim"? That hurts...
>
> I did measure these things, and the numbers in the table are all from
> my measurements. I also included the type of machine I used, etc.
> (http://www.unicode.org/notes/tn6/#Performance)

Certainly I would never accuse Markus of falsifying these statistics.
The word "claim" was not meant in the sense of "unsubstantiated claim."

It did startle me that converting to BOCU-1 and SCSU could be TWICE as
fast as converting to UTF-8, unless the I/O cost of writing two or three
bytes is *much* higher than that of writing only one.

> The reason why BOCU-1 (and SCSU) is often faster than UTF-8 is that
> BOCU-1 goes into single-byte mode for small scripts like Hindi.
> Single-byte mode only performs a subtraction, no div/mod or even bit-
> shifting, and writes/reads only one byte per character. It is also
> optimized in ICU with a tight inner loop.

I'll have to see how my encoder and decoder perform when I finish them.
They're currently written for simplicity, not speed.
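
To make the single-byte idea concrete, here is a toy difference coder (a sketch of the general
approach only; real BOCU-1 differs in many details, including base-243 trail bytes and how the
"previous" state is adjusted - see UTN #6):

    // Toy sketch of BOCU-style difference coding, NOT actual BOCU-1:
    // emit a single byte whenever a code point is close to the previous one.
    static void encode(int[] codePoints, java.io.ByteArrayOutputStream out) {
        int prev = 0x40; // arbitrary initial state for this toy
        for (int cp : codePoints) {
            int diff = cp - prev;
            if (diff >= -64 && diff <= 63) {
                out.write(0xC0 + diff);        // one byte in 0x80..0xFF
            } else {
                out.write(0x01);               // escape marker...
                out.write((cp >> 16) & 0xFF);  // ...then the code point itself
                out.write((cp >> 8) & 0xFF);
                out.write(cp & 0xFF);
            }
            prev = cp;
        }
    }

Text in a small script stays within +/-63 of the previous character almost all the time, so nearly
every character costs one subtraction, one comparison, and one output byte.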

> UTF-8 is useful because it's simple, and supported just about
> everywhere - but it's otherwise hardly optimal for anything.

As John said, it's all about ASCII transparency, together with no false
positives for "ASCII bytes" in non-Basic Latin characters.
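
In code terms: every lead and continuation byte of a multi-byte UTF-8 sequence has its high bit set,
so a byte-level scan for an ASCII delimiter can never match inside a non-ASCII character. A minimal
sketch:

    // Sketch: find the first occurrence of an ASCII byte (e.g. '<' or '\n')
    // in UTF-8 text without decoding. Safe because bytes 0x00..0x7F occur
    // in UTF-8 only as complete ASCII characters.
    static int indexOfAscii(byte[] utf8, byte asciiByte) {
        for (int i = 0; i < utf8.length; i++) {
            if (utf8[i] == asciiByte) return i; // no false positives possible
        }
        return -1;
    }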

> If you want high-speed, compact encoding, use SCSU. If you want good
> speed, compact encoding, and binary order and/or MIME compatibility,
> use BOCU-1. Make sure that both sides of the wire know what's going
> across.

Always.  And especially in the case of BOCU-1, since it's not
ASCII-transparent -- although heuristic detection of BOCU-1 should be
straightforward and very reliable.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Unicode forms for internal storage - BOCU-1 speed

2004-01-23 Thread Doug Ewell
Kenneth Whistler  wrote:

>> I have seen several other informal proposals for "UTF-*" forms/
>> schemes. All this is just confusive, and their authors should imagine
>> their own names for reference. What do you think of this idea?
>
> It is, indeed, "confusive". Some of us have deliberately contributed
> to the confusion with tongue-in-cheek additions. See my own
> UTF-17 (draft-whistler-utf17-00.txt). I would not object if
> henceforward people referred to that as KW-UTF-17, to avoid
> confusion. :-)

A couple of years ago I suggested calling these "XTFs", to distinguish
them from the official "UTFs".

I've added a bit to the confusivity, with ha-ha-only-serious schemes
called DUCK (Doug's Unicode Compression Kludge) and MUCK (Multigraph
Unicode Compression Kludge), plus something I called "dynamic code
pages" which never saw the light of day, and probably never will because
of their really, really bad performance.

But mostly I've carried other people's jokes (and serious proposals) to
the logical extreme and beyond, by creating fully functional and tested
implementations of:

- UTF-4 by Jill Ramonsky (name provided by John Cowan)
- UTF-5 by James Seng, Martin Dürst, and Tin Wee Tan
- UTF-7d5 by Jörg Knappen
- UTF-8C1 by Markus Scherer
- UTF-9 by Jerome Abela (not Mark Crispin's version)
- UTF-17 by Ken
- UTF-24 by Pim Blokland
- UTF-64 by Marco and Paul Keinänen
- UTF-mu by Marco
- UTF-Z by Marco
- XTF-3 by Shlomi Tal

as well as some more serious formats:

- UTF-1 (the "original" Unicode Transformation Format)
- UTF-EBCDIC (described in meticulous detail in UTR #16)

Currently I'm working on a much more useful project: a "clean-room"
encoder and decoder for BOCU-1, possibly the world's first that doesn't
just wrap the UTN #6 sample code.

And on the lighter side, I recently dredged up Misha Wolf's original
1995 description of RCSU, the predecessor of SCSU, and started
experimenting with an encoder and decoder:

http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML001/0242.html

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/