Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Artur Biesiadowski

Eric Blake wrote:

> I'm looking at updating java.lang.Character to JDK 1.4 specs (and
> Unicode 3.0).  Does anyone know what programs generated
> gnu/java/locale/*.uni?  I ask because I need to add more information to
> each character to cover the new DIRECTIONALITY category designations.


I know I say this every time the Character issue comes up, but aren't we
going to use the alternative Character posted here some time ago (2 years?)?
It had a LOT better performance (no object creation during a check, and
the check itself was quite a bit faster), and it encoded all data in
Strings (so there was no need to play tricks with loaders or to depend
on a gnu/something class). The only problem I had was that I was unable
to compile it with old jikes due to some funny unicode.

Even if it is out of date and no longer compiles today, it might be
nice to work with it.  Otherwise, tell me that it WON'T ever be used,
so I can stop bringing this up every time Character is on the
table :)

Just in case:
http://www.informatik.uni-oldenburg.de/~delwi/classpath/


Artur





___
Classpath mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/classpath



Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Eric Blake

Brian Jones wrote:
> 
> As I recall Unicode now requires more bits than a Java 'char' allows.
> I don't know that helps at all?  I don't really know what Sun's
> solution is.  It looks like we did update to unicode data 3.0, but I
> know our implementation fails many Mauve tests related to Character.

Unicode 3.1 introduced characters beyond \uffff, which are encoded with
surrogate pairs.  And the upcoming 3.2 adds even more.  These characters
require two 16-bit fields to represent them (the first in
\ud800 - \udbff, the second in \udc00 - \udfff).  Java currently ignores
these: the 4-byte UTF-8 sequences are illegal in class files (you have
to use a 6-byte sequence instead), and Java identifiers may not include
surrogate characters.  Sun would need to add more methods to the API to
use them, because the point of surrogates is that two chars together
have semantic meaning, while one alone is an error.  For example, it is
impossible to tell if \ud820 in isolation is part of a letter, number,
or punctuation.  So for now, Sun's "solution" is to stall.  I did verify
today that JDK 1.4 is still on Unicode 3.0.0.
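
A minimal sketch of the surrogate arithmetic and the 6-byte encoding
mentioned above.  The class and method names are made up for
illustration, and DataOutputStream.writeUTF is used only because it
emits the same modified UTF-8 that class files use; this is not code
from Classpath or from Sun.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; the names are illustrative only.
class SurrogateDemo {
    /** First (high) surrogate of the pair for a code point above U+FFFF. */
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    /** Second (low) surrogate of the pair for a code point above U+FFFF. */
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    /** Bytes writeUTF emits for s, not counting its 2-byte length prefix. */
    static int modifiedUtf8Length(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(s);
            out.flush();
            return bytes.size() - 2;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int codePoint = 0x10400; // a Deseret letter, outside the 16-bit range
        char hi = highSurrogate(codePoint);
        char lo = lowSurrogate(codePoint);
        // prints "d801 dc00"
        System.out.println(Integer.toHexString(hi) + " " + Integer.toHexString(lo));
        // Each surrogate is written as its own 3-byte sequence: prints "6"
        System.out.println(modifiedUtf8Length("" + hi + lo));
    }
}
```

Neither half of the pair carries a character category on its own, which
is why per-char methods cannot classify it.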

The implementation of Character that I just checked in to Classpath is
identical in behavior to Sun's (fortunately, testing every method on all
64k chars is not terribly time-consuming).  However, I could not run it
through Mauve, as I have still been unable to compile a free VM on
cygwin, and Sun's VM doesn't like me replacing core classes like
Character.  But if Character fails any tests in Mauve now, then I would
suspect that Mauve has the bugs.
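
The kind of exhaustive comparison described above can be sketched as
follows.  This is an assumption about the general approach, not the
actual test that was run: the CharPredicate interface and
countMismatches helper are hypothetical, and the JDK's own isDigit
stands in for the reference implementation.

```java
// Hypothetical harness; the names are illustrative only.
class ExhaustiveCharCheck {
    /** A boolean property of a single 16-bit char, such as isDigit. */
    interface CharPredicate {
        boolean test(char c);
    }

    /** Counts the chars in U+0000..U+FFFF on which the two predicates disagree. */
    static int countMismatches(CharPredicate reference, CharPredicate candidate) {
        int mismatches = 0;
        // Loop over an int: a char counter would wrap around and never terminate.
        for (int i = 0; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (reference.test(c) != candidate.test(c)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        CharPredicate reference = c -> Character.isDigit(c);
        // Candidate that follows the documented definition of isDigit.
        CharPredicate byCategory =
            c -> Character.getType(c) == Character.DECIMAL_DIGIT_NUMBER;
        // Deliberately incomplete candidate that only knows ASCII digits.
        CharPredicate asciiOnly = c -> c >= '0' && c <= '9';

        System.out.println(countMismatches(reference, byCategory));    // prints "0"
        System.out.println(countMismatches(reference, asciiOnly) > 0); // prints "true"
    }
}
```

With only 65536 inputs per method, a full sweep like this finishes
quickly, so there is little reason to sample.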

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer





Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Brian Jones

Eric Blake <[EMAIL PROTECTED]> writes:

> Brian Jones wrote:
> > 
> > As I recall Unicode now requires more bits than a Java 'char' allows.
> > I don't know that helps at all?  I don't really know what Sun's
> > solution is.  It looks like we did update to unicode data 3.0, but I
> > know our implementation fails many Mauve tests related to Character.
> 
> Unicode 3.1 introduced characters beyond \uffff, which are encoded with
> surrogate pairs.  And the upcoming 3.2 adds even more.  These characters
> require two 16-bit fields to represent them (the first in
> \ud800 - \udbff, the second in \udc00 - \udfff).  Java currently ignores
> these: the 4-byte UTF-8 sequences are illegal in class files (you have
> to use a 6-byte sequence instead), and Java identifiers may not include
> surrogate characters.  Sun would need to add more methods to the API to
> use them, because the point of surrogates is that two chars together
> have semantic meaning, while one alone is an error.  For example, it is
> impossible to tell if \ud820 in isolation is part of a letter, number,
> or punctuation.  So for now, Sun's "solution" is to stall.  I did verify
> today that JDK 1.4 is still on Unicode 3.0.0.
> 
> The implementation of Character that I just checked in to Classpath is
> identical in behavior to Sun's (fortunately, testing every method on all
> 64k chars is not terribly time-consuming).  However, I could not run it
> through Mauve, as I have still been unable to compile a free VM on
> cygwin, and Sun's VM doesn't like me replacing core classes like
> Character.  But if Character fails any tests in Mauve now, then I would
> suspect that Mauve has the bugs.

I'll run what you've checked in through Mauve here and see what
happens.  Do you have time to evaluate the Character implementation
Artur pointed to?  I'm mostly concerned with correctness; I think the
one he pointed to improved efficiency, if not speed.  I'd do this
myself, but that would involve spending time learning how
Character/Unicode work.

Brian
-- 
Brian Jones <[EMAIL PROTECTED]>




Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Eric Blake

Artur Biesiadowski wrote:
> 
> I know I say this every time the Character issue comes up, but aren't we
> going to use the alternative Character posted here some time ago (2 years?)?
> It had a LOT better performance (no object creation during a check, and
> the check itself was quite a bit faster), and it encoded all data in
> Strings (so there was no need to play tricks with loaders or to depend
> on a gnu/something class). The only problem I had was that I was unable
> to compile it with old jikes due to some funny unicode.

Hmm, I did find it kind of fishy that the current implementation was
creating so many throwaway objects.  However, while the changes I made
further separate the two implementations, I'll look at incorporating the
benefits I see in the alternative:

1. loading from Strings instead of Java File IO is nicer, making the
static initialization slightly faster, but more importantly less
dependent on other classes (note that I would still be using file IO, as
the data must be in a separate file to be easily upgradeable; but
implicitly through the VM ClassLoader and not explicitly)

2. caching all attributes in arrays requires more runtime memory, but if
the arrays are compressed enough, this is a hands-down win over frequent
object creation

3. character class checks, such as isLetter(), are more efficient, using
a shift and single comparison to a constant instead of a series of
conditional comparisons
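
Point 3 can be sketched as below.  This is only an illustration of the
shift-and-compare idea, not the code from either implementation:
LetterMaskDemo and its members are hypothetical names, and
Character.getType stands in for the lookup in the compressed attribute
arrays of point 2.

```java
// Hypothetical demo class; the names are illustrative only.
class LetterMaskDemo {
    /** One bit per general category that counts as a letter (Lu, Ll, Lt, Lm, Lo). */
    static final int LETTER_MASK =
          (1 << Character.UPPERCASE_LETTER)
        | (1 << Character.LOWERCASE_LETTER)
        | (1 << Character.TITLECASE_LETTER)
        | (1 << Character.MODIFIER_LETTER)
        | (1 << Character.OTHER_LETTER);

    /** Letter check using one shift and one comparison, with no chain of tests. */
    static boolean isLetter(char c) {
        int category = Character.getType(c); // stand-in for an array lookup
        return ((LETTER_MASK >> category) & 1) != 0;
    }

    /** Counts the 16-bit chars on which the mask check and the JDK disagree. */
    static int mismatchesAgainstJdk() {
        int mismatches = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (isLetter(c) != Character.isLetter(c)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println(isLetter('A'));          // prints "true"
        System.out.println(isLetter('7'));          // prints "false"
        System.out.println(mismatchesAgainstJdk()); // prints "0"
    }
}
```

The same pattern extends to the other class checks by swapping in a
different mask constant, so each check costs the same regardless of how
many categories it covers.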

> 
> Even if it is out of date and no longer compiles today, it might be
> nice to work with it.  Otherwise, tell me that it WON'T ever be used,
> so I can stop bringing this up every time Character is on the
> table :)

Thanks for bringing up the issue.  I hope to make some of those
improvements in the near future.

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer




Mauve HOWTO

2002-02-17 Thread Brian Jones

This document is in rough form, but I figured someone might find it
useful.

Mauve HOWTO

Set JAVAC, JAVA environment variables

    export JAVA=orp
    export JAVAC=jikes

(orp is a shell script that calls the real orp with some arguments)

    #!/bin/sh
    # Wrapper: run the real orp with fixed options and pass all
    # arguments through unchanged.
    ~/orp-1.0.9/mains/orp/Linux/dbg/orp -swapjit 0 1 -classpath \
        "$CLASSPATH" "$@"

Configure and create Makefile

    ./configure

Run All Tests

    make KEYS=classpath check

Or Run a Single Test

    echo "gnu.testlet.java.io.File.jdk11" | \
        orp gnu.testlet.SimpleTestHarness

To recompile, delete classes.stamp.
To reconfigure classes based on KEYS input, delete choices and classes.
To change JAVA or JAVAC, change the environment and rerun configure.

-- 
Brian Jones <[EMAIL PROTECTED]>




Re: classpath ./ChangeLog java/lang/Character.java ...

2002-02-17 Thread Tom Tromey

>>>>> "Eric" == Eric Blake <[EMAIL PROTECTED]> writes:

Eric> Added files:
Eric>   doc/unicode: Blocks-3.txt ReadMe-3.0.0.txt
Eric>                UnicodeData-3.0.0.html UnicodeData-3.0.0.txt
Eric>                unicode-blocks.pl

Last time I looked we were allowed to use the Unicode data tables, and
distribute files generated from them, but we couldn't distribute the
tables themselves.  These restrictions were set by the Unicode
Consortium, and I found them somewhere on the web site.  Has this
changed?

Tom




Re: classpath ./ChangeLog java/lang/Character.java ...

2002-02-17 Thread Eric Blake

Tom Tromey wrote:
> 
> Last time I looked we were allowed to use the Unicode data tables, and
> distribute files generated from them, but we couldn't distribute the
> tables themselves.  These restrictions were set by the Unicode
> Consortium, and I found them somewhere on the web site.  Has this
> changed?

Hmm, I'll look into that.  I just committed the 3.0.0 files since the
2.1.2 files were already in CVS, but if I need to, I'll remove the
documents and replace them with a HOWTO on obtaining them from the
source.  I'll be checking the Unicode site for their license policies.

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer





Re: classpath ./ChangeLog java/lang/Character.java ...

2002-02-17 Thread Eric Blake

I found the license, and am committing that file to CVS as well.  As I
read it, we are perfectly justified in including the UnicodeData files
in CVS and our distributions.

"Recipient is granted the right to make copies in any form for internal
distribution and to freely use the information supplied in the creation
of products supporting the Unicode(TM) Standard. The files in the Unicode
Character Database can be redistributed to third parties or other
organizations (whether for profit or not) as long as this notice and the
disclaimer notice are retained. Information can be extracted from these
files and used in documentation or programs, as long as there is an
accompanying notice indicating the source."

Tom Tromey wrote:
>
> Last time I looked we were allowed to use the Unicode data tables, and
> distribute files generated from them, but we couldn't distribute the
> tables themselves.  These restrictions were set by the Unicode
> Consortium, and I found them somewhere on the web site.  Has this
> changed?

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer




Re: classpath ./ChangeLog java/lang/Character.java ...

2002-02-17 Thread Tom Tromey

>>>>> "Eric" == Eric Blake <[EMAIL PROTECTED]> writes:

Eric> I found the license, and am committing that file to CVS as well.
Eric> As I read it, we are perfectly justified in including the
Eric> UnicodeData files in CVS and our distributions.

Thanks for following up on this.

Tom
