
Re: generation of gnu/java/locale/*.uni

2002-02-18 Thread Mark Wielaard


Hi,

On Mon, 2002-02-18 at 21:05, Eric Blake wrote:
> 
> I just committed an update based on Artur's code.  And since the
> database is no longer a binary file, you should have no problems using
> it straight out of CVS for another run through Mauve.

Running it through mauve (jikes 1.15 + orp 1.0.9) still gives lots of
failures for the Character tests: 168221 of 3603824 tests failed.
Here are a couple of failures that are repeated often:

FAIL: gnu.testlet.java.lang.Character.unicode: Character 0:UNDEFINED has
wrong numeric value of -1 instead of 35 (number 1) (24209 times)

FAIL: gnu.testlet.java.lang.Character.unicode: Character 0:UNDEFINED
incorectly reported as javaidentifierpart (number 1) (14007 times)

FAIL: gnu.testlet.java.lang.Character.unicode: Character 0:UNDEFINED
incorectly reported as unicodeidentifierpart (number 1) (13976 times)

FAIL: gnu.testlet.java.lang.Character.unicode: Character 24:UNDEFINED
incorectly reported as javaindetifierstart (number 1) (13164 times)

FAIL: gnu.testlet.java.lang.Character.unicode: Character 41:UNDEFINED
incorectly reported as unicodeidentifierstart (number 1) (13122 times)

FAIL: gnu.testlet.java.lang.Character.unicode: Character 1bb:UNDEFINED
is reported to be type Lo instead of Cn (number 1) (11506 times)

These could easily be bugs in Mauve though. Mauve uses its own
UnicodeData.txt file.

Cheers,

Mark

___
Classpath mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/classpath

Re: generation of gnu/java/locale/*.uni

2002-02-18 Thread Eric Blake


Artur Biesiadowski wrote:
> >
> > I just committed an update based on Artur's code.  And since the
> > database is no longer a binary file, you should have no problems using
> > it straight out of CVS for another run through Mauve.
> 
> Please note that this is not my code - it was written by Jochen Hoenicke.

Sorry. I'll correct those references in CVS to give credit where it is
due.

> 
> I think that you might also find it useful:
> 
> http://www.mail-archive.com/classpath@gnu.org/msg02024.html

Hmm, more fun reading for me to do...

> 
> I don't know how much the implementation differs now and unfortunately I
> cannot check right now, but I suppose that playing with block
> sizes should not be hard.

The unicode-muncher.pl script checks all block sizes from 3-8 in
selecting the best size.  The best block size for Unicode 3.0.0 turns
out to be 5; for Unicode 3.2.0 it is 4.
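The block-size search can be sketched roughly as follows. This is only an illustration in Java, not the Perl script itself: it assumes that the sizes 3-8 are shift amounts (blocks of 2^3 to 2^8 entries) and that the cost of a table is its index plus its distinct blocks. The class `BlockTable` and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical two-level table; names and layout are illustrative only.
final class BlockTable {
    final int shift;        // assumed meaning of "block size": log2 of entries per block
    final int[] blockStart; // block number -> offset of that block inside data
    final byte[] data;      // distinct blocks stored back to back

    private BlockTable(int shift, int[] blockStart, byte[] data) {
        this.shift = shift;
        this.blockStart = blockStart;
        this.data = data;
    }

    // Cuts attrs into blocks of 2^shift entries and stores each distinct block once.
    // Assumes attrs.length is a multiple of the block length (true for 65536 entries).
    static BlockTable build(byte[] attrs, int shift) {
        int len = 1 << shift;
        int[] blockStart = new int[attrs.length >> shift];
        List<byte[]> unique = new ArrayList<>();
        for (int b = 0; b < blockStart.length; b++) {
            byte[] block = Arrays.copyOfRange(attrs, b * len, (b + 1) * len);
            int found = -1;
            for (int u = 0; u < unique.size() && found < 0; u++) {
                if (Arrays.equals(unique.get(u), block)) {
                    found = u;
                }
            }
            if (found < 0) {
                found = unique.size();
                unique.add(block);
            }
            blockStart[b] = found * len;
        }
        byte[] data = new byte[unique.size() * len];
        for (int u = 0; u < unique.size(); u++) {
            System.arraycopy(unique.get(u), 0, data, u * len, len);
        }
        return new BlockTable(shift, blockStart, data);
    }

    // Two array reads per lookup: find the block, then the entry inside it.
    byte get(char c) {
        return data[blockStart[c >> shift] + (c & ((1 << shift) - 1))];
    }

    // Rough size estimate, counting two bytes per index entry and one per attribute.
    int cost() {
        return 2 * blockStart.length + data.length;
    }

    // Tries every shift in [minShift, maxShift] and keeps the smallest table.
    static BlockTable best(byte[] attrs, int minShift, int maxShift) {
        BlockTable smallest = null;
        for (int s = minShift; s <= maxShift; s++) {
            BlockTable t = build(attrs, s);
            if (smallest == null || t.cost() < smallest.cost()) {
                smallest = t;
            }
        }
        return smallest;
    }
}
```

Calling `BlockTable.best(attrs, 3, 8)` on a 65536-entry attribute array mirrors the 3-8 search described above. Which shift wins depends on the data and on how the cost is counted, so the numbers from this sketch need not match the script's.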

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer


Re: generation of gnu/java/locale/*.uni

2002-02-18 Thread Artur Biesiadowski

Eric Blake wrote:
> Brian Jones wrote:
> 
>>I'll run what you've checked in through Mauve here and see what
>>happens.  Do you have time to evaluate the Character implementation
>>Artur pointed to?  I'm mostly concerned with correctness, I think the
>>one he pointed to improved efficiency, if not speed.  I'd do this
>>myself but that would involve time learning how Character/Unicode work.
>>
> 
> I just committed an update based on Artur's code.  And since the
> database is no longer a binary file, you should have no problems using
> it straight out of CVS for another run through Mauve.

Please note that this is not my code - it was written by Jochen Hoenicke.

I think that you might also find it useful:

http://www.mail-archive.com/classpath@gnu.org/msg02024.html

I don't know how much the implementation differs now and unfortunately I
cannot check right now, but I suppose that playing with block
sizes should not be hard.

Thanks for committing this stuff - it should make a lot of stuff work
a lot faster...

Artur


Re: generation of gnu/java/locale/*.uni

2002-02-18 Thread Eric Blake


Brian Jones wrote:
> 
> I'll run what you've checked in through Mauve here and see what
> happens.  Do you have time to evaluate the Character implementation
> Artur pointed to?  I'm mostly concerned with correctness, I think the
> one he pointed to improved efficiency, if not speed.  I'd do this
> myself but that would involve time learning how Character/Unicode work.

I just committed an update based on Artur's code.  And since the
database is no longer a binary file, you should have no problems using
it straight out of CVS for another run through Mauve.

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer


Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Eric Blake

Artur Biesiadowski wrote:
> 
> I know that I say this every time the Character issue comes up, but
> aren't we going to use the alternative Character posted here some time
> ago (2 years?)? It had a LOT better performance (no object creation
> during a check, and the check itself is quite a bit faster) and encoded
> all data in Strings (so there was no need to play tricks with loaders or
> depend on a gnu/something class). The only problem I had was that I was
> unable to compile it with old jikes due to some funny unicode.

Hmm, I did find it kind of fishy that the current implementation was
creating so many throwaway objects.  However, while the changes I made
further separate the two implementations, I'll look at incorporating the
benefits I see in the alternative:

1. loading from Strings instead of Java File IO is nicer, making the
static initialization slightly faster, but more importantly less
dependent on other classes (note that I would still be using file IO, as
the data must be in a separate file to be easily upgradeable; but
implicitly through the VM ClassLoader and not explicitly)

2. caching all attributes in arrays requires more runtime memory, but if
the arrays are compressed enough, this is a hands-down win over frequent
object creation

3. character class checks, such as isLetter(), are more efficient, using
a shift and single comparison to a constant instead of a series of
conditional comparisons
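One way to read point 3 is as a bit mask over the general category numbers. The sketch below is only an illustration of that idea, not the code from the alternative implementation: `CategoryMask` is a hypothetical class, and `Character.getType()` stands in for whatever table lookup would supply the category.

```java
// Hypothetical sketch: one bit per general category, tested with a shift.
final class CategoryMask {
    // Bit n is set when category n (as returned by Character.getType) is a letter.
    static final int LETTER_MASK =
          (1 << Character.UPPERCASE_LETTER)
        | (1 << Character.LOWERCASE_LETTER)
        | (1 << Character.TITLECASE_LETTER)
        | (1 << Character.MODIFIER_LETTER)
        | (1 << Character.OTHER_LETTER);

    // Shift the mask right by the category and test the low bit,
    // instead of comparing the category against five constants in turn.
    static boolean isLetter(char c) {
        return ((LETTER_MASK >> Character.getType(c)) & 1) != 0;
    }
}
```

The same pattern works for any check that is a union of categories, as long as the category numbers fit in the width of the mask.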

> 
> Even if it is out of date and no longer compiles today, it might be
> nice to work with it? Otherwise, tell me that it WON'T ever be used,
> so I can stop raising this subject every time Character comes
> up :)

Thanks for bringing up the issue.  I hope to make some of those
improvements in the near future.

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer


Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Brian Jones


Eric Blake <[EMAIL PROTECTED]> writes:

> Brian Jones wrote:
> > 
> > As I recall Unicode now requires more bits than a Java 'char' allows.
> > I don't know whether that helps at all.  I don't really know what Sun's
> > solution is.  It looks like we did update to unicode data 3.0, but I
> > know our implementation fails many Mauve tests related to Character.
> 
> Unicode 3.1 introduced several code points in the surrogate space.  And
> the upcoming 3.2 adds even more.  These characters require two 16-bit
> fields to represent them (the first in \ud800 - \udb7f, the second in
> \udc00 - \udfff).  And Java does ignore these - the 4-byte abbreviation
> sequences of UTF-8 are illegal in class files (you have to use a 6-byte
> sequence instead), and Java identifiers may not include surrogate
> characters.  Sun would need to add more methods to the API to use them,
> because the point of surrogates is that two characters together have
> semantic meaning, while one alone is an error.  For example, it is
> impossible to tell if \ud820 in isolation is part of a letter, number,
> or punctuation.  So for now, Sun's "solution" is to stall.  I did verify
> today that JDK 1.4 is still on Unicode 3.0.0.
> 
> The implementation of Character that I just checked in to Classpath is
> identical in behavior to Sun's (fortunately, testing every method on all
> 64k chars is not terribly time-consuming).  However, I could not run it
> through Mauve, as I have still been unable to compile a free VM on
> cygwin, and Sun's VM doesn't like me replacing core classes like
> Character.  But if Character fails any tests in Mauve now, then I would
> suspect that Mauve has the bugs.

I'll run what you've checked in through Mauve here and see what
happens.  Do you have time to evaluate the Character implementation
Artur pointed to?  I'm mostly concerned with correctness, I think the
one he pointed to improved efficiency, if not speed.  I'd do this
myself but that would involve time learning how Character/Unicode work.

Brian
-- 
Brian Jones <[EMAIL PROTECTED]>


Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Eric Blake

Brian Jones wrote:
> 
> As I recall Unicode now requires more bits than a Java 'char' allows.
> I don't know whether that helps at all.  I don't really know what Sun's
> solution is.  It looks like we did update to unicode data 3.0, but I
> know our implementation fails many Mauve tests related to Character.

Unicode 3.1 introduced several code points in the surrogate space.  And
the upcoming 3.2 adds even more.  These characters require two 16-bit
fields to represent them (the first in \ud800 - \udb7f, the second in
\udc00 - \udfff).  And Java does ignore these - the 4-byte abbreviation
sequences of UTF-8 are illegal in class files (you have to use a 6-byte
sequence instead), and Java identifiers may not include surrogate
characters.  Sun would need to add more methods to the API to use them,
because the point of surrogates is that two characters together have
semantic meaning, while one alone is an error.  For example, it is
impossible to tell if \ud820 in isolation is part of a letter, number,
or punctuation.  So for now, Sun's "solution" is to stall.  I did verify
today that JDK 1.4 is still on Unicode 3.0.0.
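To make the two-field encoding concrete, here is a minimal sketch of how a surrogate pair maps to a single code point under UTF-16. `SurrogateSketch` is a hypothetical name. The range check uses the full high-surrogate range U+D800 to U+DBFF; the upper part of it, U+DB80 to U+DBFF, is reserved for private-use high surrogates.

```java
// Hypothetical helper showing the UTF-16 surrogate arithmetic.
final class SurrogateSketch {
    // Combines a high and a low surrogate into one code point above U+FFFF.
    static int combine(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        // Each surrogate carries 10 bits; the result is offset by 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

A lone value such as \ud820 carries only half of those 20 bits, which is why nothing can be said about it in isolation. Later Java releases (1.5 onwards) expose the same arithmetic as `Character.toCodePoint(char, char)`.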

The implementation of Character that I just checked in to Classpath is
identical in behavior to Sun's (fortunately, testing every method on all
64k chars is not terribly time-consuming).  However, I could not run it
through Mauve, as I have still been unable to compile a free VM on
cygwin, and Sun's VM doesn't like me replacing core classes like
Character.  But if Character fails any tests in Mauve now, then I would
suspect that Mauve has the bugs.
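An exhaustive comparison over all 64k chars can be sketched as below. This is not the harness actually used for the check described above; `ExhaustiveCheck` is a hypothetical name, and the two predicates stand for the same method taken from a reference implementation and from the implementation under test.

```java
import java.util.function.IntPredicate;

// Hypothetical harness: counts the 16-bit values on which two checks disagree.
final class ExhaustiveCheck {
    static int mismatches(IntPredicate reference, IntPredicate candidate) {
        int count = 0;
        // The loop variable is an int so that it can pass Character.MAX_VALUE and stop.
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (reference.test(c) != candidate.test(c)) {
                count++;
            }
        }
        return count;
    }
}
```

For example, `ExhaustiveCheck.mismatches(c -> Character.isDigit((char) c), c -> Character.getType((char) c) == Character.DECIMAL_DIGIT_NUMBER)` compares one method against its documented definition across the whole range.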

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer


Re: generation of gnu/java/locale/*.uni

2002-02-17 Thread Artur Biesiadowski

Eric Blake wrote:

> I'm looking at updating java.lang.Character to JDK 1.4 specs (and
> Unicode 3.0).  Does anyone know what programs generated
> gnu/java/locale/*.uni?  I ask because I need to add more information to
> each character to cover the new DIRECTIONALITY category designations.

I know that I say this every time the Character issue comes up, but
aren't we going to use the alternative Character posted here some time
ago (2 years?)? It had a LOT better performance (no object creation
during a check, and the check itself is quite a bit faster) and encoded
all data in Strings (so there was no need to play tricks with loaders or
depend on a gnu/something class). The only problem I had was that I was
unable to compile it with old jikes due to some funny unicode.

Even if it is out of date and no longer compiles today, it might be
nice to work with it? Otherwise, tell me that it WON'T ever be used,
so I can stop raising this subject every time Character comes
up :)

Just in case:
http://www.informatik.uni-oldenburg.de/~delwi/classpath/

Artur


Re: generation of gnu/java/locale/*.uni

2002-02-16 Thread Brian Jones


Eric Blake <[EMAIL PROTECTED]> writes:

> Brian Jones wrote:
> > 
> > doc/unicode/unicode-muncher.pl
> 
> Thanks.  Should I try to update Classpath to Unicode 3.2.0 (currently in
> beta, expected to be final next month), 3.1.1 (the current stable
> state), or 3.0 (the version mentioned in the 1.4 javadoc of
> java.lang.Character)?  
> 
> JLS 3.1 states "The Java platform will track the Unicode specification
> as it evolves. The precise version of Unicode used by a given release is
> specified in the documentation of the class Character."  I think the
> choice is between documenting that we are using the latest definition
> 3.2.0, or else sticking with 3.0.0 to match the behavior of Sun, but
> want to know what others think.

As I recall Unicode now requires more bits than a Java 'char' allows.
I don't know whether that helps at all.  I don't really know what Sun's
solution is.  It looks like we did update to unicode data 3.0, but I
know our implementation fails many Mauve tests related to Character.

Brian
-- 
Brian Jones <[EMAIL PROTECTED]>


Re: generation of gnu/java/locale/*.uni

2002-02-16 Thread Eric Blake

Brian Jones wrote:
> 
> doc/unicode/unicode-muncher.pl

Thanks.  Should I try to update Classpath to Unicode 3.2.0 (currently in
beta, expected to be final next month), 3.1.1 (the current stable
state), or 3.0 (the version mentioned in the 1.4 javadoc of
java.lang.Character)?  

JLS 3.1 states "The Java platform will track the Unicode specification
as it evolves. The precise version of Unicode used by a given release is
specified in the documentation of the class Character."  I think the
choice is between documenting that we are using the latest definition
3.2.0, or else sticking with 3.0.0 to match the behavior of Sun, but
want to know what others think.

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer


Re: generation of gnu/java/locale/*.uni

2002-02-16 Thread Brian Jones


Eric Blake <[EMAIL PROTECTED]> writes:

> I'm looking at updating java.lang.Character to JDK 1.4 specs (and
> Unicode 3.0).  Does anyone know what programs generated
> gnu/java/locale/*.uni?  I ask because I need to add more information to
> each character to cover the new DIRECTIONALITY category designations.
> 
> If the generation program is not currently in the distribution, I'll
> write up a quick Java program and add it to Classpath.

doc/unicode/unicode-muncher.pl

Brian
-- 
Brian Jones <[EMAIL PROTECTED]>


generation of gnu/java/locale/*.uni

2002-02-16 Thread Eric Blake


I'm looking at updating java.lang.Character to JDK 1.4 specs (and
Unicode 3.0).  Does anyone know what programs generated
gnu/java/locale/*.uni?  I ask because I need to add more information to
each character to cover the new DIRECTIONALITY category designations.

If the generation program is not currently in the distribution, I'll
write up a quick Java program and add it to Classpath.

-- 
This signature intentionally left boring.

Eric Blake [EMAIL PROTECTED]
  BYU student, free software programmer

