Re: generation of gnu/java/locale/*.uni
Hi,

On Mon, 2002-02-18 at 21:05, Eric Blake wrote:
> I just committed an update based on Artur's code. And since the
> database is no longer a binary file, you should have no problems using
> it straight out of CVS for another run through Mauve.

Running it through Mauve (jikes 1.15 + orp 1.0.9) still gives lots of
failures for the Character tests: 168221 of 3603824 tests failed. Here
are a couple of failures that are repeated often:

FAIL: gnu.testlet.java.lang.Character.unicode: Character 0:UNDEFINED has wrong numeric value of -1 instead of 35 (number 1) (24209 times)
FAIL: gnu.testlet.java.lang.Character.unicode: Character 0:UNDEFINED incorectly reported as javaidentifierpart (number 1) (14007 times)
FAIL: gnu.testlet.java.lang.Character.unicode: Character 0:UNDEFINED incorectly reported as unicodeidentifierpart (number 1) (13976 times)
FAIL: gnu.testlet.java.lang.Character.unicode: Character 24:UNDEFINED incorectly reported as javaindetifierstart (number 1) (13164 times)
FAIL: gnu.testlet.java.lang.Character.unicode: Character 41:UNDEFINED incorectly reported as unicodeidentifierstart (number 1) (13122 times)
FAIL: gnu.testlet.java.lang.Character.unicode: Character 1bb:UNDEFINED is reported to be type Lo instead of Cn (number 1) (11506 times)

These could easily be bugs in Mauve though. Mauve uses its own
UnicodeData.txt file.

Cheers,

Mark
___
Classpath mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/classpath
Re: generation of gnu/java/locale/*.uni
Artur Biesiadowski wrote:
> > I just committed an update based on Artur's code. And since the
> > database is no longer a binary file, you should have no problems using
> > it straight out of CVS for another run through Mauve.
>
> Please note that this is not my code - it was written by Jochen Hoenicke.

Sorry. I'll correct those references in CVS to give credit where it is
due.

> I think that you might also find it useful:
> http://www.mail-archive.com/classpath@gnu.org/msg02024.html

Hmm, more fun reading for me to do...

> I don't know how much the implementation differs now and unfortunately I
> cannot make a check right now, but I suppose that playing with block
> sizes should not be hard.

The unicode-muncher.pl script checks all block sizes from 3-8 in
selecting the best size. The best block size for Unicode 3.0.0 turns
out to be 5; for Unicode 3.2.0 it is 4.

-- 
This signature intentionally left boring.
Eric Blake [EMAIL PROTECTED]
BYU student, free software programmer
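The block-size search described above can be sketched roughly as follows. This is a minimal Java illustration, not the Perl script itself. It assumes that "block size" means the number of low bits of the character used as an offset within a block (so a size of 5 means blocks of 32 characters), and that the table is a two-level structure in which identical blocks are stored only once. The class name `BlockSizeSearch` and both method names are hypothetical, and the running JVM's `Character.getType` stands in for data that would really come from UnicodeData.txt.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Classpath: picks a block shift for a two-level table. */
final class BlockSizeSearch {

    /**
     * Number of table entries needed when attrs is split into blocks of
     * (1 << shift) entries and identical blocks are stored only once:
     * one index entry per block plus the data of each distinct block.
     */
    static int compressedSize(byte[] attrs, int shift) {
        int blockLen = 1 << shift;
        Set<String> distinct = new HashSet<>();
        int blocks = 0;
        for (int start = 0; start < attrs.length; start += blockLen) {
            int end = Math.min(start + blockLen, attrs.length);
            // Use the printed form of the block as a simple equality key.
            distinct.add(Arrays.toString(Arrays.copyOfRange(attrs, start, end)));
            blocks++;
        }
        return blocks + distinct.size() * blockLen;
    }

    /** Tries every shift in [minShift, maxShift] and returns the one giving the smallest table. */
    static int bestShift(byte[] attrs, int minShift, int maxShift) {
        int best = minShift;
        for (int shift = minShift + 1; shift <= maxShift; shift++) {
            if (compressedSize(attrs, shift) < compressedSize(attrs, best)) {
                best = shift;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in data: one general-category value per 16-bit char, taken from the running JVM.
        byte[] attrs = new byte[1 << 16];
        for (int ch = 0; ch < attrs.length; ch++) {
            attrs[ch] = (byte) Character.getType((char) ch);
        }
        int shift = bestShift(attrs, 3, 8);
        System.out.println("best shift: " + shift + ", entries: " + compressedSize(attrs, shift));
    }
}
```

The result depends on the Unicode version behind the data, which fits the observation that the best size differs between 3.0.0 and 3.2.0. A real generator would probably also weigh the width of index entries versus data entries, which this sketch ignores.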
Re: generation of gnu/java/locale/*.uni
Eric Blake wrote:
> Brian Jones wrote:
>
> > I'll run what you've checked in through Mauve here and see what
> > happens. Do you have time to evaluate the Character implementation
> > Artur pointed to? I'm mostly concerned with correctness, I think the
> > one he pointed to improved efficiency, if not speed. I'd do this
> > myself but that would involve time learning how Character/Unicode work.
>
> I just committed an update based on Artur's code. And since the
> database is no longer a binary file, you should have no problems using
> it straight out of CVS for another run through Mauve.

Please note that this is not my code - it was written by Jochen Hoenicke.

I think that you might also find it useful:
http://www.mail-archive.com/classpath@gnu.org/msg02024.html

I don't know how much the implementation differs now and unfortunately I
cannot make a check right now, but I suppose that playing with block
sizes should not be hard.

Thanks for committing this stuff - it should make a lot of stuff work a
lot faster...

Artur
Re: generation of gnu/java/locale/*.uni
Brian Jones wrote:
> I'll run what you've checked in through Mauve here and see what
> happens. Do you have time to evaluate the Character implementation
> Artur pointed to? I'm mostly concerned with correctness, I think the
> one he pointed to improved efficiency, if not speed. I'd do this
> myself but that would involve time learning how Character/Unicode work.

I just committed an update based on Artur's code. And since the
database is no longer a binary file, you should have no problems using
it straight out of CVS for another run through Mauve.

-- 
This signature intentionally left boring.
Eric Blake [EMAIL PROTECTED]
BYU student, free software programmer
Re: generation of gnu/java/locale/*.uni
Artur Biesiadowski wrote:
> I know that I say this every time the Character issue comes up - but
> aren't we going to use the alternative Character posted here some time
> ago (2 years?)? It had a LOT better performance (no object creation
> during a check, and the check itself is quite a bit faster), and it
> encoded all data in Strings (so there was no need to play tricks with
> loaders or depend on a gnu/something class). The only problem I had was
> that I was unable to compile it with old jikes due to some funny unicode.

Hmm, I did find it kind of fishy that the current implementation was
creating so many throwaway objects. However, while the changes I made
further separate the two implementations, I'll look at incorporating
the benefits I see in the alternative:

1. loading from Strings instead of Java File IO is nicer, making the
   static initialization slightly faster, but more importantly less
   dependent on other classes (note that I would still be using file
   IO, as the data must be in a separate file to be easily upgradeable;
   but implicitly through the VM ClassLoader and not explicitly)
2. caching all attributes in arrays requires more runtime memory, but
   if the arrays are compressed enough, this is a hands-down win over
   frequent object creation
3. character class checks, such as isLetter(), are more efficient,
   using a shift and single comparison to a constant instead of a
   series of conditional comparisons

> Even if it might be out-of-date and non-compilable today, it might be
> nice to work with it? Or otherwise tell me that it WON'T be used
> ever, so I can stop raising this subject every time Character comes
> up :)

Thanks for bringing up the issue. I hope to make some of those
improvements in the near future.

-- 
This signature intentionally left boring.
Eric Blake [EMAIL PROTECTED]
BYU student, free software programmer
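Point 3 above (a shift and a single comparison instead of a chain of conditionals) might look something like the following. This is only a sketch of the general bitmask technique, not the code from either implementation discussed in this thread. The class name `CategoryMask` and the constant `LETTER_MASK` are made up for illustration, and `Character.getType` stands in for whatever cached attribute lookup a real implementation would use.

```java
/** Hypothetical illustration of a bitmask-based character class check. */
final class CategoryMask {

    // One bit per general category that counts as a letter; the category
    // constants in java.lang.Character are small integers, so each fits in an int mask.
    static final int LETTER_MASK =
            (1 << Character.UPPERCASE_LETTER)
            | (1 << Character.LOWERCASE_LETTER)
            | (1 << Character.TITLECASE_LETTER)
            | (1 << Character.MODIFIER_LETTER)
            | (1 << Character.OTHER_LETTER);

    /** Shift a single bit by the category, then test it against the constant mask. */
    static boolean isLetter(char ch) {
        return ((1 << Character.getType(ch)) & LETTER_MASK) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isLetter('a'));
        System.out.println(isLetter('1'));
    }
}
```

The same pattern extends to other checks by swapping in a different mask, so each check costs one table lookup, one shift, and one test regardless of how many categories it covers.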
Re: generation of gnu/java/locale/*.uni
Eric Blake <[EMAIL PROTECTED]> writes:

> Brian Jones wrote:
> >
> > As I recall Unicode now requires more bits than a Java 'char' allows.
> > I don't know that helps at all? I don't really know what Sun's
> > solution is. It looks like we did update to unicode data 3.0, but I
> > know our implementation fails many Mauve tests related to Character.
>
> Unicode 3.1 introduced several code points in the surrogate space. And
> the upcoming 3.2 adds even more. These characters require two 16-bit
> fields to represent them (the first in \ud800 - \udb7f, the second in
> \udc00 - \udfff). And Java does ignore these - the 4-byte abbreviation
> sequences of UTF-8 are illegal in class files (you have to use a 6-byte
> sequence instead), and Java identifiers may not include surrogate
> characters. Sun would need to add more methods to the API to use them,
> because the point of surrogates is that two characters together have
> semantic meaning, while one alone is an error. For example, it is
> impossible to tell if \ud820 in isolation is part of a letter, number,
> or punctuation. So for now, Sun's "solution" is to stall. I did verify
> today that JDK 1.4 is still on Unicode 3.0.0.
>
> The implementation of Character that I just checked in to Classpath is
> identical in behavior to Sun's (fortunately, testing every method on all
> 64k chars is not terribly time-consuming). However, I could not run it
> through Mauve; as I still have been unable to compile a free VM on
> cygwin, and Sun's VM doesn't like me replacing core classes like
> Character. But if Character fails any tests in Mauve now, then I would
> suspect that Mauve has the bugs.

I'll run what you've checked in through Mauve here and see what
happens. Do you have time to evaluate the Character implementation
Artur pointed to? I'm mostly concerned with correctness, I think the
one he pointed to improved efficiency, if not speed. I'd do this
myself but that would involve time learning how Character/Unicode work.

Brian
-- 
Brian Jones <[EMAIL PROTECTED]>
Re: generation of gnu/java/locale/*.uni
Brian Jones wrote:
> As I recall Unicode now requires more bits than a Java 'char' allows.
> I don't know that helps at all? I don't really know what Sun's
> solution is. It looks like we did update to unicode data 3.0, but I
> know our implementation fails many Mauve tests related to Character.

Unicode 3.1 introduced several code points in the surrogate space. And
the upcoming 3.2 adds even more. These characters require two 16-bit
fields to represent them (the first in \ud800 - \udb7f, the second in
\udc00 - \udfff). And Java does ignore these - the 4-byte abbreviation
sequences of UTF-8 are illegal in class files (you have to use a 6-byte
sequence instead), and Java identifiers may not include surrogate
characters. Sun would need to add more methods to the API to use them,
because the point of surrogates is that two characters together have
semantic meaning, while one alone is an error. For example, it is
impossible to tell if \ud820 in isolation is part of a letter, number,
or punctuation. So for now, Sun's "solution" is to stall. I did verify
today that JDK 1.4 is still on Unicode 3.0.0.

The implementation of Character that I just checked in to Classpath is
identical in behavior to Sun's (fortunately, testing every method on all
64k chars is not terribly time-consuming). However, I could not run it
through Mauve; as I still have been unable to compile a free VM on
cygwin, and Sun's VM doesn't like me replacing core classes like
Character. But if Character fails any tests in Mauve now, then I would
suspect that Mauve has the bugs.

-- 
This signature intentionally left boring.
Eric Blake [EMAIL PROTECTED]
BYU student, free software programmer
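To make the surrogate arithmetic and the 4-byte versus 6-byte point above concrete, here is a small Java sketch. The class `SurrogateDemo` and its methods are hypothetical names for illustration. It accepts the full high-surrogate range \ud800 - \udbff (the upper part, \udb80 - \udbff, is reserved for private-use planes), and it uses `DataOutputStream.writeUTF`, which writes the same modified UTF-8 form used in class files, as a convenient way to observe the 6-byte encoding. Later JDKs (5 and up) added `Character.toCodePoint`, which does the same pair-to-code-point computation.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of surrogate pairs and their two UTF-8 encodings. */
final class SurrogateDemo {

    /** Combines a high and a low surrogate into the code point they represent together. */
    static int toCodePoint(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a high/low surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    /** Length in bytes of s in modified UTF-8, excluding writeUTF's 2-byte length prefix. */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size() - 2;
    }

    public static void main(String[] args) {
        char high = (char) 0xD820;
        char low = (char) 0xDC00;
        String pair = new String(new char[] {high, low});
        // Code point formed by the pair, in hex.
        System.out.println(Integer.toHexString(toCodePoint(high, low)));
        // Standard UTF-8 encodes the pair as one 4-byte sequence.
        System.out.println(pair.getBytes(StandardCharsets.UTF_8).length);
        // Modified UTF-8 encodes each surrogate separately, 3 bytes each.
        System.out.println(modifiedUtf8Length(pair));
    }
}
```

Neither half of the pair carries a category on its own, which is why a `char`-based API cannot classify \ud820 in isolation.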
Re: generation of gnu/java/locale/*.uni
Eric Blake wrote:
> I'm looking at updating java.lang.Character to JDK 1.4 specs (and
> Unicode 3.0). Does anyone know what programs generated
> gnu/java/locale/*.uni? I ask because I need to add more information to
> each character to cover the new DIRECTIONALITY category designations.

I know that I say this every time the Character issue comes up - but
aren't we going to use the alternative Character posted here some time
ago (2 years?)? It had a LOT better performance (no object creation
during a check, and the check itself is quite a bit faster), and it
encoded all data in Strings (so there was no need to play tricks with
loaders or depend on a gnu/something class). The only problem I had was
that I was unable to compile it with old jikes due to some funny unicode.

Even if it might be out-of-date and non-compilable today, it might be
nice to work with it? Or otherwise tell me that it WON'T be used
ever, so I can stop raising this subject every time Character comes
up :)

Just in case:
http://www.informatik.uni-oldenburg.de/~delwi/classpath/

Artur
Re: generation of gnu/java/locale/*.uni
Eric Blake <[EMAIL PROTECTED]> writes:

> Brian Jones wrote:
> >
> > doc/unicode/unicode-muncher.pl
>
> Thanks. Should I try to update Classpath to Unicode 3.2.0 (currently in
> beta, expected to be final next month), 3.1.1 (the current stable
> state), or 3.0 (the version mentioned in the 1.4 javadoc of
> java.lang.Character)?
>
> JLS 3.1 states "The Java platform will track the Unicode specification
> as it evolves. The precise version of Unicode used by a given release is
> specified in the documentation of the class Character." I think the
> choice is between documenting that we are using the latest definition
> 3.2.0, or else sticking with 3.0.0 to match the behavior of Sun, but
> want to know what others think.

As I recall Unicode now requires more bits than a Java 'char' allows.
I don't know that helps at all? I don't really know what Sun's
solution is. It looks like we did update to unicode data 3.0, but I
know our implementation fails many Mauve tests related to Character.

Brian
-- 
Brian Jones <[EMAIL PROTECTED]>
Re: generation of gnu/java/locale/*.uni
Brian Jones wrote:
> doc/unicode/unicode-muncher.pl

Thanks. Should I try to update Classpath to Unicode 3.2.0 (currently in
beta, expected to be final next month), 3.1.1 (the current stable
state), or 3.0 (the version mentioned in the 1.4 javadoc of
java.lang.Character)?

JLS 3.1 states "The Java platform will track the Unicode specification
as it evolves. The precise version of Unicode used by a given release is
specified in the documentation of the class Character." I think the
choice is between documenting that we are using the latest definition
3.2.0, or else sticking with 3.0.0 to match the behavior of Sun, but
want to know what others think.

-- 
This signature intentionally left boring.
Eric Blake [EMAIL PROTECTED]
BYU student, free software programmer
Re: generation of gnu/java/locale/*.uni
Eric Blake <[EMAIL PROTECTED]> writes:

> I'm looking at updating java.lang.Character to JDK 1.4 specs (and
> Unicode 3.0). Does anyone know what programs generated
> gnu/java/locale/*.uni? I ask because I need to add more information to
> each character to cover the new DIRECTIONALITY category designations.
>
> If the generation program is not currently in the distribution, I'll
> write up a quick Java program and add it to Classpath.

doc/unicode/unicode-muncher.pl

Brian
-- 
Brian Jones <[EMAIL PROTECTED]>
generation of gnu/java/locale/*.uni
I'm looking at updating java.lang.Character to JDK 1.4 specs (and
Unicode 3.0). Does anyone know what programs generated
gnu/java/locale/*.uni? I ask because I need to add more information to
each character to cover the new DIRECTIONALITY category designations.

If the generation program is not currently in the distribution, I'll
write up a quick Java program and add it to Classpath.

-- 
This signature intentionally left boring.
Eric Blake [EMAIL PROTECTED]
BYU student, free software programmer