Re: [kaffe] Slow byte to char conversion

2000-08-30 Thread Dalibor Topic


On Tue, 29 Aug 2000, Dalibor Topic wrote:

> > Is it not fair to assume that converting n bytes will result in less than
> > or equal to n characters?
> 
> For most of encodings that I've seen, it is a safe assumption.
> Unfortunately, I haven't seen 'em all :) 
> 
> I'm suspicious that it's
> possible to have a byte encode several characters.

I dug around on Unicode.org today to see if I could find some interesting
mappings from native character sets to Unicode that violate that
assumption. I found the Devanagari and Farsi encodings from Apple.

Here is an example from MacFarsi, the character set used to encode
Persian. It's online at:
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/FARSI.TXT

#   For example, the mapping of 0x2B is given as +0x002B; the
#   mapping of 0xAB is given as +0x002B. If we map an isolated
#   instance of 0x2B to Unicode, it should be mapped as follows (LRO
#   indicates LEFT-RIGHT OVERRIDE, PDF indicates POP DIRECTION
#   FORMATTING):
#
# 0x2B ->  0x202D (LRO) + 0x002B (PLUS SIGN) + 0x202C (PDF)


So, in this case, a single Mac OS Farsi code point results in three
Unicode characters. It can actually get even worse:

#   In the TrueType variant of Mac OS Farsi, 0xA4 is a ligature for the
#   currency unit "rial". This is mapped using the grouping hint followed
#   by the Arabic characters for "rial"
#   
# (TrueType variant) 0xA4 -> 0xF86B+0x0631+0x064A+0x0627+0x0644

Here a single code point is encoded by five (5) Unicode characters.
The grouping hint seems to be a vendor-specific extension from Apple,
though, so without it that's still 4:1.

Sun doesn't seem to have included any Farsi or Devanagari conversion
mechanisms, so kaffe doesn't really have to support such exotic
encodings. But ... it may one day. So I'd recommend staying on the safe
side.
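
To make the risk concrete, here is a rough sketch (a hypothetical method,
not kaffe's ByteToCharConverter API) that applies the 0x2B mapping quoted
above. The output can grow to three times the input length, so a char[len]
buffer is not safe, while the append/flush loop in the current decodeBytes
still is:

    // Illustration only: a hypothetical MacFarsi-style mapping where one
    // byte expands to three Unicode chars, as in the Apple table quoted above.
    static char[] decodeLikeMacFarsi(byte[] in) {
        StringBuffer sbuf = new StringBuffer(in.length);
        for (int i = 0; i < in.length; i++) {
            if ((in[i] & 0xFF) == 0x2B) {
                // 0x2B -> LRO + PLUS SIGN + PDF, a 1:3 expansion
                sbuf.append('\u202D').append('+').append('\u202C');
            } else {
                sbuf.append((char) (in[i] & 0xFF));  // pretend everything else is 1:1
            }
        }
        char[] out = new char[sbuf.length()];        // may be up to 3 * in.length
        sbuf.getChars(0, sbuf.length(), out, 0);
        return out;
    }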

Dali






Re: [kaffe] Slow byte to char conversion

2000-08-29 Thread Dalibor Topic


Hi Godmar,

On Tue, 29 Aug 2000, Godmar Back wrote:
> I was looking at this function in String.java:
> 
> 
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>         int len, ByteToCharConverter encoding) {
>     StringBuffer sbuf = new StringBuffer(len);
>     char[] out = new char[512];
>     int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>     while (outlen > 0) {
>         sbuf.append(out, 0, outlen);
>         outlen = encoding.flush(out, 0, out.length);
>     }
>     return sbuf;
> }
> 
> 
> Why can't this function be rewritten to read:
> 
> 
> private static StringBuffer decodeBytes(byte[] bytes, int offset,
>         int len, ByteToCharConverter encoding) {
>     char[] out = new char[len];
>     int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
>     return new StringBuffer(outlen).append(out, 0, outlen);
> }
> 
> 
> Is it not fair to assume that converting n bytes will result in less than
> or equal to n characters?

For most of the encodings that I've seen, it is a safe assumption.
Unfortunately, I haven't seen 'em all :) 

I'm suspicious that it's possible for a single byte to encode several
characters. And here is why: Unicode supports "combining" characters.
These characters are used to modify other characters. For example, you
can add accents to normal characters. Since Unicode is designed to
allow easy conversion to/from existing character sets, it includes many
precomposed characters, like the German umlauts ä, ö, ü. You'd still
need combining characters to fully represent some scripts, like Thai.
Markus Kuhn says in his "UTF-8 and Unicode FAQ for Unix/Linux" [1]:
"with the Thai script, up to two combining characters are needed on a
single base character."
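
You can see the two spellings from plain Java string literals (nothing
converter-specific here): the precomposed umlaut and its combining-character
form denote the same text but differ in length:

    public class CombiningDemo {
        public static void main(String[] args) {
            String precomposed = "\u00C4";        // LATIN CAPITAL LETTER A WITH DIAERESIS
            String combining   = "\u0041\u0308";  // 'A' followed by COMBINING DIAERESIS
            System.out.println(precomposed.length());          // prints 1
            System.out.println(combining.length());            // prints 2
            System.out.println(precomposed.equals(combining)); // prints false
        }
    }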

In his article on "Forms of Unicode" [2], Mark Davis addresses some of
the myths about characters vs. code points vs. code units. It features
a table with some unexpected examples. There is a separate code point
for the fi ligature, for example [3]. Some Arabic characters' Unicode
representation depends on the context. Some characters require several
Unicode characters to be represented properly: "The Devanagari syllable
ksha is represented by three code points."

I haven't seen an encoding for Devanagari, so I don't know whether the
encoding for "ksha" would be less than three bytes. While doing
research for this post today, I've seen other encodings, collected by
Mark Leisher as a supplement to the official Unicode conversion tables,
and some of them, like I3342, encode a single byte into several
characters [4]. I don't think any of these encodings is supported by
Sun's JDK 1.3, though.

To sum it up: I'm not convinced. I guess taking a look at GNU libc's
iconv functionality would provide some more insight, but I don't have
the sources around right now. The GNU libc folks have done a massive
job supporting a variety of encodings, so this might be another
direction to look for advice.

Read ya,

Dali

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[2] ftp://www6.software.ibm.com/software/developer/library/utfencodingforms.pdf
[3] \uFB01 according to Unicode-Data-3.0.txt
[4] 0xA4 -> 0x0631 0x064A 0x0627 0x0644  for PERSIAN RIAL SIGN






Re: [kaffe] Slow byte to char conversion

2000-08-29 Thread Dalibor Topic


On Mon, 28 Aug 2000, Artur Biesiadowski wrote:
> Godmar Back wrote:

> I've looked at this and I don't see a reason for CharToByteConverter to
> go through encode/flush steps - it would work perfectly all right with
> a single-step method, returning a new byte[] for example. For

I think there are two possible issues:
a) Unicode characters followed by combining characters
\u0041\u0308 is actually just another way to encode 'Ä' (\u00C4). You
get the same reason for saving state as with multibyte encodings: you
don't know for sure what you've read unless you've read the last bit of
it.

b) Performance
With multibyte encodings, it can be hard to determine the size of the
byte array in advance. So you'd have to do the encoding into a
temporary byte array, then create a new one of the right size and copy
the bytes into it before you return it. If all the caller then does is
copy those bytes again into the appropriate positions in its own byte
array, you'd be doing a lot of useless work. It might be interesting to
see how char to byte conversion is used in kaffe.

Having flush functionality allows you to stop encoding when you run
out of space in the byte array. You can save the unencoded rest in the
encoder and throw an exception/continue with unencoded characters next
time your conversion routine is called.

Unfortunately, unless you can guarantee that you'd never run out of
space in the byte buffer (which you can't with the current interface),
every stateless converter becomes stateful in the sense that it needs
to carry around unconverted remainders waiting to be flushed. I'm
starting to realize that there are some (undocumented, of course)
pitfalls in the current design of the converters which are harder to
get around than I thought. Unless of course ...

> ByteToCharConverter things are a bit different, as streams can stop
> inside a multibyte encoding. It could be worked around by changing the
> interface a bit and allowing the converter to return the number of
> rejected bytes, which would have to be fed to it again on the next
> call. This moves the need to remember state to OutputStreamWriter, and
> that is OK as it will synchronize itself.

Sun's "undocumented" [1] sun.io.CharToByteConverter supports something
like that: you can get the index just past the last converted
character, and by comparing it with the supplied arguments, figure out
that not everything got converted. A similar interface exists for
sun.io.ByteToCharConverter.

I think your idea to delegate responsibility for state management to
synchronized methods in calling objects could be an elegant way to make
converters stateless.
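
For what it's worth, such an interface might look roughly like this. This
is only an illustrative sketch; it is neither sun.io's API nor kaffe's
current ByteToCharConverter:

    // Sketch of a stateless byte-to-char interface: the converter keeps
    // no fields, the caller keeps the unconverted tail.
    public interface StatelessByteToChar {
        /**
         * Converts from[fpos..fpos+flen) into to[tpos..tpos+tlen) and returns
         * the number of chars written.  rejected[0] is set to the number of
         * trailing input bytes that were not consumed (an incomplete multibyte
         * sequence, or no more room in 'to'); the caller must pass those bytes
         * in again at the start of the next call.
         */
        int convert(byte[] from, int fpos, int flen,
                    char[] to, int tpos, int tlen, int[] rejected);
    }

The calling reader would then copy the rejected tail to the front of its
buffer inside its own synchronized method before the next call, which is
exactly the division of labour suggested above.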

Dalibor Topic

[1] As Sun doesn't document its sun.* packages, there is no API
spec to work from [2].

But there is a document on Sun's website which describes
the internals of character set conversion for JDK 1.1. It's marked as
deprecated but contains a nice description of some implementation
details: http://java.sun.com/products/jdk/1.1/intl/html/intlspec.doc7.html

DIGITAL's JDK 1.1.3 includes documentation for these "undocumented" I/O
classes. It's online at 
http://infoshako.sk.tsukuba.ac.jp/InfoRes/jdoc/Languages/Java/digital-java/api/sun.io.CharToByteConverter.html

[2] But there is a working group with Doug Lea, people from IBM, Sun
and some other companies trying to define some new I/O APIs for the
next Java release. It has just started within the Java Community
Process. They plan to specify an API for character set conversion.
That's good news, I guess.






Re: [kaffe] Slow byte to char conversion

2000-08-28 Thread Godmar Back



Dali,

I was looking at this function in String.java:


private static StringBuffer decodeBytes(byte[] bytes, int offset,
        int len, ByteToCharConverter encoding) {
    StringBuffer sbuf = new StringBuffer(len);
    char[] out = new char[512];
    int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
    while (outlen > 0) {
        sbuf.append(out, 0, outlen);
        outlen = encoding.flush(out, 0, out.length);
    }
    return sbuf;
}


Why can't this function be rewritten to read:


private static StringBuffer decodeBytes(byte[] bytes, int offset,
        int len, ByteToCharConverter encoding) {
    char[] out = new char[len];
    int outlen = encoding.convert(bytes, offset, len, out, 0, out.length);
    return new StringBuffer(outlen).append(out, 0, outlen);
}


Is it not fair to assume that converting n bytes will result in less than
or equal to n characters?

- Godmar

> 
> 
> On Mon, 28 Aug 2000, Artur Biesiadowski wrote:
> 
> > And why exactly default converter could not be cached and same instance
> > used for all conversions ? I think it is stateless class, so it should
> > be safe to enter same object method from various threads with all state
> > on stack.
> 
> It depends on the encoding. Let's say you have a multibyte encoding,
> where several bytes encode a single character, like UTF-8 [1]. You
> can't guarantee that all the byte arrays that you want to encode into
> char arrays terminate on character boundaries. So you need to be
> able to save the state of your converter and pick up at the position
> where you left next time your converter is called.
> 
> Imagine that you're reading in a UTF-8 encoded file, and get an
> IOException while you're reading it. You convert as much as you've
> read, but you can't decide on the last character, since your stream has
> been interrupted. The UTF-8 converter saves its state, and waits for
> bytes to convert to characters.
> 
> Now, imagine another thread tries to do some UTF-8 input
> conversion, too. If it used the first converter, it would get a
> corrupted result, since the first converter is still waiting for bytes
> to continue converting. So you have to use a fresh UTF-8 converter for
> that.
> 
> You could say: "So? Kaffe uses ISO-Latin-1 as default encoding. That's
> stateless.". But unfortunately the default encoding comes from the
> file.encoding system property, which can be changed by the user [2].
> Don't rely on the default encoding being ISO-Latin-1.
> 
> Kaffe does some sort of caching already, but it instantiates
> a new converter every time one is needed, which is not necessary for
> stateless converters, as you've pointed out.
> 
> [1] If you have a Linux installation around, take a look at
> /usr/share/i18n/charmaps/UTF8. It might have a slightly different name
> on your installation, though, since character encodings usually have
> several aliases. 
> 
> [2] Well, sort of. While Java 2 allows system properties to be set,
> kaffe has not caught up with that yet, as far as I know. So the only
> way I know of to change the default encoding is to modify it in
> libraries/clib/native/System.c and recompile kaffe.
> 
> 
> __
> Do You Yahoo!?
> Talk to your friends online with Yahoo! Messenger.
> http://im.yahoo.com
> 




Re: [kaffe] Slow byte to char conversion

2000-08-28 Thread Dalibor Topic


On Mon, 28 Aug 2000, Artur Biesiadowski wrote:

> And why exactly default converter could not be cached and same instance
> used for all conversions ? I think it is stateless class, so it should
> be safe to enter same object method from various threads with all state
> on stack.

It depends on the encoding. Let's say you have a multibyte encoding,
where several bytes encode a single character, like UTF-8 [1]. You
can't guarantee that all the byte arrays that you want to encode into
char arrays terminate on character boundaries. So you need to be
able to save the state of your converter and pick up at the position
where you left off the next time your converter is called.

Imagine that you're reading in a UTF-8 encoded file, and get an
IOException while you're reading it. You convert as much as you've
read, but you can't decide on the last character, since your stream has
been interrupted. The UTF-8 converter saves its state, and waits for
bytes to convert to characters.
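
To illustrate what that saved state amounts to, here is a toy sketch that
handles only one- and two-byte sequences and does no error checking; it is
not kaffe's actual UTF-8 converter. The only thing that has to survive
between calls is the dangling lead byte:

    class TinyUTF8Decoder {
        private int pendingLead = -1;  // saved state: a lead byte still waiting for its trail byte

        int convert(byte[] from, int fpos, int flen, char[] to, int tpos) {
            int o = tpos;
            for (int i = fpos; i < fpos + flen; i++) {
                int b = from[i] & 0xFF;
                if (pendingLead >= 0) {          // second half of a split sequence
                    to[o++] = (char) (((pendingLead & 0x1F) << 6) | (b & 0x3F));
                    pendingLead = -1;
                } else if (b < 0x80) {           // plain ASCII byte
                    to[o++] = (char) b;
                } else {                         // lead byte of a two-byte sequence;
                    pendingLead = b;             // it may have to wait for the next call
                }
            }
            return o - tpos;                     // number of chars produced
        }
    }

If the stream stops right after the lead byte 0xC3 of 'Ä' (0xC3 0x84 in
UTF-8), the first call produces nothing for it, and the next call, starting
with 0x84, completes the character. Another thread calling convert() in
between would trash exactly that pending state.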

Now, imagine another thread tries to do some UTF-8 input
conversion, too. If it used the first converter, it would get a
corrupted result, since the first converter is still waiting for bytes
to continue converting. So you have to use a fresh UTF-8 converter for
that.

You could say: "So? Kaffe uses ISO-Latin-1 as the default encoding. That's
stateless." But unfortunately the default encoding comes from the
file.encoding system property, which can be changed by the user [2].
Don't rely on the default encoding being ISO-Latin-1.

Kaffe does some sort of caching already, but it instantiates
a new converter every time one is needed, which is not necessary for
stateless converters, as you've pointed out.
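
If we ever want to cache stateful converters as well, one option would be a
per-thread cache, sketched below. ConverterCache and getDefault() are
made-up names, and it assumes a working java.lang.ThreadLocal (a JDK 1.2
class that kaffe's class library may or may not provide):

    class ConverterCache {
        // Each thread gets and reuses its own converter instance, so a
        // converter that is still waiting for trailing bytes is never
        // shared between threads.
        private static final ThreadLocal perThread = new ThreadLocal() {
            protected Object initialValue() {
                return ByteToCharConverter.getDefault();  // hypothetical factory method
            }
        };

        static ByteToCharConverter get() {
            return (ByteToCharConverter) perThread.get();
        }
    }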

[1] If you have a Linux installation around, take a look at
/usr/share/i18n/charmaps/UTF8. It might have a slightly different name
on your installation, though, since character encodings usually have
several aliases. 

[2] Well, sort of. While Java 2 allows system properties to be set,
kaffe has not caught up with that yet, as far as I know. So the only
way I know of to change the default encoding is to modify it in
libraries/clib/native/System.c and recompile kaffe.






Re: [kaffe] Slow byte to char conversion

2000-08-28 Thread Artur Biesiadowski


Godmar Back wrote:


> It is not stateless; it keeps track of not converted characters/bytes
> if there are any left.  See the carry/flush methods.
> 
> The converter only converts 512 bytes at a time (see String.decodeBytes).
> Now just why the converter does that, I don't know.  It's not immediately
> apparent what the motivation for that is, if there is any.
> Plus, I don't understand why the encoder doesn't directly convert into
> the StringBuffer that is going to be returned by decodeBytes.
> It all seems rather strange, and as always there's no comments in the code.

I've looked at this and I don't see a reason for CharToByteConverter to
go through encode/flush steps - it would work perfectly all right with
a single-step method, returning a new byte[] for example. For
ByteToCharConverter things are a bit different, as streams can stop
inside a multibyte encoding. It could be worked around by changing the
interface a bit and allowing the converter to return the number of
rejected bytes, which would have to be fed to it again on the next
call. This moves the need to remember state to OutputStreamWriter, and
that is OK as it will synchronize itself.

I'm going to look at how it was solved in Classpath (I suppose they
still have an almost two-year-old bug with static variables, but that's
not important :); maybe some ideas could be scavenged.

Artur



Re: [kaffe] Slow byte to char conversion

2000-08-28 Thread Godmar Back


> 
> 
> Godmar Back wrote:
> 
> [...]  Every call results in a new
> > converter object being newinstanced, just to convert a bunch of bytes.
> > (The new converter was one of the changes done to make the
> > charset conversion thread-safe.) 
> [...]
> 
> And why exactly default converter could not be cached and same instance
> used for all conversions ? I think it is stateless class, so it should
> be safe to enter same object method from various threads with all state
> on stack.
> 

It is not stateless; it keeps track of unconverted characters/bytes
if there are any left.  See the carry/flush methods.

The converter only converts 512 bytes at a time (see String.decodeBytes).
Now just why the converter does that, I don't know.  It's not immediately
apparent what the motivation for that is, if there is any.
Plus, I don't understand why the encoder doesn't directly convert into
the StringBuffer that is going to be returned by decodeBytes.
It all seems rather strange, and as always there are no comments in the code.

- Godmar




Re: [kaffe] Slow byte to char conversion

2000-08-28 Thread Dalibor Topic


Hi Godmar,

sorry for the delay, but I was on holiday last week, away from my
mail.

On Sat, 19 Aug 2000, you wrote:
> From what I understand, and someone correct me if I'm wrong,
> there shouldn't be any reason not to include the change you suggest -
> if someone implements it, of course.

Done. I have a patched version of Encode.java. I'll clean it up once a
definite solution has stabilized.

> If I understand your proposal right, you'd use an array for
> the first 256 values and a hashtable or something like that 
> for the rest.  I don't think there would be a problem with changing 
> it so that it would both serialize an array and a hashtable.
> One or two objects in *.ser shouldn't make a difference. 

Yes. It should work nicely for ISO-8859 based encodings, and then
some.

Actually, for byte to char conversion you don't even need a hash table,
since all the ISO-8859-X encodings (simply speaking) assign Unicode
chars to byte values in the range 0-255.

For the reverse direction (char to byte conversion) I'd need to do some
experiments to figure out a better way. In most char to byte encodings
there is no single range from character x to character y that all
characters map from, so the array based approach is space-inefficient.
A combination of arrays and hash maps might be interesting. But for the
time being I'm playing around with java.io.InputStreamReader, so I'm
trying to fix byte to char conversion first.

> You could even stick a flag at the beginning if the array shouldn't
> pay off for some encodings.

I'd prefer a more class hierarchy based approach. We already have
kaffe.io.ByteToCharHashBased. We could have ByteToCharArrayBased, too.
Something like this (warning: untested code ahead):

abstract public class ByteToCharArrayBased extends ByteToCharConverter {

    // map is used to map characters from bytes to chars. A byte
    // code b is mapped to character map[b & 0xFF].
    private final char[] map;

    public ByteToCharArrayBased(char[] chars) {
        map = chars;
    }

    public final int convert(byte[] from, int fpos, int flen,
            char[] to, int tpos, int tlen) {
        // Since it's a one to one encoding, assume that flen == tlen.
        for (int i = flen; i > 0; i--) {
            to[tpos++] = convert(from[fpos++]);
        }
        return flen;
    }

    public final char convert(byte b) {
        return map[b & 0xFF];
    }

    public final int getNumberOfChars(byte[] from, int fpos, int flen) {
        return flen;
    }
}
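
A concrete encoding class would then just hand its table to the
constructor, assuming ByteToCharConverter itself demands nothing further.
For example (a hypothetical subclass; a real one would fill in all the
non-identity entries from the unicode.org mapping table):

    public class ByteToChar8859_2Array extends ByteToCharArrayBased {
        private static final char[] MAP = new char[256];
        static {
            for (int i = 0; i < 256; i++) {
                MAP[i] = (char) i;      // start from the identity mapping ...
            }
            MAP[0xA1] = '\u0104';       // ... then patch the slots that differ,
            MAP[0xB1] = '\u0105';       // e.g. A and a WITH OGONEK in 8859_2
            // ... remaining non-identity entries omitted here
        }

        public ByteToChar8859_2Array() {
            super(MAP);
        }
    }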

Now a (byte to char) conversion class has three choices:
a) it uses all byte values from 0-255 -> it extends
ByteToCharArrayBased, and makes the constructor use the
appropriate char array.
b) the encoded byte values are sparsely distributed through the range
of all legal byte values -> it extends ByteToCharHashBased due to
its space efficiency.
c) there is a huge block of bytes used in the encoding, but there are
also many bytes outside that block's range used in the encoding -> it
extends ByteToCharConverter and uses fields for ArrayBased as well as
HashBased conversion. The convert method checks whether a byte is
within the block and uses the array, or uses the hash table otherwise.

From my experience (converting the ISO-8859-X encodings from hash
based to array based), for byte to char conversion option (a) takes
little memory (256 chars for the table) and is very fast. As I
explained in my previous post, it beats option (b) in time-efficiency.
I suppose it beats it in space-efficiency as well, as long as most
bytes are convertible into characters.

When there are only a few legal byte values that can be encoded into
characters, the hash based conversion could be more space-efficient. On
the other hand, the array based implementation doesn't waste much
memory even then. In the worst case, a fictitious encoding that uses
only a single byte value to encode some character, there are
255 * 2 = 510 bytes wasted. That's not much, and it can be improved
upon by introducing range checks and similar techniques.

The choices really start to matter when you're going the other way
round, from chars to bytes. Take a look at ISO-8859-8 (a.k.a. Hebrew).
It encodes 220 characters. Of these 220, only 32 do *not* map a byte
value to itself. They are either mappings into the range between
\u05D0 and \u05EA, mappings within the first 256 characters, or
mappings to a few special characters like LEFT-TO-RIGHT MARK.
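
So a char to byte converter for it could get away with a range check for
the Hebrew letter block plus a handful of special cases, along the lines of
the array/hash combination mentioned above. A rough, untested sketch (not
kaffe's CharToByteConverter interface, and the exception handling is
abbreviated):

    class CharToByteHebrewSketch {
        byte encode(char c) {
            if (c >= '\u05D0' && c <= '\u05EA') {
                return (byte) (0xE0 + (c - '\u05D0'));  // the Hebrew letter block
            } else if (c < 0x100) {
                return (byte) c;    // most low positions map to themselves (exceptions omitted)
            } else if (c == '\u200E') {
                return (byte) 0xFD; // LEFT-TO-RIGHT MARK
            } else {
                return (byte) '?';  // other special cases and unmappable chars omitted
            }
        }
    }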

> One would have to see what the actual sizes of the .ser files would be;
> keeping those small is certainly desirable.  From what I understand,
> they're more compact than any Java code representation.
> Edouard would know more since he wrote that code, I think.
>
>
> On a related note, this whole conversion thing stinks.
> Why can't people stick to 7-bit ASCII?
> For instance, the

Re: [kaffe] Slow byte to char conversion

2000-08-28 Thread Artur Biesiadowski


Godmar Back wrote:

[...]  Every call results in a new
> converter object being newinstanced, just to convert a bunch of bytes.
> (The new converter was one of the changes done to make the
> charset conversion thread-safe.) 
[...]

And why exactly could the default converter not be cached and the same
instance used for all conversions? I think it is a stateless class, so
it should be safe to enter the same object's methods from various
threads with all state on the stack.

Artur



Re: [kaffe] Slow byte to char conversion

2000-08-18 Thread Godmar Back



From what I understand, and someone correct me if I'm wrong,
there shouldn't be any reason not to include the change you suggest -
if someone implements it, of course.

If I understand your proposal right, you'd use an array for
the first 256 values and a hashtable or something like that 
for the rest.  I don't think there would be a problem with changing 
it so that it would both serialize an array and a hashtable.
One or two objects in *.ser shouldn't make a difference. 
You could even stick a flag at the beginning if the array shouldn't
pay off for some encodings.
One would have to see what the actual sizes of the .ser files would be;
keeping those small is certainly desirable.  From what I understand,
they're more compact than any Java code representation.
Edouard would know more since he wrote that code, I think.


On a related note, this whole conversion thing stinks.
Why can't people stick to 7-bit ASCII?
For instance, the JVM98 jack benchmark calls PrintStream.print
a whopping 296218 times in a single run.  Every call results in a new 
converter object being newinstanced, just to convert a bunch of bytes. 
(The new converter was one of the changes done to make the
charset conversion thread-safe.)  This is one of the reasons
why we're some 7 or 8 times slower than IBM on this test.
And that's not even using any of the serialized converters, just 
the default one (which is written in JNI).

- Godmar

> 
> 
> Hi,
> 
> I wrote a simple program to show a Java charmap (
> something like Encode.java in developers directory).
> It essentially creates a byte array with size 1, and
> creates a string with the appropriate Unicode char
> using the encoding in question for every value a byte
> can take.
> 
> When displaying a serialized converter like 8859_2,
> the performance is very bad. Comparing current kaffe
> from CVS running on SuSE Linux 6.4 with jit3 and IBM's
> JRE 1.3 running in interpreted mode, kaffe is about 10
> times slower.
> 
> While I consider the idea to use serialized encoders
> based on hashtables a great one, it is very
> inefficient for ISO-8859-X and similar byte to char
> encodings. These encodings use most of the 256
> possible values a byte can take to encode characters,
> so I tried using an array instead. I achieved
> comparable running times to JRE 1.3.
> 
> Why was the hashtable based conversion chosen over
> alternatives (switch based lookup, array based
> lookup)?
> 
> Dali
> 
> =
> "Success means never having to wear a suit"
> 
> __
> Do You Yahoo!?
> Send instant messages & get email alerts with Yahoo! Messenger.
> http://im.yahoo.com/
> 




[kaffe] Slow byte to char conversion

2000-08-16 Thread Dalibor Topic


Hi,

I wrote a simple program to show a Java charmap (something like
Encode.java in the developers directory). It essentially creates a
byte array of size 1 and, for every value a byte can take, creates a
string with the appropriate Unicode char using the encoding in
question.
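
In case it helps, the core of it looks roughly like this (a sketch from
memory, not the exact program):

    public class ShowCharmap {
        public static void main(String[] args) throws java.io.UnsupportedEncodingException {
            String enc = args.length > 0 ? args[0] : "8859_2";
            byte[] b = new byte[1];
            for (int i = 0; i < 256; i++) {
                b[0] = (byte) i;
                // each call goes through String.decodeBytes and the converter
                String s = new String(b, enc);
                System.out.println(Integer.toHexString(i) + " -> \\u"
                    + Integer.toHexString(s.length() > 0 ? s.charAt(0) : '?'));
            }
        }
    }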

When displaying a serialized converter like 8859_2,
the performance is very bad. Comparing current kaffe
from CVS running on SuSE Linux 6.4 with jit3 and IBM's
JRE 1.3 running in interpreted mode, kaffe is about 10
times slower.

While I consider the idea to use serialized encoders
based on hashtables a great one, it is very
inefficient for ISO-8859-X and similar byte to char
encodings. These encodings use most of the 256
possible values a byte can take to encode characters,
so I tried using an array instead. I achieved
comparable running times to JRE 1.3.

Why was the hashtable based conversion chosen over
alternatives (switch based lookup, array based
lookup)?

Dali

=
"Success means never having to wear a suit"
