Re: The use of UTIUtil.toUsingCharset?

2003-02-21 Thread Ortwin Glück
Eric E Johnson wrote:

As for why these functions exist, I keep thinking along these lines - 
imagine you want to encode foreign language characters in a URL.  The 
way to do it is to convert your string into bytes, and then URL encode 
the bytes as if it were ASCII.  Reversing the process, take your URL, 
decode it into ASCII, treat each character as a byte, and then convert 
those bytes back via the expected encoding.  So you can imagine that the 
first step would be precisely what these routines do - a conversion of a 
String into byte encoding XXX, and then back into a String in encoding 
YYY, where YYY almost certainly is ASCII.  Having done that, you can use 
all your functions that URL encode a String instead of writing an 
additional function that takes bytes.  Unfortunately, if the encoding 
YYY has any characters outside the 0-255 range, you'd be hosed, and the 
documentation doesn't say that.


This is correct modulo all the phrases with the word ASCII in them. It's 
just about a sequence of bytes and has nothing to do with ASCII (which 
is 7-bit only, by the way).
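
As an illustration of the byte-oriented view, here is a minimal sketch of percent-encoding that goes String -> bytes in a chosen charset -> escaped text, and back. The class and method names are hypothetical, it assumes a JDK recent enough to have java.nio.charset.StandardCharsets (newer than the JDKs discussed in this thread), and it leaves only the RFC 3986 unreserved characters unescaped. It does not validate malformed escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
class PercentEncodingSketch {
    // Characters left unescaped in this sketch (the RFC 3986 "unreserved" set).
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    /** Converts text to bytes in the given charset, then escapes every byte that is not unreserved. */
    static String encode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF; // treat the byte as an unsigned value 0..255
            if (UNRESERVED.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    /** Collects the raw bytes first and only then interprets them with the charset. */
    static String decode(String escaped, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%') {
                bytes.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits
            } else {
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String original = "Gl\u00fcck"; // u with umlaut written as a Unicode escape
        String utf8 = encode(original, StandardCharsets.UTF_8);
        String latin1 = encode(original, StandardCharsets.ISO_8859_1);
        System.out.println(utf8);   // Gl%C3%BCck
        System.out.println(latin1); // Gl%FCck
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

The same string produces different escaped forms depending on the charset, which is why the charset has to travel with the bytes and no intermediate "ASCII String" is needed.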


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: The use of UTIUtil.toUsingCharset?

2003-02-21 Thread Eric E Johnson
Ortwin Glück wrote:


This is correct modulo all the phrases with the word ASCII in them. It's 
just about a sequence of bytes and has nothing to do with ASCII (which 
is 7-bit only, by the way).

Yes, of course.  I'm not very good with the names of my encodings, just 
some of the issues surrounding them.  I was merely trying to come up 
with a plausible explanation as to why the functions exist in the first 
place, not an explanation for how they could possibly be considered 
correct, which I don't think they can be.

-Elric





Re: The use of UTIUtil.toUsingCharset?

2003-02-20 Thread Laura Werner
Oleg,


I can't say I completely agree with your point (or understand it), but so be it. 
 

Feel free to ask for clarification.

Basically I was trying (in my wordy way) to say that toUsingCharset 
seems to do two things:

- Convert the Unicode string to an array of bytes using the converter 
for fromCharset
- Convert the bytes back to Unicode using the converter for toCharset.

This makes no sense to me.  When you're doing character-set-aware 
programming and have an array of bytes, you always need to keep a 
(byte[], charset name) pair, so you know what the bytes *mean*.  The 
bytes by themselves are just a bit stream; the character set name tells 
you how to interpret the bits into abstract characters that mean 
something to a human.  toUsingCharset is converting the Unicode string 
to a bit stream using one mechanism, then converting back to Unicode 
using another mechanism.  I don't know how this could ever do anything 
useful.
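
A minimal sketch of the two steps listed above, for anyone who wants to see the effect. This is a reconstruction from the description in this message, not the actual HttpClient source; the class name and the Charset-typed parameters are assumptions, and StandardCharsets assumes a newer JDK than the ones discussed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical reconstruction of the described behaviour, not the real method.
class ToUsingCharsetSketch {
    static String toUsingCharset(String target, Charset fromCharset, Charset toCharset) {
        byte[] raw = target.getBytes(fromCharset); // step 1: Unicode -> bytes
        return new String(raw, toCharset);         // step 2: bytes -> Unicode under different rules
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with acute accent
        String result = toUsingCharset(original,
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // The UTF-8 bytes C3 A9 are reinterpreted as two ISO-8859-1 characters.
        System.out.println(result.equals("\u00c3\u00a9")); // true
        System.out.println(result.equals(original));       // false
    }
}
```

When fromCharset and toCharset are the same and the charset can represent every character in the string, the call is a no-op; in the other cases the result depends on how the two byte interpretations happen to differ.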

Had not Sung-Gu refused to provide a simple unit test case for this method, this discussion would have been put to an end a few months ago. But apparently writing test cases is for losers.
 

How about if we just deprecate the @#% thing and the two URIUtil methods 
that call it?

-- Laura




RE: The use of UTIUtil.toUsingCharset?

2003-02-20 Thread Adrian Sutton
How about if we just deprecate the @#% thing and the two URIUtil methods 
that call it?

For what it's worth, Laura and Oleg, you are completely correct.  The
toUsingCharset method is 100% guaranteed to screw up characters, the only
question is which characters.

I would deprecate the code and possibly even change it so that it just
returns the original String.  That way at least it would not corrupt any
characters.  (I'm assuming Jandalf won't let us just rip it out altogether
at this point, because that would be my preferred option.)
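
A sketch of what that deprecation could look like. The class name and the String-typed signature are assumptions for illustration, and the @Deprecated annotation requires a newer Java version than the ones in use on this list; on older JDKs only the Javadoc tag applies.

```java
// Hypothetical stand-in for the utility class under discussion.
class CharsetUtilSketch {
    /**
     * @deprecated Converting a String through two different charsets can
     *             corrupt characters. Encode with String.getBytes at the
     *             point where bytes are actually needed.
     */
    @Deprecated
    static String toUsingCharset(String target, String fromCharset, String toCharset) {
        // Returning the input unchanged guarantees no character is altered.
        return target;
    }

    public static void main(String[] args) {
        String s = "Gl\u00fcck";
        System.out.println(toUsingCharset(s, "UTF-8", "ISO-8859-1").equals(s)); // true
    }
}
```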

Adrian Sutton, Software Engineer
Ephox Corporation
www.ephox.com





Re: The use of UTIUtil.toUsingCharset?

2003-02-20 Thread Eric J Johnson
oh no!

You should keep your name as is since you were here
first.  But are you sure I'm not really one of your
split personalities ;)

EJJ

(who changed his send line in order to avoid
masquerading as the 'good' Eric)

 
 -eej.
 
 P.S. I changed my name on send line, so as to avoid
 being confused with 
 the newcomer also known as Eric Johnson.  Just my
 luck.  I bet some of 
 us share the same birthday too.  If only I
 contributed enough to be 
 blessed with a Middle-Earth name, then I wouldn't
 have to worry about 
 ambiguity!
 
 





Re: The use of UTIUtil.toUsingCharset?

2003-02-05 Thread Ortwin Glück
Thanks Laura for this excellent explanation. This really helps to clear 
things up! I am glad to have you and your in-depth Unicode knowledge on 
the list.

I always thought you could roundtrip any charset to Unicode and get the 
same thing back. This is obviously wrong. It should be easy to write a 
test case for this once we have some of those characters.
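
Until the actual problematic characters are posted, a round-trip check could be sketched as below. The helper name is hypothetical, and the US-ASCII case is only a stand-in that is known to fail; it is not one of the East Asian cases discussed in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for experimenting with charset round trips.
class RoundTripCheck {
    /** Returns true if bytes -> String -> bytes in the same charset reproduces the input. */
    static boolean roundTrips(byte[] original, Charset charset) {
        String decoded = new String(original, charset);
        return Arrays.equals(original, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        // "Glück" encoded as ISO-8859-1; 0xFC is the u with umlaut.
        byte[] latin1 = { 0x47, 0x6C, (byte) 0xFC, 0x63, 0x6B };
        System.out.println(roundTrips(latin1, StandardCharsets.ISO_8859_1)); // true
        // 0xFC is not a valid US-ASCII byte, so it is replaced during decoding.
        System.out.println(roundTrips(latin1, StandardCharsets.US_ASCII));   // false
    }
}
```

Once real byte sequences are available, they can be fed to roundTrips with Charset.forName and the charset they came from.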

Sung-Gu: Could you please post some of these problematic characters (hex 
values in different encodings and Unicode)? You are probably the only 
one who has knowledge of Asian languages here.

Hopefully we can find an adequate solution for the problem now.

Cheers

Odi






Re: The use of UTIUtil.toUsingCharset?

2003-02-04 Thread Ortwin Glück
Sung-Gu wrote:

There isn't any uni-one to support the various charsets.(Let you regard it!)
Then, once it was tranformed, it should be tranformed back to the original.
That makes the transformed one to the original one.


Sung-Gu,

I have problems understanding your English and I can only guess what you 
want to say.

Do you mean that there are characters that have no representation in 
Unicode? Your method uses String objects, which means Unicode! If there 
are characters not present in the Unicode set, they can not be handled 
by the String class. You must use byte[] in this case.

You speak of transformation. What sort of transformation is that? The 
only transformation your method does is, it replaces some characters 
with '?'.
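
The '?' replacement is easy to reproduce. A small sketch, with a hypothetical class name and US-ASCII chosen only because it cannot represent the umlaut:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of lossy encoding through a charset that lacks the character.
class ReplacementDemo {
    /** Encodes to US-ASCII and decodes again; unmappable characters become '?'. */
    static String throughAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(throughAscii("Gl\u00fcck")); // Gl?ck
    }
}
```

String.getBytes substitutes the charset's replacement byte without raising an error, so the loss goes unnoticed unless the result is compared with the input.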

Odi






Re: The use of UTIUtil.toUsingCharset?

2003-02-04 Thread Sung-Gu

- Original Message -
From: Ortwin Glück [EMAIL PROTECTED]

Arrrg...  again...  :(
Not surprising though...  :(((

 by the String class. You must use byte[] in this case.
It was...

 You speak of transformation. What sort of transformation is that? The

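import sun.nio.cs.StandardCharsets;
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.Iterator;

public class ListCharsets {
    public static void main(String[] args) {
        // sun.nio.cs.StandardCharsets is an internal JDK class, used as in the original post
        CharsetProvider standardProvider = new StandardCharsets();
        for (Iterator i = standardProvider.charsets(); i.hasNext();) {
            System.out.println(i.next());
        }
    }
}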

What can you get it?
And what can you do it with them?
Could you please explain to me?

Sung-Gu

P.S.: BTW, it's almost time to go home...





Re: The use of UTIUtil.toUsingCharset?

2003-02-04 Thread Oleg Kalnichevski
Hi Sung-Gu

On Tue, 2003-02-04 at 11:37, Sung-Gu wrote:
 Hi Oleg,
 Again... well..
 Ok... let me try to make you understand it again.  HmmHmm...
 

Let's assume I am stupid

 BTW, sorry to bother you that I haven't got you to get it right away
 at that time even with a diagram and still...  :(
 

Let's assume I am VERY stupid

 Actually, that's very easy...
 And not that important unless it's not going to be support multilingual.
 

C'mon, Java uses Unicode natively to represent strings. I'd like to hope
you are familiar with the concept of Unicode. Unicode automatically
enables multilingual support for all Java String objects. The concept of
character encoding is applicable only to String to byte[] or byte[] to
String transformations.
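
A short sketch of that point: the String itself has no encoding, and the charset only matters at the moment bytes are produced. The class name is hypothetical and StandardCharsets assumes a newer JDK than the ones discussed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: one String, several byte representations.
class EncodingBoundaryDemo {
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // one character: e with acute accent
        System.out.println(text.length());                                      // 1
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1));   // 1
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));        // 2
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE));     // 2
    }
}
```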

Think it over

Oleg






Re: The use of UTIUtil.toUsingCharset?

2003-02-04 Thread Ortwin Glück
Sung-Gu wrote:

- Original Message -
From: Ortwin Glück [EMAIL PROTECTED]

Arrrg...  again...  :(
Not surprising though...  :(((


Sung-Gu, I don't want to upset you. I just want to understand the 
problem that you are trying to solve with toUsingCharset. Your 
explanations did not help so far. Call me stupid but I guess I am not 
the only one here who doesn't understand the problem. (if I am wrong 
could someone else please tell me)

You speak of transformation. What sort of transformation is that? The

import sun.nio.cs.StandardCharsets;



Maybe you could just answer the following questions with yes or no each:

1. Is the problem related to characters that have no Unicode code 
assigned?

2. Is the problem that you want to pass non-ISO-8859-1 data in POST or 
GET parameters?

3. Is a String object capable of containing characters that have no 
Unicode representation?

4. Is a byte[] capable of containing characters that have no Unicode 
representation?


-
CharsetProvider standardProvider = new StandardCharsets();
for (Iterator i = standardProvider.charsets(); i.hasNext();) {
System.out.println(i.next());
}

What can you get it?
And what can you do it with them?
Could you please explain to me?
--

A Charset instance can convert String objects to byte[] and vice versa 
using a specific encoding. Charset instances are created by the 
CharsetProvider. These classes are new as of JDK 1.4. In earlier JDKs 
these interfaces were buried deep inside the Sun implementation and not 
for public use.
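
The same information is reachable through the public java.nio.charset API without touching sun.* classes. A minimal sketch, with hypothetical class and method names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical demo of the public Charset API.
class CharsetApiDemo {
    /** Encodes text with the named charset and returns exactly the encoded bytes. */
    static byte[] encodeWith(String charsetName, String text) {
        ByteBuffer buf = Charset.forName(charsetName).encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** Decodes bytes with the named charset. */
    static String decodeWith(String charsetName, byte[] bytes) {
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Lists every charset the running JVM supports, by canonical name.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
        byte[] utf8 = encodeWith("UTF-8", "\u00fc");
        System.out.println(utf8.length);                               // 2
        System.out.println(decodeWith("UTF-8", utf8).equals("\u00fc")); // true
    }
}
```

Which charsets appear in the list depends on the JVM; only a small set (such as US-ASCII, ISO-8859-1 and UTF-8) is guaranteed on every Java platform.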


HTH

Odi





Re: The use of UTIUtil.toUsingCharset?

2003-02-04 Thread Laura Werner
Hi Sung-Gu,


Actually, that's very easy...
And not that important unless it's not going to be support multilingual.

As you see the diagram, bytes informations created from the original charset
should be restored.  That's all.
 

My understanding of what you're saying is that if someone constructs a 
URI using escaped characters in a particular charset (e.g. Big-5), using 
the URI(char[] escaped) constructor, then URI needs to preserve those 
characters.  If someone asks for the URI back as an escaped string in 
the original charset (e.g. Big-5 again), we need to give them the 
*exact* original string; it's not good enough to transcode from the 
escaped Big-5 string to Unicode and back to Big-5.  Is this correct?
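
If that reading is right, one way to meet the requirement is to keep the escaped text exactly as supplied, paired with its charset name, and never regenerate it from a decoded String. A sketch under that assumption; the class below is hypothetical and is not the URI class from HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder that preserves the caller's escaped form verbatim.
final class EscapedUriSketch {
    private final String escaped;   // exactly what the caller supplied
    private final Charset charset;  // tells readers what the escaped bytes mean

    EscapedUriSketch(String escaped, Charset charset) {
        this.escaped = escaped;
        this.charset = charset;
    }

    /** Returns the original escaped text; nothing is decoded and re-encoded. */
    String getEscaped() {
        return escaped;
    }

    Charset getCharset() {
        return charset;
    }

    public static void main(String[] args) {
        // Big5 is not guaranteed on every JVM, so fall back if it is missing.
        Charset cs = Charset.isSupported("Big5")
            ? Charset.forName("Big5") : StandardCharsets.ISO_8859_1;
        EscapedUriSketch uri = new EscapedUriSketch("/path/%A4%A4", cs);
        System.out.println(uri.getEscaped()); // /path/%A4%A4
    }
}
```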

If this is true, I have a few comments on why this matters...

-- First, for those who don't understand why you can't just convert 
everything to Unicode and stop worrying, there is some sense behind 
this.  When Unicode was invented, the far-east languages were Unified 
into the Han block of Unicode.  Some characters that have distinct codes 
in the native double-byte character sets were mapped to single Unicode 
characters.  This meant that some native character sets wouldn't round 
trip to Unicode and back.  It was essentially a political compromise -- 
the Unicode folks needed to save space in the 64k base plane, so they 
merged Han characters that meant very similar things and looked almost 
exactly the same.  (Emphasis on "similar" and "almost.")  But in native 
charsets that didn't need to have room for Korean and Cyrillic and all 
the other stuff that's in Unicode, there's room to split out multiple 
versions of these characters that are merged together.

-- There are also a few new character sets like JIS-212 that contain 
characters (like Japanese dental symbols, believe it or not) that 
haven't been encoded in Unicode yet.  Presumably we'd want to keep the 
encoded URI string around so that we can preserve this kind of character.
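
For bytes that have no Unicode mapping at all, a strict decoder can at least report the problem instead of silently substituting a replacement character. A minimal sketch with a hypothetical class name; US-ASCII is used only as a stand-in that is certain to reject the sample byte. It will not detect the merged-character case described above, since those bytes decode without error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that decodes strictly rather than leniently.
class StrictDecodeSketch {
    /** Returns true only if every byte sequence maps to a Unicode character. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // caller should keep the original bytes
        }
    }

    public static void main(String[] args) {
        System.out.println(decodesCleanly(new byte[] { 0x41 }, StandardCharsets.US_ASCII));        // true
        System.out.println(decodesCleanly(new byte[] { (byte) 0xFC }, StandardCharsets.US_ASCII)); // false
    }
}
```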

(In a past life I managed the Unicode group at IBM, and I remember far 
more of this stuff than I thought I did.)

A few comments on URI.java and URIUtil.java

-- I think the comments need to be greatly improved.  It's very hard to 
figure out what many of the methods do.  In the cases where I can figure 
out what they do, it's hard to figure out *why*. 

-- It would be nice if the documentation explained the charset concepts: 
What is a document charset and a protocol charset and so on.  A 
reference to the RFC is nice, but a more concise explanation in the 
JavaDoc would be better.

Laura, hoping I helped answer part of the why here, at least





Re: The use of UTIUtil.toUsingCharset?

2003-02-04 Thread Sung-Gu

- Original Message -
From: Laura Werner [EMAIL PROTECTED]


 Hi Sung-Gu,

 Actually, that's very easy...
 And not that important unless it's not going to be support multilingual.
 
 As you see the diagram, bytes informations created from the original
charset
 should be restored.  That's all.
 
 
 My understanding of what you're saying is that if someone constructs a
 URI using escaped characters in a particular charset (e.g. Big-5), using
 the URI(char[] escaped) constructor, then URI needs to preserve those
 characters.  If someone asks for the URI back as an escaped string in
 the original charset (e.g. Big-5 again), we need to give them the
 *exact* original string; it's not good enough to transcode from the
 escaped Big-5 string to Unicode and back to Big-5.  Is this correct?

 If this is true, I have a few comments on why this matters...

 -- First, for those who don't understand why you can't just convert
 everything to Unicode and stop worrying, there is some sense behind
 this.  When Unicode was invented, the far-east languages were Unified
 into the Han block of Unicode.  Some characters that have distinct codes
 in the native double-byte character sets were mapped to single Unicode
 characters.  This meant that some native character sets wouldn't round
 trip to Unicode and back.  It was essentially a political compromise --
 the Unicode folks needed to save space in the 64k base plane, so they
 merged Han characters that meant very similar things and looked almost
 exactly the same.  (Emphasis on "similar" and "almost.")  But in native
 charsets that didn't need to have room for Korean and Cyrillic and all
 the other stuff that's in Unicode, there's room to split out multiple
 versions of these characters that are merged together.

 -- There are also a few new character sets like JIS-212 that contain
 characters (like Japanese dental symbols, believe it or not) that
 haven't been encoded in Unicode yet.  Presumably we'd want to keep the
 encoded URI string around so that we can preserve this kind of character.

 (In a past life I managed the Unicode group at IBM, and I remember far
 more of this stuff than I thought I did.)

Excellent explanation!
It is described at a URL that I pointed to on this mailing list
before.
I think yours is much nicer! ;)

 A few comments on URI.java and URIUtil.java

 -- I think the comments need to be greatly improved.  It's very hard to


Not enough to just comment it out... I think...
Some article about this is written already in URI class for someone
to notice that...and something is still left to do... as your comment...

 figure out what many of the methods do.  In the cases where I can figure
 out what they do, it's hard to figure out *why*.


 -- It would be nice if the documentation explained the charset concepts:
 What is a document charset and a protocol charset and so on.  A
 reference to the RFC is nice, but a more concise explanation in the
 JavaDoc would be better.

Actually, my problem is the fact that I just know how to, I guess.
It's hard for me to understand someones not to experience that
I think I will have a chance sometime later...

 Laura, hoping I helped answer part of the why here, at least

Thank you very much, Laura! ;)

Sung-Gu





Re: The use of UTIUtil.toUsingCharset?

2003-02-04 Thread Oleg Kalnichevski
Laura

Finally, there's someone who can read Sung-Gu's mind! 

All right. A simple phrase "There are charsets that are not adequately
represented in Unicode" by Sung-Gu would have put the discussion into a
completely different perspective. And of course, Sung-Gu's stoical
refusal to provide a test case for the method did not help either. 

Many thanks

Oleg 







Re: The use of UTIUtil.toUsingCharset?

2003-01-26 Thread Ortwin Glück
Sung-Gu,

From your diagram I do not see anything that is not supported by 
standard Java String handling. I still think this method is unnecessary.

Your test case does not contain a single assertion. Printing out 
garbage to the console doesn't make sense. PLEASE PROVIDE AN ORDINARY 
JUNIT TEST CASE!

Sung-Gu wrote:
Well, it's done a bit...

Sung-Gu



