Re: The use of UTIUtil.toUsingCharset?
Eric E Johnson wrote:

> As for why these functions exist, I keep thinking along these lines: imagine you want to encode foreign language characters in a URL. The way to do it is to convert your string into bytes, and then URL encode the bytes as if they were ASCII. Reversing the process, take your URL, decode it into ASCII, treat each character as a byte, and then convert those bytes back via the expected encoding.
>
> So you can imagine that the first step would be precisely what these routines do: a conversion of a String into byte encoding XXX, and then back into a String in encoding YYY, where YYY almost certainly is ASCII. Having done that, you can use all your functions that URL encode a String instead of writing an additional function that takes bytes.
>
> Unfortunately, if the encoding YYY has any characters outside the 0-255 range, you'd be hosed, and the documentation doesn't say that.

This is correct modulo all the phrases with the word ASCII in them. It's just about a sequence of bytes and has nothing to do with ASCII (which is 7-bit only, by the way).

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
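For readers following along, the byte-oriented view of URL escaping discussed in this message can be sketched roughly as below. This is only an illustration under stated assumptions, not the HttpClient code: the class name `PercentEncodingSketch`, the methods `percentEncode` and `percentDecode`, and the choice of which characters are left unescaped are all hypothetical, and the decoder assumes well-formed input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HttpClient.
final class PercentEncodingSketch {

    /** Converts text to bytes in the given charset, then escapes every byte outside an assumed safe set. */
    static String percentEncode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (isUnreserved(value)) {
                out.append((char) value);
            } else {
                out.append('%').append(String.format("%02X", value));
            }
        }
        return out.toString();
    }

    /** Reverses percentEncode: collects raw bytes first, then decodes them with the expected charset. */
    static String percentDecode(String escaped, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%') {
                // Assumes a well-formed "%XX" escape; malformed input is not handled here.
                bytes.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                // Assumes unescaped characters are plain ASCII, so one char maps to one byte.
                bytes.write(c);
                i++;
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    // Assumed safe set: ASCII letters, digits, and "-", ".", "_", "~".
    private static boolean isUnreserved(int v) {
        return (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                || (v >= '0' && v <= '9') || v == '-' || v == '.' || v == '_' || v == '~';
    }

    public static void main(String[] args) {
        String escaped = percentEncode("caf\u00e9", StandardCharsets.UTF_8);
        System.out.println(escaped);
        System.out.println(percentDecode(escaped, StandardCharsets.UTF_8));
    }
}
```

Note that the intermediate form here is a byte[], never a String in a second charset, which is the point made in the reply: the escaping step only needs a sequence of bytes.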
Re: The use of UTIUtil.toUsingCharset?
Ortwin Glück wrote:

> This is correct modulo all the phrases with the word ASCII in them. It's just about a sequence of bytes and has nothing to do with ASCII (which is 7-bit only, by the way).

Yes, of course. I'm not very good with the names of my encodings, just some of the issues surrounding them. I was merely trying to come up with a plausible explanation as to why the functions exist in the first place, not an explanation for how they could possibly be considered correct, which I don't think they can be.

-Elric
Re: The use of UTIUtil.toUsingCharset?
Oleg,

> I can't say I completely agree with your point (or understand it), but so be it.

Feel free to ask for clarification. Basically I was trying (in my wordy way) to say that toUsingCharset seems to do two things:

- Convert the Unicode string to an array of bytes using the converter for fromCharset
- Convert the bytes back to Unicode using the converter for toCharset.

This makes no sense to me. When you're doing character-set-aware programming and have an array of bytes, you always need to keep a (byte[], charset name) pair, so you know what the bytes *mean*. The bytes by themselves are just a bit stream; the character set name tells you how to interpret the bits into abstract characters that mean something to a human. toUsingCharset is converting the Unicode string to a bit stream using one mechanism, then converting back to Unicode using another mechanism. I don't know how this could ever do anything useful.

> Had not Sung-Gu refused to provide a simple unit test case for this method, this discussion would have been put to an end a few months ago. But apparently writing test cases is for losers

How about if we just deprecate the @#% thing and the two URIUtil methods that call it?

-- Laura
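The two steps described in this message can be sketched as follows. This is a hypothetical stand-in written for illustration, assuming only the behaviour Laura describes; the class `RecodeSketch` and method `recode` are made-up names and this is not the actual HttpClient source.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the behaviour described in the thread; not the HttpClient implementation.
final class RecodeSketch {

    /** Encodes text to bytes with one charset, then decodes those same bytes with another. */
    static String recode(String text, Charset from, Charset to) {
        byte[] bytes = text.getBytes(from); // step 1: Unicode string to bytes
        return new String(bytes, to);       // step 2: bytes back to Unicode, interpreted differently
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE

        // US-ASCII cannot represent the character, so getBytes substitutes '?'.
        String viaAscii = recode(original, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);

        // The two UTF-8 bytes are read back as two separate ISO-8859-1 characters.
        String mixed = recode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(viaAscii));
        System.out.println(original.equals(mixed));
        System.out.println(mixed.length());
    }
}
```

Only text that happens to have identical byte forms in both charsets (plain ASCII letters, for example) comes back unchanged, which is consistent with the complaint that the method cannot do anything useful for other characters.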
RE: The use of UTIUtil.toUsingCharset?
> How about if we just deprecate the @#% thing and the two URIUtil methods that call it?

For what it's worth, Laura and Oleg, you are completely correct. The toUsingCharset method is 100% guaranteed to screw up characters; the only question is which characters. I would deprecate the code and possibly even change it so that it just returns the original String. That way at least it would not corrupt any characters. (I'm assuming Jandalf won't let us just rip it out altogether at this point, because that would be my preferred option.)

Adrian Sutton, Software Engineer
Ephox Corporation
www.ephox.com
Re: The use of UTIUtil.toUsingCharset?
Oh no! You should keep your name as is, since you were here first. But are you sure I'm not really one of your split personalities? ;)

EJJ (who changed his send line in order to avoid masquerading as the 'good' Eric)

> -eej.
>
> P.S. I changed my name on the send line, so as to avoid being confused with the newcomer also known as Eric Johnson. Just my luck. I bet some of us share the same birthday too. If only I contributed enough to be blessed with a Middle-Earth name, then I wouldn't have to worry about ambiguity!
Re: The use of UTIUtil.toUsingCharset?
Thanks Laura for this excellent explanation. This really helps to clear things up! I am glad to have you and your in-depth Unicode knowledge on the list. I always thought you could round-trip any charset to Unicode and get the same thing back. This is obviously wrong.

It should be easy to write a test case for this once we have some of those characters.

Sung-Gu: Could you please post some of these problematic characters (hex values in different encodings and Unicode)? You are probably the only one here who has knowledge of Asian languages.

Hopefully we can find an adequate solution for the problem now.

Cheers

Odi
Re: The use of UTIUtil.toUsingCharset?
Sung-Gu wrote:

> There isn't any uni-one to support the various charsets. (Let you regard it!) Then, once it was transformed, it should be transformed back to the original. That makes the transformed one to the original one.

Sung-Gu, I have problems understanding your English and I can only guess what you want to say. Do you mean that there are characters that have no representation in Unicode? Your method uses String objects, which means Unicode! If there are characters not present in the Unicode set, they cannot be handled by the String class. You must use byte[] in this case.

You speak of transformation. What sort of transformation is that? The only transformation your method does is to replace some characters with '?'.

Odi
Re: The use of UTIUtil.toUsingCharset?
- Original Message -
From: Ortwin Glück [EMAIL PROTECTED]

Arrrg... again... :( Not surprising though... :(((

> by the String class. You must use byte[] in this case.

It was...

> You speak of transformation. What sort of transformation is that?

The

    import sun.nio.cs.StandardCharsets;
    import java.nio.charset.Charset;
    import java.nio.charset.spi.CharsetProvider;
    import java.util.Iterator;

    main

    CharsetProvider standardProvider = new StandardCharsets();
    for (Iterator i = standardProvider.charsets(); i.hasNext();) {
        System.out.println(i.next());
    }

What can you get from it? And what can you do with them? Could you please explain to me?

Sung-Gu

P.S.: BTW, it's almost time to go home...
Re: The use of UTIUtil.toUsingCharset?
Hi Sung-Gu

On Tue, 2003-02-04 at 11:37, Sung-Gu wrote:

> Hi Oleg,
>
> Again... well.. Ok... let me try to make you understand it again. HmmHmm...

Let's assume I am stupid

> BTW, sorry to bother you that I haven't got you to get it right away at that time even with a diagram and still... :(

Let's assume I am VERY stupid

> Actually, that's very easy... And not that important unless it's not going to be support multilingual.

C'mon, Java uses Unicode natively to represent strings. I'd like to hope you are familiar with the concept of Unicode. Unicode automatically enables multilingual support for all Java String objects. The concept of character encoding is applicable only to String to byte[] or byte[] to String transformations.

Think it over

Oleg
Re: The use of UTIUtil.toUsingCharset?
Sung-Gu wrote:

> - Original Message -
> From: Ortwin Glück [EMAIL PROTECTED]
>
> Arrrg... again... :( Not surprising though... :(((

Sung-Gu, I don't want to upset you. I just want to understand the problem that you are trying to solve with toUsingCharset. Your explanations have not helped so far. Call me stupid, but I guess I am not the only one here who doesn't understand the problem. (If I am wrong, could someone else please tell me.)

> > You speak of transformation. What sort of transformation is that?
>
> The import sun.nio.cs.StandardCharsets;

Maybe you could just answer the following questions with yes or no each:

1. Is the problem related to characters that have no Unicode code assigned?
2. Is the problem that you want to pass non ISO-8859-1 data in POST or GET parameters?
3. Is a String object capable of containing characters that have no Unicode representation?
4. Is a byte[] capable of containing characters that have no Unicode representation?

> CharsetProvider standardProvider = new StandardCharsets();
> for (Iterator i = standardProvider.charsets(); i.hasNext();) {
>     System.out.println(i.next());
> }
>
> What can you get from it? And what can you do with them? Could you please explain to me?

A Charset instance can convert String objects to byte[] and vice versa using a specific encoding. Charset instances are supplied by the CharsetProvider. These classes are new as of JDK 1.4. In earlier JDKs these interfaces were buried deep inside the Sun implementation and not for public use.

HTH

Odi
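As a rough illustration of the Charset conversion described in this reply, the sketch below uses only the public `java.nio.charset.Charset` API rather than the internal `sun.nio.cs` class from the quoted snippet. The class name `CharsetSketch` and its helper methods are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Illustrative helper; the names here are assumptions, not part of any library.
final class CharsetSketch {

    /** String to byte[] using the named charset. */
    static byte[] encode(String text, String charsetName) {
        ByteBuffer buffer = Charset.forName(charsetName).encode(text);
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    /** byte[] to String using the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Public way to list the charsets known to the running JVM.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }

        byte[] latin1 = encode("\u00e9", "ISO-8859-1");
        System.out.println(latin1.length);
        System.out.println(decode(latin1, "ISO-8859-1").equals("\u00e9"));
    }
}
```

The charset name travels with every call, which matches the earlier point in the thread that bytes only have meaning as a (byte[], charset name) pair.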
Re: The use of UTIUtil.toUsingCharset?
Hi Sung-Gu,

> Actually, that's very easy... And not that important unless it's not going to be support multilingual. As you see the diagram, bytes informations created from the original charset should be restored. That's all.

My understanding of what you're saying is that if someone constructs a URI using escaped characters in a particular charset (e.g. Big-5), using the URI(char[] escaped) constructor, then URI needs to preserve those characters. If someone asks for the URI back as an escaped string in the original charset (e.g. Big-5 again), we need to give them the *exact* original string; it's not good enough to transcode from the escaped Big-5 string to Unicode and back to Big-5. Is this correct?

If this is true, I have a few comments on why this matters...

-- First, for those who don't understand why you can't just convert everything to Unicode and stop worrying, there is some sense behind this. When Unicode was invented, the Far East languages were unified into the Han block of Unicode. Some characters that have distinct codes in the native double-byte character sets were mapped to single Unicode characters. This meant that some native character sets wouldn't round trip to Unicode and back. It was essentially a political compromise: the Unicode folks needed to save space in the 64k base plane, so they merged Han characters that meant very similar things and looked almost exactly the same. (Emphasis on "similar" and "almost".) But in native charsets that didn't need to have room for Korean and Cyrillic and all the other stuff that's in Unicode, there's room to split out multiple versions of these characters that are merged together.

-- There are also a few new character sets like JIS-212 that contain characters (like Japanese dental symbols, believe it or not) that haven't been encoded in Unicode yet. Presumably we'd want to keep the encoded URI string around so that we can preserve this kind of character.

(In a past life I managed the Unicode group at IBM, and I remember far more of this stuff than I thought I did.)

A few comments on URI.java and URIUtil.java

-- I think the comments need to be greatly improved. It's very hard to figure out what many of the methods do. In the cases where I can figure out what they do, it's hard to figure out *why*.

-- It would be nice if the documentation explained the charset concepts: what is a document charset and a protocol charset and so on. A reference to the RFC is nice, but a more concise explanation in the JavaDoc would be better.

Laura, hoping I helped answer part of the why here, at least
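The round-trip concern raised in this message can be checked mechanically. The sketch below is a minimal, hypothetical helper (`RoundTripSketch.roundTrips` is a made-up name) that tests whether a given byte sequence survives decoding to a String and re-encoding in the same charset. It does not reproduce the specific Big-5 or JIS cases, since the thread does not list the affected characters; the examples use charsets where the outcome is unambiguous.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check for illustration; the thread does not define such a helper.
final class RoundTripSketch {

    /** Returns true if bytes decode to a String and encode back to exactly the same bytes. */
    static boolean roundTrips(byte[] original, Charset charset) {
        String decoded = new String(original, charset);
        byte[] encodedAgain = decoded.getBytes(charset);
        return Arrays.equals(original, encodedAgain);
    }

    public static void main(String[] args) {
        byte[] single = {(byte) 0xE9};

        // ISO-8859-1 maps every byte value to a character, so any byte sequence survives.
        System.out.println(roundTrips(single, StandardCharsets.ISO_8859_1));

        // 0xE9 is not valid US-ASCII; it is replaced on decoding, so the original byte is lost.
        System.out.println(roundTrips(single, StandardCharsets.US_ASCII));
    }
}
```

A check of this shape could be the core of the test case requested elsewhere in the thread, once concrete byte values for the problematic characters are available.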
Re: The use of UTIUtil.toUsingCharset?
- Original Message -
From: Laura Werner [EMAIL PROTECTED]

> (In a past life I managed the Unicode group at IBM, and I remember far more of this stuff than I thought I did.)

Excellent explanation! It is described at a URL that I pointed to on this mailing list before, though. I think yours is much nicer! ;)

> A few comments on URI.java and URIUtil.java
> -- I think the comments need to be greatly improved. It's very hard to

Not enough to just comment it out... I think... Some article about this is already written in the URI class for someone to notice that... and something is still left to do... as your comment...

> figure out what many of the methods do. In the cases where I can figure out what they do, it's hard to figure out *why*.
> -- It would be nice if the documentation explained the charset concepts: what is a document charset and a protocol charset and so on. A reference to the RFC is nice, but a more concise explanation in the JavaDoc would be better.

Actually, my problem is the fact that I just know how to, I guess. It's hard for me to understand someone who has not experienced that. I think I will have a chance sometime later...

> Laura, hoping I helped answer part of the why here, at least

Thank you very much, Laura! ;)

Sung-Gu
Re: The use of UTIUtil.toUsingCharset?
Laura

Finally, there's someone who can read Sung-Gu's mind! All right. A simple phrase from Sung-Gu, "There are charsets that are not adequately represented in Unicode", would have put the discussion into a completely different perspective. And of course, Sung-Gu's stoical refusal to provide a test case for the method did not help either.

Many thanks

Oleg

On Tue, 2003-02-04 at 22:51, Laura Werner wrote:

> My understanding of what you're saying is that if someone constructs a URI using escaped characters in a particular charset (e.g. Big-5), using the URI(char[] escaped) constructor, then URI needs to preserve those characters. If someone asks for the URI back as an escaped string in the original charset (e.g. Big-5 again), we need to give them the *exact* original string; it's not good enough to transcode from the escaped Big-5 string to Unicode and back to Big-5. Is this correct?
Re: The use of UTIUtil.toUsingCharset?
Sung-Gu,

From your diagram I do not see anything that is not supported by standard Java String handling. I still think this method is unnecessary.

Your test case does not contain a single assertion. Printing out garbage to the console doesn't make sense.

PLEASE PROVIDE AN ORDINARY JUNIT TEST CASE!

Sung-Gu wrote:

> Well, it's done a bit...
>
> Sung-Gu