Re: [VOTE] Re: 2.0 release - deprecate some methods?

Adrian Sutton Thu, 26 Jun 2003 06:03:31 -0700

On Thursday, June 26, 2003, at 10:05 PM, Sung-Gu wrote:

-1 My vote... :D <snip> I've made sample cases and posted it before. (even if it's not a normal junit testcase though.) And I'm not willing to make testcase for that. I'm not interested in unicode values... at all...

Sung-Gu, we are in this position because the simple test cases you've provided have not been sufficient for the rest of the developers to understand how these methods work. In fact, I am convinced that the methods are completely flawed and do not work correctly. We need a clear set of test cases to outline exactly what should come out of the methods given a particular input; without that we will probably never understand what the methods should do.

Whether or not you are interested in unicode values is largely irrelevant. Test cases *have* to be created for that code because it is currently believed to be broken, so we need a way to show conclusively that it either works or doesn't work.

To help you guys, you could find the above sentences (it means that's from RFC) and more specified how-to in a RFC describing FTP protocol, as I guess... - I don't remember at all... :( ...

If you don't know why the code would be useful or what it was implemented based upon, why is it that you still want it in HttpClient? There is nothing that uses those methods anywhere in HttpClient and the presence of an FTP RFC that requires them still wouldn't make them applicable to HttpClient since we aren't dealing with FTP.

Now, I have created a test case which clearly shows how the toUsingCharset method is broken:

public void testToUsingCharset() throws Exception {
        String input = String.valueOf('\u4E01');
        String temporary = URIUtil.toUsingCharset(input, "UTF-8", "Big5");
        String result = URIUtil.toUsingCharset(temporary, "Big5", "UTF-8");
        assertEquals(input, result);
}

Currently this test case fails because it is not possible to convert a String to bytes using one charset and then convert those bytes back into a String using a different charset without mangling the characters. For reference:

* \u4E01 is a Chinese character. You can substitute \uCBBF for a wide range of Chinese characters and the test will still fail.

* Big5 is a very commonly used charset for Chinese characters.

* The test case should succeed: if toUsingCharset works correctly, it must not lose information during the conversion. Otherwise the data would be corrupted, and in the case of a URL it would become ambiguous (by corrupting the data, two previously unique URLs could become identical).
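The failure can be reproduced with plain JDK calls. The sketch below is not the URIUtil code: `recode` is a hypothetical stand-in that assumes the method encodes with the "from" charset and decodes with the "to" charset, and it swaps Big5 for US-ASCII only because US-ASCII is guaranteed to be present on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the conversion under discussion; the real
// URIUtil.toUsingCharset implementation may differ.
final class RoundTripDemo {

    // Encode the String with one charset, then decode the bytes with another.
    static String recode(String target, Charset from, Charset to) {
        return new String(target.getBytes(from), to);
    }

    public static void main(String[] args) {
        String input = String.valueOf('\u4E01');
        // The UTF-8 bytes are not valid US-ASCII, so decoding replaces them.
        String temporary = recode(input, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
        // Converting "back" cannot recover what was already replaced.
        String result = recode(temporary, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println(input.equals(result)); // prints "false"
    }
}
```

With Big5 as the middle charset the details of the damage differ, but the round trip loses information in the same way.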

The flaw in the toUsingCharset method is two-fold. Firstly, Strings in Java are *always* represented as sequences of 16-bit chars (UTF-16 code units), regardless of the charset used for the byte array they were created from. So the contract defined by the JavaDoc for toUsingCharset is impossible to fulfill, since the method takes a String and returns a String (thus, the returned value will always be a String of UTF-16 chars, since that's how Java always represents Strings).

If you read the JavaDoc for the String constructor being used (String(byte[], String)), it says: "Constructs a new String by decoding the specified array of bytes using the specified charset." Note the use of the word "decoding", which means that instead of creating a String backed by the given byte array, it uses the specified charset to convert the bytes into actual characters - conceptually these characters have no particular encoding since they are (conceptually) the actual characters rather than a byte representation of the characters. In reality, the characters are held in memory as 16-bit char values (UTF-16 code units), as defined by the Java Language Specification.
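A small sketch of what "decoding" means in practice: the same character decoded from two different byte encodings yields equal Strings, so a String carries no trace of the charset its bytes came from. The class and method names are illustrative only, and UTF-16BE is used instead of Big5 purely because it is guaranteed to be available.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows that a decoded String is independent of the
// charset of the bytes it was built from.
final class DecodedStringDemo {

    static boolean sameAfterDecoding() {
        byte[] utf8 = {(byte) 0xE4, (byte) 0xB8, (byte) 0x81}; // U+4E01 encoded as UTF-8
        byte[] utf16 = {(byte) 0x4E, (byte) 0x01};             // U+4E01 encoded as UTF-16BE
        String a = new String(utf8, StandardCharsets.UTF_8);
        String b = new String(utf16, StandardCharsets.UTF_16BE);
        // Both are the single character U+4E01.
        return a.equals(b) && a.length() == 1;
    }

    public static void main(String[] args) {
        System.out.println(sameAfterDecoding()); // prints "true"
    }
}
```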

Secondly, the toUsingCharset method cannot work in most situations because it converts the String to bytes using one encoding and then converts those bytes to a String using a different encoding. To highlight why this cannot work, create a text file and save it to disk using ASCII encoding. Then attempt to read the file back in as EBCDIC (or as a multi-byte encoding like UTF-16). The text will have become corrupted because the bytes were mapped to characters using the wrong charset (a charset is simply a mapping between bytes and characters).
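The file experiment can be approximated in memory, since a file on disk is just the byte array. This is a minimal sketch with illustrative names; it uses UTF-16BE as the "wrong" charset because EBCDIC code pages are not guaranteed to be installed on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: bytes written as ASCII, then read back with the
// right and the wrong charset.
final class WrongCharsetDemo {

    // Interpret stored bytes as text using the given charset.
    static String readAs(byte[] stored, Charset charset) {
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        byte[] stored = "hello!".getBytes(StandardCharsets.US_ASCII); // 6 bytes "on disk"
        String right = readAs(stored, StandardCharsets.US_ASCII);
        String wrong = readAs(stored, StandardCharsets.UTF_16BE);
        System.out.println(right.equals("hello!")); // prints "true"
        // Each pair of ASCII bytes is fused into one unrelated character.
        System.out.println(wrong.length());         // prints "3"
    }
}
```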

So, the possible ways for toUsingCharset to fulfill its contract are for it to be changed to:

public String toUsingCharset(String target, String fromCharset, String toCharset) {
        return target;
}

OR to:

public byte[] toUsingCharset(String target, String toCharset) {
        return target.getBytes(toCharset);
}

OR to:

public byte[] toUsingCharset(byte[] target, String fromCharset, String toCharset) {
        return new String(target, fromCharset).getBytes(toCharset);
}

The last one is the only one that makes any sense at all, but I fail to see how it is useful in HttpClient.
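For illustration, here is how the byte[]-to-byte[] variant behaves. The `transcode` name and the use of Charset objects instead of charset-name Strings are assumptions made for this sketch, and UTF-16BE again stands in for Big5 so that the example runs on any JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: decode bytes with their real charset, then encode
// the resulting characters with the target charset.
final class TranscodeDemo {

    static byte[] transcode(byte[] target, Charset from, Charset to) {
        return new String(target, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE4, (byte) 0xB8, (byte) 0x81}; // U+4E01 as UTF-8
        byte[] utf16 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        byte[] back = transcode(utf16, StandardCharsets.UTF_16BE, StandardCharsets.UTF_8);
        // Both charsets can represent the character, so nothing is lost.
        System.out.println(Arrays.equals(utf8, back)); // prints "true"
    }
}
```

The round trip is lossless here only because the bytes are always decoded with the charset they were actually encoded in.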

So Sung-Gu, please provide some justification for your -1 in terms of why the methods should remain in HttpClient - in particular where in HttpClient the method would be used and for what purpose.

Sung-Gu

Regards,

Adrian Sutton.

