Yep, that's my position. HTTP (and hence HttpClient) is about moving the data around. The _interpretation_ of the data should be left up to a higher application layer.
Marc Saegesser

> -----Original Message-----
> From: Rapheal Kaplan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, March 21, 2002 12:57 PM
> To: Jakarta Commons Developers List
> Subject: RE: [HttpClient]Encoding
>
> I think I understand why Marc wanted to leave that level of support
> outside the HttpClient API. As long as the client deals strictly with
> binary streams, there is no reason why a higher-level part of the
> application can't handle the encoding issues.
>
> In essence, Marc has said that the fact that the API doesn't handle
> encoding is intentional, and the ...AsString method is not really meant
> to be used for proper display encoding. I think that handling the
> encoding properly, the way that Marc described, would require changes
> to the actual API, and is really best handled somewhere else.
>
> - Rapheal Kaplan
>
> -----Original Message-----
> From: Sung-Gu [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, March 21, 2002 1:18 PM
> To: [EMAIL PROTECTED]
> Cc: Slide Developers Mailing List
> Subject: Re: [HttpClient]Encoding
>
> I'm sure that you guys are talking about the character set (= character
> encoding in MIME) in HTTP. I added my comments below. ;)
>
> Sung-Gu
>
> ----- Original Message -----
> Subject: Re: [HttpClient]Encoding
>
> > I'll see about changing getResponseBodyAsString() to use the charset
> > from the content-type (if it exists). I'm up to my ears with day job
> > work right now, so it'll probably be a while before I can get to it.
>
> I think we'll need to support language tags (within the Accept-Language
> and Content-Language fields) and Accept and Content-Type (for internet
> media types) at some point.
>
> > People still need to understand (and I'll improve the JavaDoc) that
> > getResponseBodyAsString() is never really going to be all that useful
> > in the real world. From HttpClient's perspective the response body is
> > simply a sequence of bytes, nothing more.
> > It is up to a higher application layer to actually *interpret* those
> > bytes based on the mime type specified in the content-type header.
> >
> > Marc Saegesser
> >
> > > -----Original Message-----
> > > From: Rapheal Kaplan [mailto:[EMAIL PROTECTED]]
> > > Sent: Wednesday, March 20, 2002 1:53 PM
> > > To: Jakarta Commons Developers List
> > > Subject: Re: [HttpClient]Encoding
> > >
> > > Makes sense to me. Because the encoding is handled in the body
> > > itself, it doesn't necessarily help that much to set the encoding
> > > in the getResponseBodyAsString method. Also, this kind of means
> > > that you can't rely on the getResponseBodyAsString method for all
> > > purposes. There needs to be some other layer of a client
> > > application that manages encoding.
> > >
> > > I still see the use of get...AsString, of course. It could be an
> > > in-between step that is sent to a parser to determine the actual
> > > encoding, but then you would need to return to the original byte
> > > stream anyway to re-string the body. Maybe the documentation should
> > > reflect this information.
> > >
> > > Also, if people start using charset info in the future, it would
> > > probably be nice to provide support. It might be that doing body to
> > > string conversion should be somewhere else in the API. Any ideas?
> > >
> > > My first guess would be to have a utility class that can do the
> > > correct encoding, from both the header and maybe even parsing the
> > > content. However, I don't think I am familiar enough with the API
> > > to say decisively.
> > >
> > > I do know that such features might be very useful for some work
> > > that I need to do in the near future. I am working on software that
> > > needs to interact with several languages with non-Latin character
> > > sets.
>
> In your previous mail,
> > For example, if the client is requesting a document written in
> > Chinese, it could well use an entirely different encoding.
>
> If you want to solve this problem purely from the perspective of
> character encoding, you should consider the conversion between the
> local character set and the transfer character set on both the client
> and the server side.
>
> It can get more complicated!
> If you use mixed non-ASCII characters (Korean and Chinese...), you
> should also handle bi-directional display for these character sets.
> Then you need a two-step process to convert between the local
> character set and UTF-8. First, convert the local character set to
> UCS. Second, convert UCS to UTF-8. How complicated, huh?
>
> And one more!
> Some old clients or servers don't support 8-bit transfer encodings
> like UTF-8. Then what? We should check whether the code is valid UTF-8
> or not.
>
> However, there is an easier way to solve this problem.
> ( I WANT to say this a bit! ^^ )
> That is to use an "escaped encoding" that uses the ASCII character set
> only. It looks like the application/x-www-form-urlencoded media type
> in HTML, but it's somewhat different.
>
> > > - Rapheal Kaplan
> > >
> > > On Wednesday 20 March 2002 14:27, you wrote:
> > > > I've had to deal with this problem myself. Right now the only
> > > > solution is to use getResponseBody() and convert bytes into a
> > > > string using the appropriate encoding. I like the idea of having
> > > > getResponseBodyAsString() use the encoding specified in the
> > > > Content-Type header, but the problem is that it still won't be
> > > > very useful.
> > > >
> > > > The vast majority of web servers out there don't include a
> > > > "; charset=" attribute in the content-type header or provide a
> > > > reasonable mechanism for content authors to cause the server to
> > > > set the attribute correctly on a per-file basis.
> > > > Most pages with non-ISO-LATIN-1 charsets use a <META HTTP-EQUIV>
> > > > tag in the HTML header to specify the page encoding. That means
> > > > you still have to read at least part of the response body (as
> > > > ISO-LATIN-1) in order to determine the correct encoding.
> > > >
> > > > I don't have a problem with changing getResponseBodyAsString() to
> > > > check the content-type header, I just doubt that doing that will
> > > > make it much more useful in the real world.
> > > >
> > > > What do others think?
> > > >
> > > > Marc Saegesser
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
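
For illustration only, the "utility class" idea raised in the thread might look roughly like the sketch below. It is plain Java and not part of the HttpClient API: the BodyDecoder name, the 4096-byte sniff limit, and the ISO-8859-1 fallback are all assumptions made for this example. It takes the raw bytes (for instance from getResponseBody()) plus the Content-Type header value, prefers a "; charset=" attribute from the header, and otherwise reads the start of the body as ISO-8859-1 to look for a charset declared in a META HTTP-EQUIV tag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of HttpClient. */
final class BodyDecoder {

    // Matches charset=NAME, as found in a Content-Type value or in the
    // content attribute of a META HTTP-EQUIV tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // How many leading bytes to scan for a declaration (an arbitrary choice).
    private static final int SNIFF_LIMIT = 4096;

    private BodyDecoder() {}

    /** Returns the charset named in the text, or null if none is named or supported. */
    static Charset findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "not specified".
            return null;
        }
    }

    /** Picks a charset: header first, then a crude scan of the body prefix, then ISO-8859-1. */
    static Charset detect(String contentType, byte[] body) {
        Charset fromHeader = findCharset(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        // Read the start of the body as ISO-8859-1, which maps every byte to a
        // character, and look for any charset=... declaration in that prefix.
        int n = Math.min(body.length, SNIFF_LIMIT);
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Charset fromMeta = findCharset(head);
        return fromMeta != null ? fromMeta : StandardCharsets.ISO_8859_1;
    }

    /** Decodes the whole body using the detected charset. */
    static String decode(String contentType, byte[] body) {
        return new String(body, detect(contentType, body));
    }
}
```

A caller would pass the header value and the raw bytes, e.g. BodyDecoder.decode(contentTypeHeaderValue, bodyBytes). The body scan is deliberately simple: it does not parse HTML, so it can be fooled by a "charset=" string that appears outside a META tag, and it does nothing for non-HTML content.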