RE: [HttpClient]Encoding

Marc Saegesser Thu, 21 Mar 2002 10:58:15 -0800

Yep, that's my position.  HTTP (and hence HttpClient) is about moving the
data around.  The _interpretation_ of the data should be left up to a
higher application layer.
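
To make the "higher application layer" idea concrete, here is a minimal
sketch of what such a layer might do with the raw bytes returned by
getResponseBody().  The class and method names (BodyDecoder, charsetOf,
decode) are hypothetical and not part of HttpClient, and the sketch assumes
a current JDK.  It falls back to ISO-8859-1, the HTTP/1.1 default for text
types, when the Content-Type value carries no usable charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical application-layer helper; not an HttpClient class.
final class BodyDecoder {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8".  Returns ISO-8859-1 when the parameter
    // is missing, malformed, or names a charset this JVM does not support.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim();
                    // The value may be a quoted string; strip the quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Interprets the response body bytes using the charset found above.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }
}
```

A caller would pass the bytes from getResponseBody() together with the value
of the Content-Type response header, for example
BodyDecoder.decode(bytes, "text/html; charset=UTF-8").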


Marc Saegesser 

> -----Original Message-----
> From: Rapheal Kaplan [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, March 21, 2002 12:57 PM
> To: Jakarta Commons Developers List
> Subject: RE: [HttpClient]Encoding
> 
> 
>   I think I understand why Marc wanted to leave that level of support
> outside the HttpClient API.  As long as the client deals strictly with
> binary streams, there is no reason why a higher-level part of the
> application can't handle the encoding issues.
> 
>   In essence, Marc has said that the fact that the API doesn't handle
> encoding is intentional, and that the ...AsString method is not really
> meant to be used for proper display encoding.  I think that handling the
> encoding properly, the way Marc described, would require changes to the
> actual API, and is really best handled somewhere else.
> 
>   - Rapheal Kaplan
> 
> -----Original Message-----
> From: Sung-Gu [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, March 21, 2002 1:18 PM
> To: [EMAIL PROTECTED]
> Cc: Slide Developers Mailing List
> Subject: Re: [HttpClient]Encoding
> 
> 
> 
> I'm sure that you guys are talking about the character set (= character
> encoding in MIME) in HTTP.  I added my comments below.  ;)
> 
> 
> Sung-Gu
> 
> ----- Original Message -----
> Subject: Re: [HttpClient]Encoding
> 
> 
> > I'll see about changing getResponseBodyAsString() to use the charset
> > from the content-type (if it exists).  I'm up to my ears with day job
> > work right now, so it'll probably be a while before I can get to it.
> 
> I think we'll need to support language tags (within the Accept-Language
> and Content-Language fields) and Accept and Content-Type (for internet
> media types) at some point.
> 
> >
> > People still need to understand (and I'll improve the JavaDoc) that
> > getResponseBodyAsString() is never really going to be all that useful
> > in the real world.  From HttpClient's perspective the response body is
> > simply a sequence of bytes, nothing more.  It is up to a higher
> > application layer to actually *interpret* those bytes based on the mime
> > type specified in the content-type header.
> >
> > Marc Saegesser
> >
> > > -----Original Message-----
> > > From: Rapheal Kaplan [mailto:[EMAIL PROTECTED]]
> > > Sent: Wednesday, March 20, 2002 1:53 PM
> > > To: Jakarta Commons Developers List
> > > Subject: Re: [HttpClient]Encoding
> > >
> > >
> > > >   Makes sense to me.  Because the encoding is handled in the body
> > > > itself, it doesn't necessarily help that much to set the encoding
> > > > in the getResponseBodyAsString method.  Also, this kind of means
> > > > that you can't rely on the getResponseBodyAsString method for all
> > > > purposes.  There needs to be some other layer of a client
> > > > application that manages encoding.
> > > >
> > > >   I still see the use of get...AsString, of course.  It could be an
> > > > in-between step that is sent to a parser to determine the actual
> > > > encoding, but then you would need to return to the original byte
> > > > stream anyway to re-string the body.  Maybe the documentation
> > > > should reflect this information.
> > > >
> > > >   Also, if people start using charset info in the future, it would
> > > > probably be nice to provide support.  It might be that doing
> > > > body-to-string conversion should be somewhere else in the API.  Any
> > > > ideas?
> > > >
> > > >   My first guess would be to have a utility class that can do the
> > > > correct encoding, from both the header and maybe even parsing the
> > > > content.  However, I don't think I am familiar enough with the API
> > > > to say decisively.
> > > >
> > > >   I do know that such features might be very useful for some work
> > > > that I need to do in the near future.  I am working on software
> > > > that needs to interact with several languages with non-Latin
> > > > character sets.
> 
> In your previous mail,
> > For example, if the client is requesting a document written in
> > Chinese, it could well use an entirely different encoding.
> 
> If you want to solve this problem purely from the perspective of
> character encoding, you should consider the conversion between the local
> character set and the transfer character set on both the client and the
> server side.
> 
> We can make it more complicated!
> If you use mixed non-ASCII characters (Korean and Chinese...), you
> should handle bi-directional display for these character sets.  Then
> should you take a two-step process for converting between the local
> character set and UTF-8?  First, convert the local character set to the
> UCS.  Second, convert UCS to UTF-8.  How complicated, huh?
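
In Java the two steps described above fall out fairly naturally, because a
String already serves as the Unicode (UCS) intermediate form.  The following
is only a sketch under that assumption; Transcoder and toUtf8 are
hypothetical names, and the caller has to know the local charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the two-step conversion.
final class Transcoder {

    static byte[] toUtf8(byte[] local, Charset localCharset) {
        // Step 1: local character set -> Unicode (a Java String).
        String unicode = new String(local, localCharset);
        // Step 2: Unicode -> UTF-8 bytes.
        return unicode.getBytes(StandardCharsets.UTF_8);
    }
}
```

For Korean text the local charset might be Charset.forName("EUC-KR"),
provided the running JVM ships that charset.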
> 
> And one more!
> Some old clients or servers don't support an 8-bit transfer encoding
> like UTF-8.  Then what?  We should check whether the data is valid UTF-8
> or not.
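
One way to do that validity check, sketched here with the JDK's
java.nio.charset decoder set to report errors instead of silently replacing
bad bytes.  Utf8Check and isValidUtf8 are hypothetical names, not HttpClient
API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation of a byte array.
final class Utf8Check {

    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw on bad input rather
                    // than substituting a replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```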
> 
> 
> However, there is an easier way to solve this problem.
> ( I WANT to say this a bit!  ^^ )
> That is to use an "escaped encoding" that uses the ASCII character set
> only.  It looks like the application/x-www-form-urlencoded media type in
> HTML, but it's somewhat different.
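
The message does not spell out what the "escaped encoding" looks like, so
the following is only one possible reading: percent-escape the UTF-8 bytes
of the text, URI style, so that the result is pure ASCII.  Unlike
application/x-www-form-urlencoded, a space becomes %20 rather than "+".
PercentEscaper and escape are hypothetical names, and the choice of UTF-8
and of the unescaped character set is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: ASCII-only percent-escaping of a string's UTF-8 bytes.
final class PercentEscaper {

    // Characters passed through unescaped (the URI "unreserved" set).
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (UNRESERVED.indexOf(c) >= 0) {
                out.append(c);
            } else {
                // Every other byte is written as %HH in upper-case hex.
                out.append('%').append(String.format("%02X", b & 0xFF));
            }
        }
        return out.toString();
    }
}
```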
> 
> > >
> > >   - Rapheal Kaplan
> > >
> > >
> > >
> > > On Wednesday 20 March 2002 14:27, you wrote:
> > > > > I've had to deal with this problem myself.  Right now the only
> > > > > solution is to use getResponseBody() and convert bytes into a
> > > > > string using the appropriate encoding.  I like the idea of having
> > > > > getResponseBodyAsString() use the encoding specified in the
> > > > > Content-Type header, but the problem is that it still won't be
> > > > > very useful.
> > > >
> > > > > The vast majority of web servers out there don't include a
> > > > > "; charset=" attribute in the content-type header or provide a
> > > > > reasonable mechanism for content authors to cause the server to
> > > > > set the attribute correctly on a per-file basis.  Most pages with
> > > > > non-ISO-LATIN-1 charsets use a <META HTTP-EQUIV> tag in the HTML
> > > > > header to specify the page encoding.  That means you still have
> > > > > to read at least part of the response body (as ISO-LATIN-1) in
> > > > > order to determine the correct encoding.
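
A rough sketch of that "read part of the body as ISO-LATIN-1" step might
look like the following.  MetaCharsetSniffer and sniff are hypothetical
names, the 4096-byte window is an arbitrary assumption, and the regular
expression is a simplification rather than a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for a charset declared in a <META> tag.
final class MetaCharsetSniffer {

    // Simplified pattern: any <meta ...> tag containing charset=NAME.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found in the
    // first 4096 bytes of the body.
    static String sniff(byte[] body) {
        int len = Math.min(body.length, 4096);
        // ISO-8859-1 maps every byte to a character, so this cannot fail
        // and leaves the ASCII markup readable whatever the real encoding.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

Once a name is found, the application would still have to go back to the
original bytes and decode the whole body with that charset.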
> > > >
> > > > > I don't have a problem with changing getResponseBodyAsString() to
> > > > > check the content-type header, I just doubt that doing that will
> > > > > make it much more useful in the real world.
> > > >
> > > > What do others think?
> > > >
> > > > Marc Saegesser
> > > >
> > >
> >
> >
> >
> 
> 
> --
> To unsubscribe, e-mail:   
> <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: 
> <mailto:[EMAIL PROTECTED]>
> 

