On Fri, 2011-04-01 at 18:14 -0400, Chad La Joie wrote:
> Here you go:
> http://shibboleth.net/dumps.tgz
> 
> I found a much smaller document than the one I was initially testing
> with.  It's off by one byte.
> 
> On 4/1/11 9:38 AM, Oleg Kalnichevski wrote:
> > On Fri, 2011-04-01 at 09:06 -0400, Chad La Joie wrote:
> >> I'm experiencing an odd problem with HttpClient 4.1.1.  I perform a GET
> >> on a document and then use the following code to get the bytes of the
> >> response entity (assuming a 200 status code):
> >>
> >> byte[] responseEntity = EntityUtils.toByteArray(response.getEntity());
> >>
> >> The problem I'm having is that this returns 16 fewer bytes than are
> >> actually in the document.  So far I've checked:
> >>  - that downloading the file via wget gives me the expected byte count
> >>  - that the downloaded content is not compressed
> >>
> >> The document itself has a digital signature over it and this is failing
> >> to validate with the content as downloaded by HttpClient, but not with
> >> the document downloaded by wget so there is some material difference in
> >> the canonical form of the document (i.e., it's not just a lack of a new
> >> line at the start/end of the document).
> >>
> >> Any thoughts?  Is EntityUtils.toByteArray not the right method to use to
> >> get the complete byte[] of the response entity?
> >>
> >> Thanks.
> > 

This is a problem with content decoding. 

<< "HTTP/1.1 200 OK[\r][\n]"
<< "Date: Fri, 01 Apr 2011 14:14:28 GMT[\r][\n]"
<< "Server: Apache/2.2.3 (Unix) mod_ssl/2.2.3 OpenSSL/0.9.7d[\r][\n]"
<< "Last-Modified: Thu, 31 Mar 2011 17:00:05 GMT[\r][\n]"
<< "ETag: "4d0e-39932f40"[\r][\n]"
<< "Accept-Ranges: bytes[\r][\n]"
<< "Content-Length: 19726[\r][\n]"
<< "Keep-Alive: timeout=5, max=99[\r][\n]"
<< "Connection: Keep-Alive[\r][\n]"
<< "Content-Type: application/xml[\r][\n]"
<< "[\r][\n]"
<< "<?xml version="1.0" encoding="UTF-8"?><!--[\n]"

The response content is clearly UTF-8 encoded. However, the Content-Type
header does not specify a charset. Per the HTTP specification, if the
charset is not explicitly set in the Content-Type header, the content is
assumed to be ISO-8859-1.

Oleg  
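
The fallback described above can be illustrated with a minimal sketch that
uses only the JDK. It is not HttpClient code: the class name and the helpers
charsetOf and decode are hypothetical and written just for this example, and
the sample XML body is made up. It assumes the missing charset parameter is
what triggers the ISO-8859-1 default, and it shows how decoding UTF-8 bytes
with that default changes the character count and the size of the text once
it is re-encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetFallbackDemo {

    // Hypothetical helper: reads the charset parameter from a Content-Type
    // value and falls back to ISO-8859-1 when the parameter is absent.
    public static Charset charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Hypothetical helper: turns the raw entity bytes into text using
    // whatever charset the Content-Type value yields.
    public static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }

    public static void main(String[] args) {
        // Made-up body containing one non-ASCII character (e with acute accent).
        byte[] body = "<name>Jos\u00e9</name>".getBytes(StandardCharsets.UTF_8);

        String withDefault = decode(body, "application/xml");
        String withUtf8 = decode(body, "application/xml; charset=UTF-8");

        System.out.println("bytes on the wire:          " + body.length);
        System.out.println("chars decoded as UTF-8:     " + withUtf8.length());
        System.out.println("chars decoded with default: " + withDefault.length());
        System.out.println("bytes after re-encoding the mis-decoded text as UTF-8: "
                + withDefault.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The raw byte array is the same in both cases; only the step that turns bytes
into characters depends on the charset, so a difference would show up when
the text is decoded and possibly re-encoded, not in the array itself.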

