Okay, I looked at the code in the protocol-http plugin.
I remember looking at this about a year ago. RFC 2616
(HTTP/1.1) does say, as Jerome pointed out:

"A server MUST NOT send transfer-codings to an
HTTP/1.0 client."

Even so, I can attest that there are servers out
there that return chunked content regardless of the
client's HTTP version.

We had a socket implementation akin to the
HttpResponse.java in the protocol-http plugin and were
stumped on how to identify whether the response was
chunked or not, as we could not reliably use the
Transfer-Encoding header. The only way we could see
was to check for the initial hex characters denoting
the size of the first chunk.

"The chunk-size field is a string of hex digits
indicating the size of the chunk. The chunked encoding
is ended by any chunk whose size is zero, followed by
the trailer, which is terminated by an empty line." -
more from RFC 2616
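
A minimal sketch of that approach, for illustration only. This is
not the Nutch code or our old implementation; the class and method
names (ChunkedSketch, looksLikeChunkSize, decodeChunked) are made up
for this example. It sniffs whether a line looks like a chunk-size
line and, if so, decodes the body per the RFC text above. Note that
an unchunked body whose first line happens to be something like
"Cafe" passes the same check.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Nutch or HttpClient.
class ChunkedSketch {

    /** Reads one line ending in LF, dropping CR bytes; returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) return null;
        StringBuilder sb = new StringBuilder();
        while (b != -1 && b != '\n') {
            if (b != '\r') sb.append((char) b);
            b = in.read();
        }
        return sb.toString();
    }

    /** Returns the hex size field of a chunk-size line, without any ";extension". */
    static String sizeField(String line) {
        int semi = line.indexOf(';');
        return (semi >= 0 ? line.substring(0, semi) : line).trim();
    }

    /** Heuristic only: true if the line consists of 1 to 7 hex digits (plus optional extension). */
    static boolean looksLikeChunkSize(String line) {
        if (line == null) return false;
        String size = sizeField(line);
        // Cap at 7 digits (an assumption of this sketch) so parsing cannot overflow an int.
        if (size.isEmpty() || size.length() > 7) return false;
        for (int i = 0; i < size.length(); i++) {
            if (Character.digit(size.charAt(i), 16) < 0) return false;
        }
        return true;
    }

    /** Decodes a chunked body: size line, data, CRLF, repeated until a zero-size chunk. */
    static byte[] decodeChunked(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String line = readLine(in);
            if (!looksLikeChunkSize(line)) {
                throw new IOException("Not a chunk-size line: " + line);
            }
            int size = Integer.parseInt(sizeField(line), 16);
            if (size == 0) break; // last chunk
            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(chunk, off, size - off);
                if (n == -1) throw new IOException("Stream ended inside a chunk");
                off += n;
            }
            body.write(chunk, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        // Skip the trailer: header lines up to and including the empty line.
        String trailer;
        while ((trailer = readLine(in)) != null && !trailer.isEmpty()) {
            // ignored in this sketch
        }
        return body.toByteArray();
    }
}
```

For example, feeding decodeChunked the bytes of
"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" yields the body "Wikipedia".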

But in practice this was error prone. Switching over
to Apache HttpClient eliminated this problem, as it
transparently handles chunked and un-chunked content.
But HttpClient is much more heavyweight, so the
conversion could only be done after implementing some
basic resource pooling on the primary HttpClient
object.

It does look like this would be a serious refactoring
job, as Nutch uses java.net classes throughout. On the
other hand, it might simplify some areas of the Nutch
protocol classes, and HttpClient does have some
interesting built-in support for multi-threading and
performance tuning of requests.

I hope this helps towards a solution.

Best Regards,

Chris

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Chris Fellows wrote:
> > Just remembered, got around it by using HTTPClient
> > which handles reading the response (chunked or not)
> > transparently. Haven't looked at the nutch code, but
> > if we were to use HTTPClient 3.0.x or later, should
> > take care of it.
> >
> >   
> 
> Take a look at protocol-httpclient. This discussion
> is on whether/how to fix protocol-http. The other
> plugin already supports this.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 
