On Tue, Jul 06, 2004 at 02:21:00PM -0700, Doug Cutting wrote:
> Andy Hedges wrote:
> >I have just been looking at Jakarta Commons HttpClient 
> >(http://jakarta.apache.org/commons/httpclient/) as I would like to 
> >refactor HttpResponse to use it (as it will make the isTruncate flag 
> >easier to implement for this fetcher).
> 
> This is what Heritrix uses too.  Perhaps Nutch should switch to it. 
> When I first wrote the fetcher I couldn't find an Http library that was 
> robust enough, i.e., that implemented things like socket connect 
> timeouts and content truncation.  So I wrote my own.  But if this one 
> does the trick, I don't have a problem using it.
> 
> >It handles headers really nicely 
> >and perhaps we could take a leaf from their book? It basically 
> >represents them using three classes: HeadMethod, Header, HeaderElement. 
> >More information can be found here 
> >http://jakarta.apache.org/commons/httpclient/apidocs/.
> 
> I'm not sure that's the best API for generic metadata: it's pretty 
> Http-specific, and it's also not very convenient to access.  I think a 
> map that supports multiple values wouldn't lose any information, would 
> be more generic, and would be simpler to use, no?
> 
> >Firstly I would be interested in what the consensus is on using external 
> >libraries (I know we already use a few) and secondly whether people 
> >think this is a sensible one to use - for me it saves a lot of 
> >reinventing the wheel for the http handling.

Handling of metadata, including http headers, should be common to
all clients: http://, ftp://, file://, etc.
The isTruncate flag is only one piece of information in the metadata.
We can borrow the better ideas from HttpClient (or even "steal" code segments).
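
To make the multi-valued map idea concrete, here is a minimal sketch of a protocol-neutral metadata holder. The class name MultiValueMetadata and its methods are hypothetical and are not existing Nutch or HttpClient code; lower-casing the names is my assumption, chosen because HTTP header names are case-insensitive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical metadata holder shared by http://, ftp://, file://, etc. */
class MultiValueMetadata {
    // Insertion-ordered so entries come back in the order they were added.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Assumption: names are matched case-insensitively, as HTTP headers are.
    private static String key(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    /** Adds a value without discarding earlier values stored under the same name. */
    void add(String name, String value) {
        values.computeIfAbsent(key(name), k -> new ArrayList<>()).add(value);
    }

    /** Returns every value stored under the name, or an empty list if there are none. */
    List<String> getAll(String name) {
        List<String> found = values.get(key(name));
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    /** Convenience accessor for the common single-valued case; null if absent. */
    String getFirst(String name) {
        List<String> found = getAll(name);
        return found.isEmpty() ? null : found.get(0);
    }
}
```

With something like this, a truncation flag would be just another entry (for example add("truncated", "true")), and repeated headers such as Set-Cookie keep all of their values.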

> 
> I don't have a problem using external libraries.  This one, in particular 
> looks very promising.  For example, they appear to support connect timeouts:
> 
> http://jakarta.apache.org/commons/httpclient/apidocs/org/apache/commons/httpclient/HttpConnection.html#setConnectionTimeout(int)
> 
> So please feel free to contribute an Http protocol implementation which 
> uses this library.  If it is at least as robust as what we have, then we 
> should probably use it as our default http implementation.

This is also a good time to think about some features that
people have expressed interest in before, such as handling
pages with frames and JavaScript. Parsing them needs support
from the client.

> 
> On a related note, I've been thinking that the host delay logic 
> (blockAddr() in Http.java) should probably be moved to Fetcher.java, as 
> this is not unique to Http.  Does that make sense to others?

I will take care of that.
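
As a rough illustration of what a protocol-independent host delay could look like once it lives in the fetcher, here is a small sketch. The class HostDelay and its reserve() method are hypothetical names, not the existing blockAddr() code, and keying on a host string (rather than a resolved address) is an assumption made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host politeness tracker, independent of the protocol used. */
class HostDelay {
    private final long delayMillis;
    // Earliest time (in milliseconds) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostDelay(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Reserves the next request slot for the host and returns how many
     * milliseconds the caller should wait before using it. The current
     * time is passed in so the logic can be exercised without sleeping.
     */
    synchronized long reserve(String host, long nowMillis) {
        Long next = nextAllowed.get(host);
        // If the host is new or its slot has already passed, start now.
        long start = (next == null || next < nowMillis) ? nowMillis : next;
        nextAllowed.put(host, start + delayMillis);
        return start - nowMillis;
    }
}
```

A fetcher thread would call reserve(host, System.currentTimeMillis()), sleep for the returned number of milliseconds, and then hand the URL to whichever protocol implementation applies.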

John


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
