On Tue, Jul 06, 2004 at 02:21:00PM -0700, Doug Cutting wrote:
> Andy Hedges wrote:
> >I've just been looking at Jakarta Commons HttpClient
> >(http://jakarta.apache.org/commons/httpclient/) as I would like to
> >refactor HttpResponse to use it (as it will make the isTruncate flag
> >easier to implement for this fetcher).
>
> This is what Heritrix uses too.  Perhaps Nutch should switch to it.
> When I first wrote the fetcher I couldn't find an Http library that was
> robust enough, i.e., that implemented things like socket connect
> timeouts and content truncation.  So I wrote my own.  But if this one
> does the trick, I don't have a problem using it.
>
> >It handles headers really nicely
> >and perhaps we could take a leaf from their book? It basically
> >represents them using three classes: HeadMethod, Header, HeaderElement.
> >More information can be found here
> >http://jakarta.apache.org/commons/httpclient/apidocs/.
>
> I'm not sure that's the best API for generic metadata: it's pretty
> Http-specific, and it's also not very convenient to access.  I think a
> map that supports multiple values wouldn't lose any information, would
> be more generic, and would be simpler to use, no?
>
> >Firstly I would be interested in what the consensus is on using external
> >libraries (I know we already use a few) and secondly whether people
> >thought this is a sensible one to use - for me it saves a lot of
> >reinventing the wheel for the http handling.
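
To make the "map that supports multiple values" idea above concrete, here is a minimal sketch in plain Java. The class name MultiValueMetadata and all of its methods are hypothetical, not part of Nutch or HttpClient, and the case-insensitive name lookup is an assumption borrowed from how HTTP header names behave.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical protocol-neutral metadata holder: one name, many values. */
final class MultiValueMetadata {
    // Keys are stored lower-cased so lookups ignore case (an assumption,
    // modelled on HTTP header names). Insertion order of names is kept.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    private static String key(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    /** Appends a value; earlier values for the same name are kept. */
    public void add(String name, String value) {
        values.computeIfAbsent(key(name), k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value stored for the name, or null if there is none. */
    public String getFirst(String name) {
        List<String> list = values.get(key(name));
        return list == null ? null : list.get(0);
    }

    /** Returns all values for the name in insertion order (empty if absent). */
    public List<String> getAll(String name) {
        List<String> list = values.get(key(name));
        return list == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(list);
    }

    /** Returns the lower-cased names that have at least one value. */
    public Set<String> names() {
        return Collections.unmodifiableSet(values.keySet());
    }

    public static void main(String[] args) {
        MultiValueMetadata meta = new MultiValueMetadata();
        meta.add("Set-Cookie", "a=1");
        meta.add("Set-Cookie", "b=2");          // repeated header, nothing is lost
        meta.add("Content-Type", "text/html");
        System.out.println(meta.getAll("set-cookie"));
        System.out.println(meta.getFirst("Content-Type"));
    }
}
```

Nothing here is Http-specific, so a file:// or ftp:// protocol could fill the same structure, and a flag such as isTruncate could be stored as just another name/value pair.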

Handling of metadata, including http headers, should be common to all
clients: http://, ftp://, file://, etc. The isTruncate flag is only one
piece of information in the metadata. We can borrow the better ideas from
httpclient (or even "steal" code segments).

>
> I don't have a problem using external libraries.  This one, in particular
> looks very promising.  For example, they appear to support connect timeouts:
>
> http://jakarta.apache.org/commons/httpclient/apidocs/org/apache/commons/httpclient/HttpConnection.html#setConnectionTimeout(int)
>
> So please feel free to contribute an Http protocol implementation which
> uses this library.  If it is at least as robust as what we have, then we
> should probably use it as our default http implementation.

It is also time to think about some features that people have expressed
interest in before, such as handling pages with frames and JavaScript.
Parsing them needs support from the client.

>
> On a related note, I've been thinking that the host delay logic
> (blockAddr() in Http.java) should probably be moved to Fetcher.java, as
> this is not unique to Http.  Does that make sense to others?

I will take care of that.

John

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
