On Tue, 28 Mar 2006 15:20:46 -0500 Ryan Bach <[EMAIL PROTECTED]> wrote:
> I work for the ITS department at my university, and we are trying to
> find a way to gather information regarding the number of files of
> various types that are found after a full crawl of our domain name.

Do you want the type of the file as stored, or the type of the file as
delivered? For example `index.php' is stored as "php" but delivered as
"text/html", `weather_graph.pl' is stored as "perl script", but
delivered as "image/gif".

> htdig -t followed by a simple script to gather info from the document
> database seems like it should do the trick;

Hmmm, I'd do it with:

    wget -rS -o results.log -O /dev/null http://www.uni.edu

and then the _delivered_ file types are on the lines labeled
"Content-Type" in results.log.

This does mean downloading the entire contents though, so if that were
a problem I'd get one of the spider modules for Python (or Perl if you
prefer it), and use it to get the headers only.

To get the _stored_ types requires access to the real filesystem on
disk of course, and you've got to figure out how to deal with
http-server address rewriting & so on.

Mike

-- 
Mike Causer
Email - mailto:[EMAIL PROTECTED]    GPG KeyID 1C2DDA07
WWW   - http://www.mikecauser.com
Flood the fen again! - Wicken Fen enlargement - http://www.wicken.org.uk
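
As a rough illustration of the "simple script" step for the wget approach, here is a minimal Python sketch that tallies the delivered types from results.log. It assumes the log contains server response header lines of the form "Content-Type: text/html; charset=...", as wget -S writes them. The function names count_content_types and summarize_log are made up for this example and are not part of wget or ht://Dig.

```python
import re
from collections import Counter

# Matches header lines such as "  Content-Type: text/html; charset=UTF-8".
# Only the media type is captured; parameters after ";" are ignored.
CONTENT_TYPE_RE = re.compile(r"^\s*Content-Type:\s*([^;\s]+)", re.IGNORECASE)


def count_content_types(lines):
    """Tally the media type found on each Content-Type header line."""
    counts = Counter()
    for line in lines:
        match = CONTENT_TYPE_RE.match(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def summarize_log(path="results.log"):
    """Print one line per media type, most frequent first.

    The default path is the log name used in the wget command above.
    """
    with open(path, errors="replace") as log:
        for media_type, n in count_content_types(log).most_common():
            print(f"{n:8d}  {media_type}")
```

Calling summarize_log() in the directory holding results.log prints the totals. Every response header in the log is counted, so a URL that redirects or is fetched twice contributes more than once; treat the numbers as approximate unless the log is deduplicated first.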

