On Tue, 28 Mar 2006 15:20:46 -0500 Ryan Bach <[EMAIL PROTECTED]> wrote:

> I work for the ITS department at my university, and we are trying to
> find a way to gather information regarding the number of files of
> various types that are found after a full crawl of our domain name.

Do you want the type of the file as stored, or the type of the file as
delivered?  For example `index.php' is stored as "php" but delivered
as "text/html", `weather_graph.pl' is stored as "perl script", but
delivered as "image/gif".


> htdig -t followed by a simple script to gather info from the document
> database seems like it should do the trick; 


Hmmm,  I'd do it with:
  wget  -rS  -o results.log  -O /dev/null   http://www.uni.edu

and then the _delivered_ file types are on the lines labeled
"Content-Type" in results.log.  This does mean downloading the entire
contents of the site, though, so if that were a problem I'd get one of
the spider modules for Python (or Perl if you prefer it) and use it to
fetch the headers only.
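
Counting the "Content-Type" lines in results.log could be sketched
roughly as below.  This is only an illustration: it assumes the log
holds the server headers that wget -S prints, one header per line, and
the function name tally_content_types is made up for the example.
Redirects and error responses carry a Content-Type header too, so they
are counted along with real documents unless filtered out separately.

```python
import re
from collections import Counter

# Matches a header line such as "  Content-Type: text/html; charset=UTF-8"
# and captures only the media type, dropping parameters like the charset.
CONTENT_TYPE = re.compile(r"^\s*Content-Type:\s*([^;\s]+)", re.IGNORECASE)


def tally_content_types(lines):
    """Count the media types found on Content-Type header lines.

    `lines` is any iterable of strings, for example an open log file.
    The function name and the log layout are assumptions of this sketch.
    """
    counts = Counter()
    for line in lines:
        match = CONTENT_TYPE.match(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Possible usage, once results.log exists:
#
#     with open("results.log", errors="replace") as log:
#         for mime_type, count in tally_content_types(log).most_common():
#             print(count, mime_type)
```

Media types are lower-cased before counting so that "TEXT/HTML" and
"text/html" end up in the same bucket.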

Getting the _stored_ types requires access to the real filesystem on
disk, of course, and you've got to figure out how to deal with
HTTP-server address rewriting and so on.
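
For the stored side, a minimal sketch might walk the document root and
count files by extension.  It assumes the extension is a good enough
stand-in for the stored type and that the caller already knows the
server's document root; the function name tally_stored_types is
hypothetical.  It makes no attempt to handle URL rewriting, aliases, or
content that lives outside that directory.

```python
import os
from collections import Counter


def tally_stored_types(document_root):
    """Count files under document_root, grouped by lower-cased extension.

    Files without an extension are grouped under "(none)".  Using the
    extension as the "stored type" is an assumption of this sketch.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(document_root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            counts[extension or "(none)"] += 1
    return counts


# Possible usage (the path is a placeholder, not a known location):
#
#     for extension, count in tally_stored_types("/path/to/docroot").most_common():
#         print(count, extension)
```

Comparing this tally with the Content-Type counts shows where the
stored and delivered types diverge, as with `index.php' above.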



Mike
-- 
Mike Causer                          Email - mailto:[EMAIL PROTECTED]
GPG KeyID 1C2DDA07                       WWW - http://www.mikecauser.com
Flood the fen again! - Wicken Fen enlargement - http://www.wicken.org.uk

