We had some interest in indexing a number of images and photographs at our site, many of which are online on the Web. I was wondering if I could use htDig to help do this.
I realized that htDig saves keywords from the link in the referring document, such as <a href="eagle.jpeg">picture of an eagle</a> or even <img src="osprey.jpg" alt="picture of an osprey">, but that these are normally discarded, since image/jpeg is not an indexable type. Without making any substantial changes to htDig, but by creating an external parser for image data types, I can retain these keywords and thus index images.

The first version of the parser read only a few kB from the beginning of the image file, enough to extract metadata such as image size, comments, etc., and returned text containing a magic word, "XFILE". It was then possible to search for images in a mostly text corpus using a string such as "xfile eagle".

The next version read the entire image (which in some cases required a doc_size of 1 MB or more) and created a thumbnail image stored locally under a unique name. When the context string is created with a link to the thumbnail, it is possible to put inline thumbnails in the search results in a similar way to Google or AltaVista image search. In order to do this I backed out some code that de-fangs HTML in the context text. It is not quite right, as there is a problem with bolding if text in the link itself (such as the filename) matches the search terms.

Example search at http://andrew.triumf.ca/htdig/search.html, e.g. "xfile chamber"
Scripts at http://andrew.triumf.ca/htdig/mods/

The basic idea (providing a script to index a non-text media type based on filename, metadata and link text) will work with an unmodified htdig, hence the cc. to htdig-general.

Andrew Daviel
TRIUMF
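For readers who want to try the idea, here is a minimal sketch of something like the first version of the parser described above: it reads only the leading bytes of a JPEG, pulls out the image size and any comment segments, and returns text that starts with the magic word. It is not the script from the mods/ directory. The function names, the 8 kB read size, and the argument order (input file, content type, URL, as I understand the ht://Dig external parser convention) are all assumptions, and the sketch simply prints plain text rather than claiming any particular output format that htdig expects.

```python
import struct
import sys

MAGIC_WORD = "XFILE"   # magic word mentioned in the post
HEADER_BYTES = 8192    # assumption: "a few kB" from the start of the file
# Start-of-frame markers carry the image size; 0xC4, 0xC8 and 0xCC are not SOF.
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_metadata(data):
    """Return (width, height, comments) found in the leading bytes of a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    width = height = None
    comments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break                      # lost sync with the marker structure
        marker = data[pos + 1]
        if marker == 0xFF:             # fill byte before a marker
            pos += 1
            continue
        if marker in (0xD9, 0xDA):     # end of image / start of scan data
            break
        if marker == 0x01 or 0xD0 <= marker <= 0xD8:
            pos += 2                   # standalone markers have no length field
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xFE:             # COM segment: free-text comment
            comments.append(payload.decode("latin-1").strip())
        elif marker in SOF_MARKERS and len(payload) >= 5:
            height, width = struct.unpack(">HH", payload[1:5])
        pos += 2 + length
    return width, height, comments


def describe(data, url):
    """Build the indexable text: magic word, filename, size, comments."""
    width, height, comments = jpeg_metadata(data)
    parts = [MAGIC_WORD, url.rsplit("/", 1)[-1]]
    if width and height:
        parts.append(f"{width}x{height}")
    parts.extend(comments)
    return " ".join(parts)


def main(argv):
    # Assumed argument order: input file, content type, URL (then config file).
    infile, _content_type, url = argv[1:4]
    with open(infile, "rb") as handle:
        data = handle.read(HEADER_BYTES)
    print(describe(data, url))


if __name__ == "__main__" and len(sys.argv) >= 4:
    main(sys.argv)
```

Because the text always begins with the magic word, a query such as "xfile eagle" only matches documents produced by this kind of parser, which is what separates images from ordinary pages in a mostly text index.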

