Hi Gregory,

On Fri, 16 Nov 2001, Gregory Kozlovsky wrote:

> Hello, Matt,
>
> You wrote:
> >Images are only retrieved as a side effect of indexing dynamic content
> >since it not known until the document headers are examined what the
> >content type is. By this time the url must have been stored into urlword
> >(via href discovery) yet it will not store the images content when index
> >processes the url, a side effect of how the indexer works.
>
> >It does not index IMG SRC tags directly.
>
> May be I misunderstand something. I thought that ASPSeek gets this URL from
> an IMG SRC tag. Is this wrong? Where else can the spider find it? Can you

Yes, this is wrong. The indexer exhibits only mild clairvoyance :)

If the image is generated by a script and linked as an HREF, then nothing
at the point where the indexer discovers the URL indicates that it is an
image. It can't reliably use the document suffix to guess the Content-Type.
As far as the indexer can foresee, it is a link to another document to be
indexed. The indexer cannot determine that it should not index the HREF
until it has fetched it, at which point it can examine the Content-Type
header. However, to get this far it must first add the URL to the urlword
table.

As an example, I could write a CGI or PHP or similar script which generates
a JPEG image. I might call this script http://my.host/genimage.php and link
it into my home page as

  <a href=http://my.host/genimage.php>Click here to see an image</a>

When the indexer arrives at http://my.host/home.php it has no way of knowing
that the content type of this HREF (http://my.host/genimage.php) will be
"image/jpeg" until it fetches it. So it adds it to the document queue by
way of adding it to the urlword table, and eventually retrieves it.

I'll correct one thing in my original post: the content is actually stored
at this point. Basically, at this stage in the process, the parser asks "is
this document's type of an indexable nature?" (i.e. does it have words in it
that can be indexed), to which the answer should hopefully be no.
If not, then a message is logged, whatever content was retrieved is stored,
and the URL status is updated accordingly. It is impossible for the URL to
match a query if it contains no words; to date at least, but I'm working on
changing that (href text).

No. IMG SRC tags are not followed.

Matt.

> please explain.
>
> Gregory Kozlovsky
>
> Project Manager for Information Systems             Tel: +41 (01) 632 63 70
> International Relations and Security Network (ISN)  Fax: +41 (01) 632 14 13
> Center for Security Studies and Conflict Research   Email: [EMAIL PROTECTED]
> Swiss Federal Institute of Technology (ETH)         http://www.isn.ch
> Leonhardshalde 21, ETH-Zentrum / LEH
> CH-8092 Zürich, Switzerland
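
The flow described above can be sketched roughly as follows. This is only an
illustrative Python sketch, not ASPSeek's actual code: the names
HrefCollector, FAKE_WEB, is_indexable and crawl are hypothetical, the "web"
is a hard-coded dictionary rather than real HTTP, the logo.gif URL is made
up for the demonstration, and treating only text/* types as indexable is an
assumption. It shows that a URL found via HREF has to be queued and fetched
before its Content-Type is known, while IMG SRC targets are never queued.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects A HREF targets only; IMG SRC attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical stand-in for the network: URL -> (Content-Type, body).
FAKE_WEB = {
    "http://my.host/home.php": (
        "text/html",
        '<a href="http://my.host/genimage.php">Click here to see an image</a>'
        '<img src="http://my.host/logo.gif">',
    ),
    "http://my.host/genimage.php": ("image/jpeg", "<binary>"),
    "http://my.host/logo.gif": ("image/gif", "<binary>"),
}


def is_indexable(content_type):
    # Assumption: only text types contain words that can be indexed.
    return content_type.startswith("text/")


def crawl(start_url):
    """Return a dict mapping each fetched URL to its final status."""
    queue = [start_url]   # plays the role of the document queue
    seen = {start_url}    # URLs already added to the queue
    status = {}
    while queue:
        url = queue.pop(0)
        # The Content-Type is only known once the URL has been fetched.
        content_type, body = FAKE_WEB[url]
        if not is_indexable(content_type):
            status[url] = "stored, not indexed (" + content_type + ")"
            continue
        status[url] = "indexed"
        collector = HrefCollector()
        collector.feed(body)
        for href in collector.hrefs:
            if href not in seen and href in FAKE_WEB:
                seen.add(href)
                queue.append(href)
    return status
```

Calling crawl("http://my.host/home.php") fetches genimage.php because it was
discovered as an HREF, and only then marks it as not indexable; logo.gif is
never fetched because it appears only in an IMG SRC attribute.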
