[aseek-users] Re: ASPSeek

Matt Sullivan Fri, 16 Nov 2001 03:39:24 -0800

Hi Gregory,

On Fri, 16 Nov 2001, Gregory Kozlovsky wrote:

> Hello, Matt,
> 
> You wrote:
> >Images are only retrieved as a side effect of indexing dynamic content, since
> >it is not known until the document headers are examined what the content
> >type is. By this time the url must have been stored into urlword (via href
> >discovery), yet it will not store the image's content when the indexer
> >processes the url, a side effect of how the indexer works.
> 
> >It does not index IMG SRC tags directly. 
> 
> Maybe I misunderstand something. I thought that ASPSeek gets this URL from
> an IMG SRC tag. Is this wrong? Where else can the spider find it? Can you

Yes, this is wrong.  The indexer exhibits only mild clairvoyance :)  If the
image is generated by a script and linked via an HREF, then there is nothing at
the point where the indexer discovers the URL to indicate that it is an image.
It can't reliably use the document suffix to guess the Content-Type.  As far as
the indexer can tell, it is a link to another document to be indexed.  It
cannot determine that it should not index the HREF target until it has fetched
it, at which point it can examine the Content-Type header.  However, to get
this far it must first add the URL to the urlword table.

As an example, I could write a CGI, PHP or similar script which generates a
JPEG image.  I might call this script http://my.host/genimage.php and link it
into my home page as <a href=http://my.host/genimage.php>Click here to see an
image</a>.  When the indexer arrives at http://my.host/home.php it has no way
of knowing that the content type of this HREF (http://my.host/genimage.php)
will be "image/jpeg" until it fetches it.  So it adds it to the document queue
by way of adding it to the urlword table, and eventually retrieves it.
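
To make the discovery step concrete, here is a minimal Python sketch.  It is
not ASPSeek code; the names HrefCollector and discover_links are made up for
illustration.  It only shows the behaviour described above: link targets are
taken from HREFs, IMG SRC is ignored, and nothing at this stage says what
content type a discovered URL will turn out to have.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect link targets from <a href=...> only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # <img src=...> and other tags are not followed here
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def discover_links(html):
    """Return the HREF targets found in a page, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Fed the home page from the example, this would return the genimage.php URL
like any other link; the ".php" suffix gives no hint that the response will be
a JPEG.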

I'll correct one thing in my original post.  The content is actually stored at
this point.  Basically, at this stage in the process, the parser asks "is this
document's type of an indexable nature?" (i.e. does it have words in it that
can be indexed), to which the answer should hopefully be no.  If it is not
indexable, a message is logged, whatever content was retrieved is stored, and
the URL status is updated accordingly.
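
Roughly, the order of events looks like the Python sketch below.  Again this is
an illustration, not the indexer's actual code: the function names, the status
strings, the set of indexable types, and the use of a plain dict in place of
the urlword table are all assumptions.

```python
# Assumed set of types that contain indexable words; the real list may differ.
INDEXABLE_TYPES = {"text/html", "text/plain"}


def is_indexable(content_type_header):
    """True if the Content-Type header names a type with indexable words."""
    # Drop parameters such as "; charset=iso-8859-1" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in INDEXABLE_TYPES


def process(url, fetch, urlword, log):
    """Record the URL first, then fetch it, then decide from the headers."""
    urlword.setdefault(url, "queued")   # URL is stored before its type is known
    headers, body = fetch(url)          # only now is the Content-Type visible
    if is_indexable(headers.get("Content-Type", "")):
        urlword[url] = "indexed"
    else:
        log.append("not indexable: " + url)
        urlword[url] = "stored, not indexed"  # content kept, no words indexed
    return urlword[url]
```

Here fetch is any callable returning a (headers, body) pair, so the sketch can
be tried without touching the network.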

To date at least, it is impossible for the URL to match a query if its content
contains no words, but I'm working on changing that (href text).

No.  IMG SRC tags are not followed. 

Matt.

> please explain.
> 
>         Gregory Kozlovsky
> 
> Project Manager for Information Systems               Tel: +41 (01) 632 63 70
> International Relations and Security Network (ISN)    Fax: +41 (01) 632 14 13
> Center for Security Studies and Conflict Research     Email: [EMAIL PROTECTED]
> Swiss Federal Institute of Technology (ETH)           http://www.isn.ch
> Leonhardshalde 21, ETH-Zentrum / LEH
> CH-8092 Zürich, Switzerland
> 
>
