At 7:50 AM -0400 4/16/99, Gabriele Bartolini wrote:
>Tell me if I am wrong. Is it possible to use the ETag response-header to
>"identify" a document on the WEB? I've made some tests. If I ask the URL
>"http://www.comune.prato.it" and then
>"http://www.comune.prato.it/home.htm", it gives me the same ETag value as
>response. Has it any meaning, for you?

You're calling me on all of the projects I've planned! :-) Good for you for
noticing the same thing. Unfortunately the standard doesn't agree with us.
:-(

   The use of the same entity tag value in conjunction with entities
   obtained by requests on different URIs does not imply the equivalence
   of those entities.

http://www.w3.org/Protocols/HTTP/1.1/draft-ietf-http-v11-spec-rev-06.txt

>Why don't we store it and use it to compare 2 docs? This would permit to
>store the same document only once.
..
>Another way to avoid storing more than once the same document, could be to
>compare the size and the modification date of the docs.

Yes, this is my idea for all the people worrying about duplicate documents.
It's the best way to detect that documents are identical. Store some
sort of checksum, either of the document itself or of its header information
(such as size and modification date). Then do a lookup before parsing a new
document.
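
To make the checksum idea concrete, here is a minimal sketch in Python (not
ht://Dig code). The class name DuplicateIndex, the method check_and_add, and
the choice of SHA-256 are all assumptions for illustration; any stable digest
of the document body would do, and a real indexer would keep the table in its
database rather than in memory.

```python
import hashlib


class DuplicateIndex:
    """Hypothetical lookup table: content checksum -> first URL seen with it."""

    def __init__(self):
        self._seen = {}

    @staticmethod
    def checksum(body: bytes) -> str:
        # Assumption: digest of the raw document body. A digest of header
        # fields (size, modification date) could be swapped in here.
        return hashlib.sha256(body).hexdigest()

    def check_and_add(self, url: str, body: bytes):
        """Return the URL already stored for identical content, else None.

        When the content is new, it is recorded under `url` so that later
        documents can be compared against it before they are parsed.
        """
        key = self.checksum(body)
        original = self._seen.get(key)
        if original is None:
            self._seen[key] = url
        return original
```

The lookup happens before parsing: if check_and_add returns a URL, the new
document is a byte-for-byte duplicate of one already indexed and can be
skipped or recorded as a mirror.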

This gets a little more complex when trying to identify identical
*servers*. Obviously we'd prefer not to index a server twice since we'd
just throw out all the documents! But what if the first few documents we
index on a server are duplicates? Some people have suggested you use the
robots.txt file, but not all servers have them. I'm still not sure if
there's a good solution.

>Probably, if it was possible, you would have already adopted this solution
>!!! But, who knows ...

Too many ideas, not enough time. :-( The other problem is what to do with
duplicate documents. One idea is to throw away the duplicates. The other is
to store all the URLs so people know the document is 'mirrored.' If you
throw them away, I would suggest that we pick the URL with the lowest
hop_count.
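
The two options can also be combined. The sketch below (Python, purely
illustrative) keeps the URL with the lowest hop_count as the primary entry
and remembers the others as mirrors. DocumentRecord and register are
hypothetical names, and the checksum is assumed to be computed elsewhere;
only hop_count comes from the discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """Hypothetical record for one distinct document."""
    url: str                 # primary URL (lowest hop_count seen so far)
    hop_count: int
    mirrors: list = field(default_factory=list)  # other URLs with same content


def register(records: dict, checksum: str, url: str, hop_count: int) -> DocumentRecord:
    """Record `url` under `checksum`, keeping the lowest-hop URL as primary."""
    rec = records.get(checksum)
    if rec is None:
        # First time this content is seen.
        records[checksum] = DocumentRecord(url, hop_count)
    elif hop_count < rec.hop_count:
        # New URL is closer to the start URL: demote the old primary.
        rec.mirrors.append(rec.url)
        rec.url, rec.hop_count = url, hop_count
    else:
        # Existing primary stays; remember the duplicate as a mirror.
        rec.mirrors.append(url)
    return records[checksum]
```

If the duplicates are to be thrown away instead, the mirrors list can simply
be dropped; the choice of primary URL stays the same.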

-Geoff

