Thanks again, everyone. Your suggestions are enlightening.

This last HEAD example shows that the ETag is identical
for the two pages loaded. If this is usual behaviour for
HTTP 1.1 servers, I think the best solution will be to:
1) Compare URLs. If the host part does not start with "www.",
I will check my DB for the same host name _with_ "www."
(ex. softinnov.com --> www.softinnov.com).
2) If found, I will compare ETags.
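
A rough, untested sketch of these two steps, assuming the headers
are the strings returned by Tom's http-head. The names www-variant,
etag-of and same-document? are made up for illustration only. As far
as I know, Apache builds its default ETag from inode, size and
modification time, so equal ETags are a strong hint that both host
names serve the same file, not a proof.

```rebol
REBOL [Title: "www duplicate check (sketch)"]

; Hypothetical helper: return a copy of the URL with "www." put in
; front of the host, or none when the host already starts with "www."
; (or cannot be found). The host is taken as the third element of the
; split URL, as in Tom's functions.
www-variant: func [url [url!] /local parts host out] [
    parts: parse url "/"
    host: pick parts 3
    if any [none? host  empty? host  find/match host "www."] [return none]
    out: copy url
    insert find out host "www."
    out
]

; Hypothetical helper: pull the ETag value (quotes included, optional
; weak "W/" prefix) out of a block of response headers.
; Returns none when no ETag header is present.
etag-of: func [headers [string!] /local tag] [
    parse/all headers [
        thru "ETag:" any " "
        copy tag [opt "W/" {"} thru {"}]
        to end
    ]
    tag
]

; True only when both header blocks carry an ETag and the two values
; match exactly (strict-equal? is case-sensitive, equal? is not).
same-document?: func [headers-a [string!] headers-b [string!] /local a b] [
    a: etag-of headers-a
    b: etag-of headers-b
    to-logic all [a  b  strict-equal? a b]
]
```

With http-head loaded, the check for one stored URL would be along
the lines of:
same-document? http-head url http-head www-variant url
(after making sure www-variant did not return none).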

The reason I prefer not to use 'checksum is that mirrored
pages on different servers should, in my view, appear in
the result list, whereas the mere "www" difference should
make one of the two results _not_ appear within the search
results.

I hadn't even bothered to check whether the ETags were identical.
Thanks for showing me.
Hallvard



Dixit Tom Conlin <[EMAIL PROTECTED]> (Fri, 24 Oct 2003 12:20:34 -0700 (PDT)):
>
>
>one more time with HEAD
>
>On Fri, 24 Oct 2003, Hallvard Ystad wrote:
>
>>
>> Thanks both.
>>
>> But theoretically, these two URLs may very well not
>> represent the same document:
>> http://www.uio.no/
>> http://uio.no/
>> but still reside on the same server (same dns entry).
>>
>> So ...  Is it possible to _know_ whether or not these two
>> documents are the same without downloading their documents
>> and comparing them? (I really don't think so myself, but
>> someone might know something I don't.)
>>
>> I suddenly realize this has got very little to do with
>> Rebol. Sorry.
>>
>> Hallvard
>>
>> Dixit Tom Conlin <[EMAIL PROTECTED]> (Wed, 22 Oct 2003 10:00:08 -0700 (PDT)):
>> >
>> >On Wed, 22 Oct 2003, Hallvard Ystad wrote:
>> >
>> >>
>> >> Hi list
>> >>
>> >> My rebol stuff search engine now has more than 10000
>> >> entries, and works pretty fast thanks to DocKimbel's mysql
>> >> protocol.
>> >>
>> >> Here's a problem:
>> >> Some websites work both with and without the www prefix
>> >> (ex. www.rebol.com and just plain and simple rebol.com).
>> >> Sometimes this gives double records in my DB (ex.
>> >> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql :
>> >> you'll see that both http://www.softinnov.com/bdd.html and
>> >> http://softinnov.com/bdd.html appear).
>> >>
>> >> Is there a way to detect such behaviour on a server? Or do
>> >> I have to compare my incoming document to whatever
>> >> documents I already have in the DB that _might_ be the
>> >> same document?
>> >>
>> >> Thanks,
>> >> Hallvard
>> >>
>> >> Præterea censeo Carthaginem esse delendam
>> >> --
>> >> To unsubscribe from this list, just send an email to
>> >> [EMAIL PROTECTED] with unsubscribe as the subject.
>> >>
>> >
>> >Hi Hallvard
>> >
>> >I ran into different reasons for finding more than one URL
>> >for a page (URLs expressed as relative links) and wrote a QAD
>> >function that served my purpose at the time.
>> >
>> >I just added Anton's suggestion; maybe it will serve.
>> >
>> >
>> >do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r
>> >
>> >canotical-url: func[ url /local t p q][
>> >    replace/all url "\" "/"
>> >    t: parse url "/"
>> >    ; drop each ".." segment together with the segment before it
>> >    while [p: find t ".."][remove remove back p]
>> >    ; drop "." segments
>> >    while [p: find t "."][remove p]
>> >    ; remove empty segments, keeping only the first one (after "http:")
>> >    p: find t ""
>> >    while [p <> q: find/last t ""][remove q]
>> >
>> >    ;;; this is untested
>> >    ;;; using Anton's suggestion
>> >
>> >    if not find t/3 "www."[
>> >        if equal? read join dns:// t/3 read join dns://www. t/3
>> >        [insert t/3  "www."]
>> >    ]
>> >
>> >    for i 1 (length? t) - 1 1[append t/:i "/"]
>> >    to-url url-encode/re rejoin t
>> >]
>> >
>>
>> Præterea censeo Carthaginem esse delendam
>>
>
>
>
>http-head: func[url [url!] /local port result][
>    port: open compose[
>        scheme: 'tcp
>        host: (first skip parse url "/" 2)
>        port-id: 80
>        timeout: 5
>    ]
>    insert port rejoin["HEAD " url " HTTP/1.0^/^/"]
>    wait port
>    result: copy port
>    close port
>    result
>]
>
>
>
>>> print http-head  http://www.softinnov.com/bdd.html
>HTTP/1.1 200 OK
>Date: Fri, 24 Oct 2003 19:14:38 GMT
>Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
>Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
>ETag: "39808c-168e-3f2a8ac7"
>Accept-Ranges: bytes
>Content-Length: 5774
>Connection: close
>Content-Type: text/html
>
>
>>> print http-head  http://softinnov.com/bdd.html
>HTTP/1.1 200 OK
>Date: Fri, 24 Oct 2003 19:14:46 GMT
>Server: Apache/1.3.27 (Unix) mod_gzip/1.3.19.1a PHP/4.2.3 mod_ssl/2.8.11 OpenSSL/0.9.6c
>Last-Modified: Fri, 01 Aug 2003 15:44:07 GMT
>ETag: "39808c-168e-3f2a8ac7"
>Accept-Ranges: bytes
>Content-Length: 5774
>Connection: close
>Content-Type: text/html
>
>

Præterea censeo Carthaginem esse delendam