Re: htdig: virtual hosts revisited

1998-12-15 Thread Gilles Detillieux
According to Geoff Hutchison: > As far as using the HEAD for the checksum, my point is that most documents > we already GET, so we don't save any bandwidth. I'm also not completely > sure that there's enough to ensure the checksums are unique. (This is why > I'd want to test the feature very thoro

Re: htdig: virtual hosts revisited

1998-12-15 Thread Geoff Hutchison
At 2:56 AM -0500 12/15/98, Walter Hafner wrote: >And a second thought on document checksums: Quite often I see root >documents more than 100 kb in size. Why not just compute a checksum of >the header? An HTTP 1.1 'HEAD' command would be sufficient and imho this >would save a lot of bandwidth and

Re: htdig: virtual hosts revisited

1998-12-15 Thread Geoff Hutchison
At 8:00 AM -0500 12/15/98, John Grohol PsyD wrote: >How about a file_aliases option? For instance, on our server, >the index.html file is nearly always a symbolic link to the >actual file, which is named something different. If I could >put "index.html" into a file_aliases option, I would solve a

Re: htdig: virtual hosts revisited

1998-12-15 Thread Geoff Hutchison
At 5:28 AM -0500 12/15/98, Walter Hafner wrote: >I'd like to have real substring search and case sensitive search. And >while I'm dreaming, a regexp subset would be nice. :-) There's already a substring search. As for case_sensitive, it's an idea, but currently all words are stored as lowercase,

Re: htdig: virtual hosts revisited

1998-12-15 Thread John Grohol PsyD
Geoff Hutchison writes: > > We have lots of links on our website and it's annoying to see duplicates in > > search results. But the problem with duplicate detection is deciding which > > duplicate to use! My current thought is to use the document with the lower > > hopcount. Walter Hafner replie
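
The "lower hopcount" rule quoted above can be illustrated with a short Python sketch. This is not ht://Dig code: the function name `pick_canonical` and the `(url, hopcount)` tuple layout are assumptions made only for this example, and ties simply fall to whichever entry is listed first.

```python
def pick_canonical(duplicates):
    """Return the URL with the lowest hopcount from a list of duplicates.

    `duplicates` is a non-empty list of (url, hopcount) tuples; this shape
    is hypothetical and chosen only for illustration.
    """
    # min() keeps the first entry on ties, so input order breaks ties.
    url, _hopcount = min(duplicates, key=lambda entry: entry[1])
    return url


if __name__ == "__main__":
    copies = [("http://www.example.org/doc.html", 3),
              ("http://alias.example.org/doc.html", 1)]
    print(pick_canonical(copies))
```

A real indexer would probably also need a policy for equal hopcounts, which the thread excerpt does not settle.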

Re: htdig: virtual hosts revisited

1998-12-15 Thread Walter Hafner
Geoff Hutchison writes: > At 5:28 AM -0500 12/14/98, Walter Hafner wrote: > >1) The lack of support for German umlauts (äöüß) > > My suggestion would be to look at the locale option. Oops, sorry. I stand corrected. Missed that one. > >2) The somewhat limited queries. > > I think you'll

Re: htdig: virtual hosts revisited

1998-12-15 Thread Walter Hafner
Webmaster writes: > Geoff Hutchison writes: > >We have lots of links on our website and it's annoying to see duplicates in > >search results. But the problem with duplicate detection is deciding which > >duplicate to use! My current thought is to use the document with the lower > >hopcount.

Re: htdig: virtual hosts revisited

1998-12-14 Thread Webmaster
Earlier part of message deleted for brevity's sake >Actually NCS is being pretty naive in just using the size. The best way to >detect exact duplicates is with a checksum (e.g. md5sum). Since it's pretty >quick to generate a checksum, this isn't too slow. > >Though checking the root documents for
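
As a rough illustration of the checksum idea quoted above, the following Python sketch groups documents whose bodies hash to the same MD5 digest. It is only a sketch under stated assumptions: the function name `find_exact_duplicates` and the in-memory `{url: body_bytes}` mapping are hypothetical, and nothing here reflects how ht://Dig itself stores or compares documents.

```python
import hashlib


def find_exact_duplicates(docs):
    """Group URLs whose document bodies have identical MD5 digests.

    `docs` maps URL -> body bytes (a hypothetical in-memory layout).
    Returns a list of URL lists, one per group with more than one member.
    """
    by_digest = {}
    for url, body in docs.items():
        digest = hashlib.md5(body).hexdigest()
        by_digest.setdefault(digest, []).append(url)
    # Only digests shared by several URLs indicate duplicates.
    return [urls for urls in by_digest.values() if len(urls) > 1]


if __name__ == "__main__":
    docs = {
        "http://www.example.org/": b"<html>same page</html>",
        "http://alias.example.org/": b"<html>same page</html>",
        "http://other.example.org/": b"<html>diff page</html>",
    }
    print(find_exact_duplicates(docs))
```

Note that the third document in the usage example has the same length as the other two but different content, so a size-only comparison would wrongly treat it as a duplicate while the checksum does not.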

Re: htdig: virtual hosts revisited

1998-12-14 Thread Geoff Hutchison
At 5:28 AM -0500 12/14/98, Walter Hafner wrote: >1) The lack of support for German umlauts (äöüß) My suggestion would be to look at the locale option. >2) The somewhat limited queries. I think you'll have to be more specific. I'd say we easily cover the 80/20 rule. From my search logs, most peo

htdig: virtual hosts revisited

1998-12-14 Thread Walter Hafner
Hi! I'm in the process of evaluating Webcrawler software for full-text indexing purposes. Currently we use ht://Dig 3.1.0b2 for indexing the whole *.tu-muenchen.de domain. The domain consists of ~300 WWW servers that answer to ~540 names (virtual hosts _and_ server aliases). All in all there are