Geoff Hutchison writes:
> At 5:28 AM -0500 12/14/98, Walter Hafner wrote:
> >1) The lack of support for German umlauts (äöüß)
>
> My suggestion would be to look at the locale option.
Oops, sorry. I stand corrected. Missed that one.
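So something like this in htdig.conf, I suppose (de_DE is just a guess on
my part; the exact name depends on which locales are installed on the
system):

    locale: de_DE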
> >2) The somewhat limited queries.
>
> I think you'll have to be more specific. I'd say we easily cover the 80/20
> rule. From my search logs, most people put in text.
I'd like to have real substring search and case sensitive search. And
while I'm dreaming, a regexp subset would be nice. :-)
Simple prefix search is sometimes just too restrictive. Imagine the words
"prefix-search" vs. "prefix search": in the indexing step you'll end up
with the database entries "prefixsearch" vs. "prefix" and "search",
depending on the valid_punctuation setting, of course.
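To illustrate the difference (just a sketch in C++, not a patch; the word
list and function names are made up, not htdig internals):

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical word database entries after indexing.
    static const std::vector<std::string> words = {
        "prefixsearch",   // from "prefix-search", punctuation stripped
        "prefix",         // from "prefix search"
        "search"
    };

    // Prefix match: roughly what a plain index lookup gives you today.
    bool prefix_match(const std::string &word, const std::string &query) {
        return word.compare(0, query.size(), query) == 0;
    }

    // Substring match: what I'd like to have in addition.
    bool substring_match(const std::string &word, const std::string &query) {
        return word.find(query) != std::string::npos;
    }

    int main() {
        const std::string query = "search";
        std::cout << std::boolalpha;
        for (const auto &w : words) {
            std::cout << w << ": prefix=" << prefix_match(w, query)
                      << " substring=" << substring_match(w, query) << "\n";
        }
    }

With the query "search", prefix matching only finds "search" itself, while
substring matching also finds "prefixsearch".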
> >3) The inability to distinguish virtual hosts from mere CNAMEs.
> >I think that ht://Dig could 'borrow' a simple yet clever method to solve
> >problem (3). As I wrote, I'm evaluating alternatives to ht://Dig. Currently
> >I'm having a look at Netscape's Compass Server. NCS offers a
> >"site probe". Here is a screen snippet:
>
> Actually NCS is being pretty naive in just using the size. The best way to
> detect exact duplicates is with a checksum (e.g. md5sum). Since it's pretty
> quick to generate a checksum, this isn't too slow.
I don't know what algorithm NCS uses. I just did a "site probe" and
noticed accesses for both names (actual and alias). I have no idea what
NCS does with this information. However, I think you're right in
suggesting MD5 checksums (I'm a FreeBSD admin, after all ... :-)
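For what it's worth, the comparison itself is cheap. Something along these
lines (a sketch only, using OpenSSL's MD5 routine; fetching the two root
documents is assumed to have happened already):

    #include <openssl/md5.h>
    #include <cstring>
    #include <string>

    // Compute the MD5 digest of a document body that has already been fetched.
    static void digest_of(const std::string &body,
                          unsigned char out[MD5_DIGEST_LENGTH]) {
        MD5(reinterpret_cast<const unsigned char *>(body.data()),
            body.size(), out);
    }

    // Two server names are probably the same machine if the root documents
    // they serve hash to the same value.
    bool probably_same_server(const std::string &root_doc_a,
                              const std::string &root_doc_b) {
        unsigned char a[MD5_DIGEST_LENGTH], b[MD5_DIGEST_LENGTH];
        digest_of(root_doc_a, a);
        digest_of(root_doc_b, b);
        return std::memcmp(a, b, MD5_DIGEST_LENGTH) == 0;
    }

A collision is possible in theory, but not something to lose sleep over for
this purpose.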
> Though checking the root documents for checksums to determine duplicate
> servers is an interesting idea, my personal approach would be to add in
> checksumming in general for HTTP transactions and detect duplicate documents
> no matter where they appear. There's a patch around to detect duplicate
> files based on inodes for filesystem digging, but I hesitate to add it
> before adding an HTTP version.
That would be great, of course. As I already wrote: I don't know C++,
but I imagine that holding checksums for ~130,000 URLs (in my case)
results in HUGE memory consumption. ht://Dig 3.1.0b2 already wants 120
MB on my machine. :-)
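Then again, back of the envelope: 16 bytes of MD5 digest per document times
~130,000 documents is only about 2 MB before container overhead, so the
checksums themselves would hardly be the problem. A sketch of how they might
be kept (the names are made up, not actual htdig code):

    #include <array>
    #include <map>
    #include <string>

    using Digest = std::array<unsigned char, 16>;   // MD5 is 128 bits

    // Map from document checksum to the first URL seen with that content.
    // ~130,000 entries * (16-byte key + URL + node overhead) stays in the
    // low tens of MB even with generous assumptions.
    std::map<Digest, std::string> seen;

    // Returns true if this document duplicates one already indexed.
    bool is_duplicate(const Digest &d, const std::string &url) {
        bool inserted = seen.emplace(d, url).second;
        return !inserted;   // already present -> duplicate
    }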
> We have lots of links on our website and it's annoying to see duplicates in
> search results. But the problem with duplicate detection is deciding which
> duplicate to use! My current thought is to use the document with the lower
> hopcount.
>
> Does this make sense?
As I wrote in another mail: why not use the lower hopcount, _unless_ the
name is explicitly stated in server_aliases?
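In code it would be something like this (a sketch; the structure and the
alias set are invented for illustration, not the actual htdig internals):

    #include <set>
    #include <string>

    struct Doc {
        std::string host;    // server-name part of the URL
        int hopcount;        // link distance from a start URL
    };

    // Hosts the admin named explicitly (e.g. via server_aliases);
    // how htdig stores this internally may differ.
    static const std::set<std::string> configured_names = {
        "www.tum.de"
    };

    // Pick which of two duplicate documents to keep in the index:
    // prefer a host the admin named explicitly, otherwise the lower hopcount.
    const Doc &pick_duplicate(const Doc &a, const Doc &b) {
        bool a_named = configured_names.count(a.host) != 0;
        bool b_named = configured_names.count(b.host) != 0;
        if (a_named != b_named)
            return a_named ? a : b;
        return (a.hopcount <= b.hopcount) ? a : b;
    }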
Regards,
-Walter
--
Walter Hafner_______________________________ [EMAIL PROTECTED]
<A href=http://www.tum.de/~hafner/>*CLICK*</A>
The best observation I can make is that the BSD Daemon logo
is _much_ cooler than that Penguin :-) (Donald Whiteside)