Hi!

I'm in the process of evaluating web crawler software for full-text
indexing purposes.

Currently we use ht://Dig 3.1.0b2 to index the whole
*.tu-muenchen.de domain. The domain consists of ~300 WWW servers that
answer to ~540 names (virtual hosts _and_ server aliases). All in all
there are ~130,000 documents to index. You can have a look at
http://tum-index.ze.tu-muenchen.de/ (German).

This amount of data shows the limits of ht://Dig. :-)

While I'm quite happy with ht://Dig in general, there are a few things
that annoy me:

1) The lack of support for German umlauts (äöüß)
2) The somewhat limited queries.
3) The inability to distinguish virtual hosts from mere CNAMEs.

Unfortunately I don't know C++ at all, so I can't supply patches. If the
code were in C ...

I think that ht://Dig could 'borrow' a simple yet clever method to solve
problem (3). As I wrote, I am evaluating alternatives to ht://Dig, and
I am currently looking at Netscape's Compass Server (NCS). NCS offers a
"site probe". Here is a screen snippet:


------------------------------------------------------------
Site: http://gi.vo.tum.de:80/

[x]  Show advanced DNS information 


Checking URL: Doing DNS lookup.... 

GetHostByName() results for 'gi.vo.tum.de':
h_error: 0 - successful DNS query
Name:    w3proj1.ze.tu-muenchen.de
aliases: gi.vo.tum.de
addrtype:2
length:  4
ip:      129.187.102.4

Result: gi.vo.tum.de is a valid name. 
Note: gi.vo.tum.de appears to be an alias for the machine named 
w3proj1.ze.tu-muenchen.de . 

Checking URL for Redirect: Trying to Connect to Site... 
Result: No Server redirect detected at http://gi.vo.tum.de:80/ 

Checking host for Virtual Server: Trying to Connect to Site... 
Result: http://gi.vo.tum.de:80/ is really a virtual server being hosted on the server
w3proj1.ze.tu-muenchen.de. 
------------------------------------------------------------
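
The DNS part of this probe is easy to reproduce. Here is a minimal
Python sketch, assuming the standard library's
socket.gethostbyname_ex(); the function name probe_dns and the
"is_alias" field are my own illustration, not anything NCS exposes:

```python
import socket


def probe_dns(name, resolver=socket.gethostbyname_ex):
    """Resolve a host name and report whether it is an alias.

    `resolver` defaults to socket.gethostbyname_ex, which returns
    (canonical_name, alias_list, ip_address_list). It is a parameter
    only so the function can be exercised without network access.
    """
    canonical, aliases, addresses = resolver(name)

    def normalize(host):
        # DNS names are case-insensitive; a trailing dot is optional.
        return host.lower().rstrip(".")

    return {
        "queried": name,
        "canonical": canonical,
        "aliases": aliases,
        "addresses": addresses,
        # The queried name is an alias if it differs from the canonical name.
        "is_alias": normalize(canonical) != normalize(name),
    }
```

Calling probe_dns("gi.vo.tum.de") would perform a real lookup; whether
the name is reported as an alias depends on what the resolver returns
for it today.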


To distinguish virtual hosts from server aliases, NCS simply contacts
the two addresses that were returned by "GetHostByName()":


============================================================
gi.access_log:
sunhalle1.informatik.tu-muenchen.de - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 911

tum.access_log:
sunhalle1.informatik.tu-muenchen.de - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 2301
============================================================

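(NCS runs on "sunhalle1".) NCS simply compares the root documents of the
two addresses. If they are the same, the alias is _possibly_ a server
alias of the server; if they are different, the alias is a virtual host.
In the log excerpt the two responses differ in size (911 vs. 2301
bytes), which matches the "virtual server" verdict of the probe.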
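
To make the idea concrete, here is a minimal Python sketch of such a
comparison. It is my own illustration of the approach, not NCS's actual
code: the names fetch_root and classify_alias are hypothetical, and
comparing SHA-256 digests of the raw bytes is an assumption (a root
page with dynamic content would defeat it).

```python
import hashlib
from urllib.request import urlopen


def fetch_root(host, port=80, timeout=10):
    """Fetch the root document of http://host:port/ and return its bytes."""
    with urlopen(f"http://{host}:{port}/", timeout=timeout) as response:
        return response.read()


def classify_alias(alias, canonical, fetch=fetch_root):
    """Compare the root documents served under an alias and its canonical name.

    Identical documents only suggest a server alias (two virtual hosts
    could serve the same page), so the result is worded as "possible".
    Different documents mean the alias is treated as a virtual host.
    `fetch` is a parameter so the logic can be tested without a network.
    """
    alias_digest = hashlib.sha256(fetch(alias)).hexdigest()
    canonical_digest = hashlib.sha256(fetch(canonical)).hexdigest()
    if alias_digest == canonical_digest:
        return "possible server alias"
    return "virtual host"
```

With the names from the probe above, the call would be
classify_alias("gi.vo.tum.de", "w3proj1.ze.tu-muenchen.de").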

There might be some problems if one machine hosts several virtual
hosts, but in general that's a feature I'd _love_ to see in
ht://Dig. The last time I checked, ht://Dig indexed ~280,000 documents
in our domain, where 130,000 is a more realistic number. The
"server_aliases" directive didn't help much either. There are simply way
too many hosts to deal with manually!
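
For a few hundred names the check would have to run automatically. A
rough Python sketch of how the names could be grouped, assuming two
caller-supplied functions that are purely hypothetical here:
resolve(name) returns the IP address and fingerprint(name) returns some
digest of the root document. Names that share both end up in one group
and are candidates for being treated as aliases of each other.

```python
from collections import defaultdict


def group_names(names, resolve, fingerprint):
    """Group host names by (IP address, root-document fingerprint).

    Each returned list holds names that look like the same server.
    A list with a single entry is a name with no detected alias.
    """
    groups = defaultdict(list)
    for name in names:
        key = (resolve(name), fingerprint(name))
        groups[key].append(name)
    # Sort members so the output is stable regardless of input order.
    return [sorted(members) for members in groups.values()]
```

Keying on the IP address as well as the fingerprint keeps two unrelated
machines that happen to serve the same default page apart.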

Any comments on my suggestion?

-Walter Hafner

-- 
Walter Hafner_______________________________ [EMAIL PROTECTED]
       <A href=http://www.tum.de/~hafner/>*CLICK*</A>
 The best observation I can make is that the BSD Daemon logo
 is _much_ cooler than that Penguin :-)   (Donald Whiteside)