Hi!
I'm in the process of evaluating web crawler software for full-text
indexing purposes.
Currently we use ht://Dig 3.1.0b2 for indexing the whole
*.tu-muenchen.de domain. The domain consists of ~300 WWW servers that
answer to ~540 names (virtual hosts _and_ server aliases). All in all
there are ~130,000 documents to index. You can have a look at:
http://tum-index.ze.tu-muenchen.de/ (German).
This amount of data shows the limits of ht://Dig. :-)
While I'm quite happy with ht://Dig in general, there are a few things
that annoy me:
1) The lack of support for German umlauts (äöüß)
2) The somewhat limited queries.
3) The inability to distinguish virtual hosts from mere CNAMEs.
Unfortunately I don't know C++ at all, so I can't supply patches. If the
code were in C ...
I think that ht://Dig could 'borrow' a simple yet clever method to solve
problem (3). As I wrote, I'm evaluating alternatives to ht://Dig. Currently
I'm looking at Netscape's Compass Server (NCS). NCS offers a "site probe".
Here is a screen snippet:
------------------------------------------------------------
Site: http://gi.vo.tum.de:80/
[x] Show advanced DNS information
Checking URL: Doing DNS lookup....
GetHostByName() results for 'gi.vo.tum.de':
h_error: 0 - successful DNS query
Name: w3proj1.ze.tu-muenchen.de
aliases: gi.vo.tum.de
addrtype:2
length: 4
ip: 129.187.102.4
Result: gi.vo.tum.de is a valid name.
Note: gi.vo.tum.de appears to be an alias for the machine named
w3proj1.ze.tu-muenchen.de .
Checking URL for Redirect: Trying to Connect to Site...
Result: No Server redirect detected at http://gi.vo.tum.de:80/
Checking host for Virtual Server: Trying to Connect to Site...
Result: http://gi.vo.tum.de:80/ is really a virtual server being hosted on the server
w3proj1.ze.tu-muenchen.de.
------------------------------------------------------------
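For illustration, the DNS part of such a probe can be approximated in a
few lines of Python. This is only a sketch, not how NCS or ht://Dig
does it: the function names probe_dns and is_dns_alias are made up here,
and the resolver parameter exists only so the lookup can be swapped out.

```python
import socket


def probe_dns(name, resolver=socket.gethostbyname_ex):
    """Return (canonical_name, aliases, addresses) for a host name.

    By default this uses socket.gethostbyname_ex(), which reports the
    same three fields as the "Name", "aliases" and "ip" lines above.
    """
    canonical, aliases, addresses = resolver(name)
    return canonical, list(aliases), list(addresses)


def is_dns_alias(name, resolver=socket.gethostbyname_ex):
    """True if `name` resolves to a host with a different canonical name."""
    canonical, _aliases, _addresses = probe_dns(name, resolver)
    # Host names are case-insensitive and may carry a trailing dot.
    return canonical.rstrip(".").lower() != name.rstrip(".").lower()
```

Calling is_dns_alias("gi.vo.tum.de") only tells you that the name is an
alias at the DNS level. It cannot tell a virtual host from a server
alias; that needs the HTTP check.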
To distinguish virtual hosts from server aliases, NCS simply contacts
the two addresses that were returned by "GetHostByName()":
============================================================
gi.access_log:
sunhalle1.informatik.tu-muenchen.de - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 911
tum.access_log:
sunhalle1.informatik.tu-muenchen.de - - [11/Dec/1998:12:54:37 +0100] "GET / HTTP/1.0" 200 2301
============================================================
(NCS runs on "sunhalle1".) NCS simply compares the root documents of the
two addresses. If they are the same, the alias is _possibly_ a server
alias of the server; if they are different, the alias is a virtual host.
There might be some problems if one machine hosts several virtual
hosts, but in general that's a feature I'd _love_ to see in
ht://Dig. The last time I checked, ht://Dig indexed ~280,000 documents
in our domain, where 130,000 is a more realistic number. The
"server_aliases" directive didn't help much either. There are simply far
too many hosts to deal with manually!
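The root-document comparison described above could look roughly like
this in Python. It is a sketch under assumptions, not NCS or ht://Dig
code: fetch_root and classify_name are hypothetical names, the pages are
compared by SHA-256 digest (any byte-wise comparison would do), and
urllib follows redirects on its own, so a separate redirect check like
the one NCS performs is not shown.

```python
import hashlib
import urllib.request


def fetch_root(host, port=80, timeout=10):
    """Fetch the root document of http://host:port/ and return its bytes."""
    url = f"http://{host}:{port}/"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def classify_name(alias, canonical, fetch=fetch_root):
    """Compare the root documents served under two host names.

    Identical documents: the alias is *possibly* a server alias.
    Different documents: the alias is treated as a virtual host.
    """
    alias_digest = hashlib.sha256(fetch(alias)).hexdigest()
    canonical_digest = hashlib.sha256(fetch(canonical)).hexdigest()
    if alias_digest == canonical_digest:
        return "possible server alias"
    return "virtual host"
```

For the example above one would call
classify_name("gi.vo.tum.de", "w3proj1.ze.tu-muenchen.de"). Since the two
log entries show different sizes (911 vs. 2301 bytes), the documents
differ and the result would be "virtual host". Root pages with dynamic
content would make real server aliases look like virtual hosts, so the
answer is a hint, not proof.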
Any comments on my suggestion?
-Walter Hafner
--
Walter Hafner_______________________________ [EMAIL PROTECTED]
http://www.tum.de/~hafner/
The best observation I can make is that the BSD Daemon logo
is _much_ cooler than that Penguin :-) (Donald Whiteside)