Geoff,
Yes, yes, and - yes: 3.1.6, too!
It is still reproducible, and I have given up trying to work it
out myself. I see it with:
- some 3.2.0b3 release (built on OS X Server 1.2)
- htdig-3.2.0b4-20021103.tar.gz. (built on OS X Server 1.2)
- with update indexing
- creating a completely new database index
- AND!!! any 3.1.6 release on a customer NT box.
- Common to all cases: the index covers at least two domains.
A few months ago I saw this problem again with the older
release b3. It disappeared when we changed a few parameters
for the document ranking, rolled out all the sites again
from our CVS, and created a new index (including some
customer domains). So I suspected that the problem was caused
by inconsistent config files and/or database indices.
If it is useful for debugging, here is the list of domains
we currently index. In the meantime I have excluded all
customer domains from the index in order to find the
"abnormal" domain:
http://www.bwmc.net
http://archibald.bwmc.net
http://www.casadelsol.info
http://www.video-4-all.info
If you want to have a look at the problem, simply go to the
last domain, www.video-4-all.info, and search for the phrase
"AVI". The index is a fresh one, built this morning.
My observation:
You will be shown a few documents, of which the first two
are completely wrong. The EXCERPT of the second one should
appear with the first, because it belongs to the URL
[...]/avi.html.
As for the second result, I don't know where it got its ranking.
The document at this URL does not contain the word "AVI",
and the EXCERPT might belong to "any_doc.html".
On page three you will see two more cases. Here one document
(antiblooming.html) is shown with the wrong EXCERPT, and
the last document is wrong too: the document at that URL
does not contain "AVI", so maybe the EXCERPT is correct?
Page four finally shows the EXCERPT of "antiblooming.html"
with the wrong URL...
...and so on. Chaos everywhere. Check any other search phrase.
This is just an example. I get similar results when I restrict
indexing and searching to "exact" (no metaphone, no endings, etc.).
Never mind the ranking (e.g. backlink_factor):
Even if we set all factors to their default values and create a
fresh index, everything is still bad: the ranking changes, and
the mismatched URL/EXCERPT "pairs" simply show up at other
positions in the result list.
If you index this site alone, the search results behave normally
and all (most?) documents are listed correctly. I checked this
on a pre-production server.
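
The URL/EXCERPT mismatches described above could also be checked
mechanically instead of by eye. The following is only a minimal
sketch in Python, not part of htdig: the function name
find_mismatches, the example.invalid URLs, and the sample texts are
all made up for illustration. It assumes the result pages have
already been reduced to (URL, EXCERPT) pairs and that the plain text
of each document is available, and it uses a plain substring match,
so excerpts that htdig has trimmed or decorated would need extra
handling.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so comparisons are lenient."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_mismatches(results, documents, term):
    """Return (url, reason) tuples for results that look wrong.

    results   -- list of (url, excerpt) pairs taken from the result pages
    documents -- dict mapping url -> plain text of the document at that URL
    term      -- the search word (plain substring match, case-insensitive)
    """
    problems = []
    for url, excerpt in results:
        body = normalize(documents.get(url, ""))
        if term.lower() not in body:
            # The listed document does not contain the search word at all.
            problems.append((url, "term not in document"))
        elif normalize(excerpt) not in body:
            # The word is there, but the excerpt comes from somewhere else.
            problems.append((url, "excerpt not from this document"))
    return problems


# Hypothetical sample data, not taken from the real index.
documents = {
    "http://example.invalid/avi.html": "AVI is a container   format for video.",
    "http://example.invalid/other.html": "Cameras need cooling in summer.",
}
results = [
    ("http://example.invalid/avi.html", "Cameras need cooling"),
    ("http://example.invalid/other.html", "AVI is a container format"),
]

for url, reason in find_mismatches(results, documents, "AVI"):
    print(url, "->", reason)
```

Running the same check against the multi-domain index and against
the single-site index would show whether the swapped pairs really
appear only in the first case.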
Sorry, but I can't give you more hints. I am now looking more
closely at the HTML itself, because I suspect some abnormal HTML
tags. Maybe htdig is not robust enough?
If you are able to provide a fix, please let me know soon!
Regards,
Thilo
Geoff Hutchison wrote:
> On Monday, November 11, 2002, at 06:55 AM, Thilo Bauer wrote:
>> Release: htdig 3.2.0b3 and 3.2.0b4
>> As reported earlier we find search mismatch when indexing
>> more than one domain.
>
> Right. But saying "3.2.0b4" doesn't tell me much. Can you reproduce this
> on one of the latest snapshots, or does this only occur with older
> snapshots?
> <http://www.htdig.org/files/snapshots/>
> Does it seem to be happening consistently with update indexing?
> --
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to
<[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
--
bluewater multimedia concepts
Beratungsunternehmen für Anwendungsentwicklung,
Medien und Kommunikation
Dipl.-Phys. Thilo Bauer
Kölnstraße 191
D-53757 St. Augustin
Tel.: (02241) 25 05 8 - 0
Fax: (02241) 25 05 8 - 29
http://www.bwmc.net