According to Fogaras Daniel:
> I am planning to use the inverted file db.worddump with db.docs computed
> by HtDig using the option -t. However, the
> db.docs file contains incorrect lines. For example, it contains the
> following line:
> 
> 231     u:http://www.mit.edu/people/asundqui/home.html  t:IOA Programs
> a:0     m:1000759788    s:1218  H: PAPERS * A description of four
> algorithms written in IOA with accompanying graph data types dvi , ps > *
> A description of IOA code for a modified version of a spanning tree
> algorithm ...
> 
> This line is incorrect since the excerpt written after the "H:" flag
> belongs to another page. (Actually, the page containing the text is
> http://www.mit.edu/people/cluhrs/, which was also indexed by HtDig.)
> 
> Has anyone faced a similar problem? Is it possible that the problem is
> caused by excerpts of binary files which are not parsable? How should I
> configure HtDig to avoid such lines?

OK, first of all make sure that you're running the latest snapshot of
htdig 3.2.0b4, to take advantage of the latest bug fixes.  If you're
running 3.2.0b3 or earlier, or even an early snapshot of b4 (from this
spring or summer), you may have problems with database corruption.

Having said that, you may want to try another test for database
corruption.  Try searching for words in the document above, using
htsearch, and see if you get the same incorrect excerpt for this document.
If htsearch reports a different excerpt than what htdig -t dumped out,
then this would suggest a bug in the code causing inconsistent reporting
of excerpts.  If both report the same excerpt, it would suggest that
the record in db.docdb for this document is referring to the wrong
record in db.excerpts.  Both of these are keyed on the
document ID, so if one file points to the wrong entry in the other,
it would suggest a problem with database corruption or mishandling of
document IDs in the code.
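
As a complement to the htsearch comparison, here is a minimal Python
sketch that scans a db.docs dump for excerpts shared by more than one
document.  It assumes each record is a single line of tab-separated
fields, with the document ID first and then "key:value" fields such as
u: (URL) and H: (excerpt), as in the line quoted above; the function
names are only illustrative and are not part of htdig.  A shared excerpt
is only a hint of a mismatched record, since two pages can legitimately
begin with the same text.

```python
from collections import defaultdict


def parse_record(line):
    """Split one dump line into (doc_id, fields).

    Assumes tab-separated fields: the document ID first, then
    "key:value" pairs.  If a key repeats on one line, the last wins.
    """
    parts = line.rstrip("\n").split("\t")
    doc_id = parts[0].strip()
    fields = {}
    for part in parts[1:]:
        # Split on the first colon only, so URLs keep their own colons.
        key, sep, value = part.partition(":")
        if sep:
            fields[key] = value.strip()
    return doc_id, fields


def shared_excerpts(lines):
    """Map each excerpt used by 2+ documents to [(doc_id, url), ...]."""
    by_excerpt = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc_id, fields = parse_record(line)
        excerpt = fields.get("H", "")
        if excerpt:
            by_excerpt[excerpt].append((doc_id, fields.get("u", "")))
    return {e: docs for e, docs in by_excerpt.items() if len(docs) > 1}


# Possible usage, assuming the dump is in a file named db.docs:
#
#     with open("db.docs", encoding="latin-1") as dump:
#         for excerpt, docs in shared_excerpts(dump).items():
#             print(docs, excerpt[:60])
```

Any document IDs it reports are good candidates for the htsearch check
described above.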

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
