Sorry for the length of this!

> 
> According to David Adams:
> > Why does htmerge 3.1.5 flag some pages, which look OK to me, as 
> > "Deleted, invalid" and not index them?
> > 
> > This is happening not just with .html pages but also .doc and .pdf files.
> > 
> > It happens with a simple merge following a run of htdig -i -a
> > and also when two htdig runs are merged using the htmerge -m option.
> 
> htmerge does this when the remove_bad_urls attribute is true, and the
> page in question is not found (404 error), the server name no longer
> exists, the server is down, or in the case of an update dig, the page
> has been updated, superseding the old document database record for it.
> In the latter case, htdig creates a new record for the updated document,
> with a new DocID, so the old one is discarded.  As this only happens in
> update digs, it wouldn't be the case during an htdig -i, so I'd look at
> the other possibilities.
> 
> In any case, run both htdig and htmerge with at least two verbose options,
> and cross-reference the DocID of the "Deleted, invalid" messages to other
> messages with the same ID, to get a clearer picture of what's happening.
> 
> -- 
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
> 
> 

I've run htdig -vv followed by htmerge -vvv and I still cannot see
any reason why htmerge decides, apparently arbitrarily, that a page is
invalid.  None of the reasons given above seem to fit.

I'll take a single example: http://www.tregalic.co.uk/sacred-heart/ is
one of many URLs in the limit_urls_to directive.

Htdig finds http://www.tregalic.co.uk/sacred-heart/ and then
        http://www.tregalic.co.uk/sacred-heart/churchpage1.html
        http://www.tregalic.co.uk/sacred-heart/churchpage2.html
                  ...
        http://www.tregalic.co.uk/sacred-heart/churchpage7.html
amongst others.

Grepping for "churchpage" in the htmerge log I find:

htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage1.html    
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage2.html    
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html    
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html    
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html    
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage6.html    
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage7.html    
1897/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
1898/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
1903/http://www.tregalic.co.uk/sacred-heart/churchpage7.html
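
For anyone wanting to repeat this cross-reference on a full log rather
than by grepping, here is a minimal sketch in Python. It assumes the
htmerge output contains only the two line shapes seen in the excerpt
above ("<DocID>/<URL>" and "Deleted, invalid: <DocID>/<URL>"); the
function names and the sample text are my own, not part of ht://Dig.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   "[Deleted, invalid: ]<DocID>/<URL>"
LINE_RE = re.compile(
    r"^(?P<deleted>Deleted, invalid: )?(?P<docid>\d+)/(?P<url>\S+)\s*$"
)


def parse_htmerge_log(lines):
    """Return {docid: (url, deleted)} for every DocID/URL line found."""
    records = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            deleted = m.group("deleted") is not None
            records[int(m.group("docid"))] = (m.group("url"), deleted)
    return records


def deleted_urls(records):
    """Sorted list of URLs that htmerge flagged as 'Deleted, invalid'."""
    return sorted(url for url, deleted in records.values() if deleted)


# Sample lines copied from the log excerpt above (hypothetical subset).
sample = """\
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
""".splitlines()

records = parse_htmerge_log(sample)
for url in deleted_urls(records):
    print(url)
```

The "Merged URL" lines are deliberately skipped, since they carry no
DocID to match against.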

So I try an experiment: I reduce limit_urls_to to include only the
starting URL and http://www.tregalic.co.uk/sacred-heart/, then run htdig
and htmerge again.

Then htmerge reports:

htmerge: Total word count: 3806
0/http://www.soton.ac.uk/services/local/alpha.html
1/http://www.tregalic.co.uk/sacred-heart/
9/http://www.tregalic.co.uk/sacred-heart/baptism.html
2/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
3/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
4/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
5/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
6/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
7/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
8/http://www.tregalic.co.uk/sacred-heart/churchpage7.html
htmerge: 10
12/http://www.tregalic.co.uk/sacred-heart/information.html
11/http://www.tregalic.co.uk/sacred-heart/links.html
10/http://www.tregalic.co.uk/sacred-heart/newsletter.html

I do not accept that pages 4 & 5 just happened to be unavailable on the
first occasion and available on the second.  Nor can I see any
differences in the htdig logs for these pages.  The same sizes are
reported in both cases. 
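
Because the DocIDs differ between the two runs (1900 versus 5 for
churchpage4.html), the comparison has to be made by URL. A rough sketch
of that comparison in Python, again assuming the two line shapes seen in
the htmerge output quoted in this message; the function names and the
shortened sample logs are hypothetical.

```python
import re

# Assumed entry shape: "[Deleted, invalid: ]<DocID>/<URL>"
ENTRY_RE = re.compile(r"^(Deleted, invalid: )?(\d+)/(\S+)")


def status_by_url(log_text):
    """Map URL -> True if kept, False if flagged 'Deleted, invalid'."""
    status = {}
    for line in log_text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            status[m.group(3)] = m.group(1) is None
    return status


def flagged_only_in_first(first_log, second_log):
    """URLs deleted in the first run but kept in the second."""
    first = status_by_url(first_log)
    second = status_by_url(second_log)
    return sorted(
        url for url, kept in first.items()
        if not kept and second.get(url) is True
    )


# Shortened copies of the two htmerge outputs quoted in this message.
full_run = """\
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
"""
reduced_run = """\
4/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
5/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
6/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
7/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
"""

inconsistent = flagged_only_in_first(full_run, reduced_run)
for url in inconsistent:
    print(url)
```

Any URL this prints was treated differently by htmerge in the two runs,
which is the inconsistency I am describing.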

I think there is a bug in htmerge 3.1.5 which causes it to declare
some pages as "invalid" in some cases.

-- 
 
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton
