You say that pdf2html.pl works from the command line, but does doc2html.pl
work from the command line for PDF files?
"noindex" is not relevant in the case of PDF files, but the following might be:
- The PDF document contained no indexable text
- The PDF document was too large (see the max_doc_size: statement)
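To check the second possibility quickly, something like the following could list the PDF files that exceed a given size. This is only a rough sketch: the function name oversized_pdfs and the command-line arguments are made up for illustration, and it assumes the limit you pass in is the same byte value as the max_doc_size: setting in your htdig configuration file.

```python
import os
import sys


def oversized_pdfs(root, max_doc_size):
    """Return sorted (path, size) pairs for PDFs under root larger than max_doc_size bytes."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match the extension case-insensitively (.pdf, .PDF, ...)
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > max_doc_size:
                    hits.append((path, size))
    return sorted(hits)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_pdf_sizes.py <document root> <limit in bytes>
    for path, size in oversized_pdfs(sys.argv[1], int(sys.argv[2])):
        print(f"{size:>12}  {path}")
```

Any file it reports is a candidate for the "too large" case; files it does not report are more likely to fall under the first case (no indexable text).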
David Adams
University of Southampton
----- Original Message -----
Sent: Thursday, December 18, 2003 5:19 PM
Subject: [htdig] "deleted no excerpts " with pdf files
Hello everybody, I need help.
I'm using htdig 3.1.6 to parse HTML pages created by Apache mod_autoindex.
I can't merge PDF files; I always get the error message "Deleted, no excerpt".
I'm using doc2html.pl; it is OK for .doc files, but not for PDF files.
pdf2html.pl on the command line parses PDF files and creates HTML files.
I found this old post:
According to Paul COURBIS:
> When I run htmerge, I get a lot of messages:
> Deleted, no excerpt: xxx/http...
>
> What does it mean? Why does htmerge suppress so many documents from the
> database? As far as I understand English it seems that it means that
> there's no keyword for these pages, despite the fact that when I connect
> to it there's a lot of text...
The most common causes of this are:
- a noindex directive somewhere in the document
- the document was disallowed by robots.txt
- the server_max_docs limit was reached before this document could be parsed

You'd need to correlate the htmerge -v output back to the htdig -v (or -vv)
output to see which of these conditions occurred.

I think the first reason is the right one (I have no robots.txt), but I need
help to go further: what is a noindex directive?
Thanks a lot
--
Dominique FOURTUNE - ADEME Département MDE
05 55 10 27 49 - [EMAIL PROTECTED]
Computers work very well without Microsoft, and for less money: switch to Linux!