Re: [htdig] "deleted no excerpts " with pdf files

David Adams Fri, 19 Dec 2003 04:48:22 -0800

You say that pdf2html.pl works from the command line, but does doc2html.pl work from the command line for PDF files?

"noindex" is not relevant in the case of PDF files, but the following might be:

- The PDF document contained no indexable text.
- The PDF document was too large; see the max_doc_size attribute.
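
The size cause can be checked by hand before re-running htdig. Below is a minimal sketch in Python, assuming the PDF files are readable on local disk and that max_doc_size appears as a plain "max_doc_size: <bytes>" line in your htdig configuration file. The function names are illustrative and are not part of ht://Dig.

```python
import os


def read_max_doc_size(conf_text):
    """Return the max_doc_size value found in the config text, or None.

    Assumes a simple "max_doc_size: <integer>" line; text after '#'
    is treated as a comment and ignored.
    """
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("max_doc_size:"):
            return int(line.split(":", 1)[1].strip())
    return None


def oversized_files(paths, max_doc_size):
    """Return the paths whose size on disk exceeds max_doc_size bytes."""
    return [p for p in paths if os.path.getsize(p) > max_doc_size]
```

Any file this reports is a candidate for the "too large" cause. The limit then has to be raised in the configuration before htdig is run again.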

Do also consult the FAQ at <http://www.htdig.org/FAQ.html>.

David Adams

University of Southampton

----- Original Message -----

From: Dominique Fourtune

To: [EMAIL PROTECTED]

Sent: Thursday, December 18, 2003 5:19 PM

Subject: [htdig] "deleted no excerpts " with pdf files

Hello everybody, I need help.

I'm using htdig 3.1.6 to parse HTML pages created by Apache mod_autoindex.

I can't merge PDF files; I always get the error message "Deleted no excerpts".

I'm using doc2html.pl; it works for .doc files, but not for PDF files.

Run from the command line, pdf2html.pl parses PDF files and creates HTML files.

I found this old post:

According to Paul COURBIS:
> When I run htmerge, I get a lot of messages:
> Deleted, no excerpt: xxx/http...
>
> What does it mean? Why does htmerge suppress so many documents from the
> database? As far as I understand English, it seems to mean that
> there's no keyword for these pages, despite the fact that when I connect
> to them there's a lot of text...

The most common causes of this are:
- a noindex directive somewhere in the document
- the document was disallowed by robots.txt
- the server_max_docs limit was reached before this document could be parsed
You'd need to correlate the htmerge -v output back to the htdig -v (or -vv)
output to see which of these conditions occurred.
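
The correlation described above can be sketched in Python. This assumes both verbose outputs were saved to text files, and that each htmerge message has the form "Deleted, no excerpt: <identifier>" as quoted in the post. The function names are hypothetical, and the identifier is treated as an opaque string because the exact htdig -v line format is not shown here.

```python
import re

# Matches the message quoted in the post and captures what follows it.
DELETED_RE = re.compile(r"Deleted, no excerpt:\s*(\S+)")


def deleted_documents(htmerge_log):
    """Return the identifiers htmerge reported as 'Deleted, no excerpt'."""
    return DELETED_RE.findall(htmerge_log)


def matching_htdig_lines(htdig_log, identifier):
    """Return the htdig log lines that mention the given identifier."""
    return [line for line in htdig_log.splitlines() if identifier in line]
```

Passing the text of the htmerge log to deleted_documents, then each result to matching_htdig_lines along with the htdig log, shows what htdig printed for each deleted document.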

I think the first reason is the right one (I have no robots.txt), but I need help to go further: what is a noindex directive?

Thanks a lot
-- 
Dominique FOURTUNE - ADEME Département MDE
05 55 10 27 49 - [EMAIL PROTECTED]
Computers work very well without Microsoft, and for less money: switch to Linux!
