Author: Alexander Barkov
Email: [EMAIL PROTECTED]
Message:
> I have the same problem. I have been using 'pdftotext'. But if you look at the
> text files that this produces, there is no meta info included, so mnogosearch
> is never going to extract meaningful titles. The solution, I think, is to
> convert the pdf file to html.
> The HtDig mailing lists recommend using a program called 'doc2html'...
> http://www.htdig.org/files/contrib/parsers/
>
> This is a perl wrapper script which converts pdf, msword, wordperfect and
> others to html.
>
> When converting pdf files, it uses pdfinfo to extract meta data (title,
> keywords etc.) from the pdf file to generate the html <head> info, and
> pdftotext to generate the html <body> info.
>
> So doc2html seems like the ideal solution BUT I can't get it to work, and
> wondered if anyone else has any tips on how to use it.
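
In essence, what the quoted script does for a PDF boils down to combining
the two xpdf tools. A rough sketch of the idea (this is only an
illustration, not the actual doc2html code; the Title parsing is my own
guess):

#!/bin/sh
# Sketch only -- NOT the actual doc2html code.
# pdfinfo supplies metadata for <head>, pdftotext the text for <body>.
TITLE=`pdfinfo "$1" | grep '^Title:' | sed 's/^Title: *//'`
echo "<html><head><title>$TITLE</title></head><body><pre>"
pdftotext "$1" -    # "-" writes the extracted text to stdout
echo "</pre></body></html>"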


It really does seem to be the best solution. Taking a look into its
README, I noticed that doc2html takes two arguments: the first one is an
absolute file name (btw, a relative one didn't work for me) and the
second one is the content type. In the case of PDF documents we have to
pass application/pdf. Since indexer does not pass the content type when
calling an external parser, the idea is to write a shell script
doc2html.sh like this:

#!/bin/sh
# Wrapper for doc2html.pl: indexer passes only the file name,
# so hard-code the PDF content type here.
DOC2HTML=/path/to/doc2html.pl
# Discard doc2html's diagnostics so only the html reaches stdout.
$DOC2HTML "$1" "application/pdf" 2>/dev/null
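
Before pointing indexer at it, you can check the wrapper by hand
(the file names below are just placeholders):

sh /path/to/doc2html.sh /absolute/path/to/some.pdf > /tmp/test.html

and make sure /tmp/test.html contains a sensible <title>.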


Then make it executable and use this program as an external parser
of type FILE->STDOUT:

Mime application/pdf text/plain  "/path/to/doc2html.sh $1"
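
Since doc2html also understands msword, wordperfect and others, the same
trick extends to other document types. Here is a sketch of a single
wrapper that guesses the content type from the file extension (the case
list and the extra Mime line below are my own guesses, adjust as needed):

#!/bin/sh
# doc2html-any.sh -- pick the content type from the file extension.
DOC2HTML=/path/to/doc2html.pl
case "$1" in
  *.pdf) TYPE="application/pdf" ;;
  *.doc) TYPE="application/msword" ;;
  *)     echo "doc2html-any.sh: unknown extension: $1" >&2; exit 1 ;;
esac
$DOC2HTML "$1" "$TYPE" 2>/dev/null

with one Mime line per type in indexer.conf, e.g.:

Mime application/msword text/plain  "/path/to/doc2html-any.sh $1"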


Hope this helps




Reply: <http://search.mnogo.ru/board/message.php?id=2111>
