Re: [htdig] yet another pdf parser

David Adams Tue, 11 Sep 2001 01:05:43 -0700
> > > See
> > >
> > >   ftp://ftp.htdig.org/pub/htdig/contrib/parsers/doc2html.tar.gz
> > >
> > > for the latest and fanciest incarnation of these.
> >
> > I found that it was doing too many things I didn't need. Why should I
> > use the same script for different types when htdig is able to
> > choose the right one?
>
> It depends on what you're indexing.  Sure, for PDFs, they are usually
> tagged unambiguously by the server, so htdig can pick the right
> converter/parser.  The trick is the .doc files, which may be WP, Word,
> RTF, or something else, so having one script that looks at both the
> "magic number" at the start of the document as well as the server's
> returned Content-Type header can be a real benefit.
>
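
To illustrate the "magic number" point quoted above, here is a minimal sketch in Python of checking the first bytes of a document before falling back to the server's Content-Type. It is not taken from doc2html.pl; the function name `sniff`, the type labels, and the fallback table are assumptions made up for this example.

```python
# Known leading byte signatures. The labels on the right are hypothetical
# names chosen for this sketch, not anything htdig or doc2html.pl defines.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),
    # OLE2 compound file: the container used by binary Word .doc files
    # (and by other Office formats, so this is only a first guess).
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "msword"),
    (b"\xffWPC", "wordperfect"),
]

# Fallback used only when the content itself is not recognised.
FALLBACK = {
    "application/pdf": "pdf",
    "application/msword": "msword",
    "application/rtf": "rtf",
    "text/rtf": "rtf",
}


def sniff(head, content_type=""):
    """Guess a document type from its first bytes, then from Content-Type."""
    for magic, kind in MAGIC:
        if head.startswith(magic):
            return kind
    # Drop any parameters such as "; charset=..." before the lookup.
    ctype = content_type.split(";")[0].strip().lower()
    return FALLBACK.get(ctype, "unknown")
```

With this ordering, a file served as application/msword that actually starts with `{\rtf` is treated as RTF, which is the .doc case described above.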

Doc2html may be over the top for those only interested in Adobe documents,
but the tar file does include a pdf2html.pl Perl script which uses pdftotext
and pdfinfo, and can be used independently of doc2html.pl. It might be a
good starting point for someone developing their own PDF converter.
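
For anyone starting from scratch instead, a rough sketch of the same general idea in Python might look like the following. This is not how pdf2html.pl itself is written; it only assumes that `pdfinfo` prints "Key: value" lines and that `pdftotext file.pdf -` writes the text to standard output. The function names and the HTML layout are made up for this example.

```python
import html
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key:   value' lines into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def build_html(info, text, fallback_title):
    """Wrap extracted text in a minimal HTML page for an indexer to read."""
    title = html.escape(info.get("Title") or fallback_title)
    body = html.escape(text)
    return (
        "<html><head><title>%s</title></head>"
        "<body><pre>%s</pre></body></html>" % (title, body)
    )


def pdf_to_html(path):
    """Run pdfinfo and pdftotext on one file (both must be on the PATH)."""
    meta = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    ).stdout
    text = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    ).stdout
    return build_html(parse_pdfinfo(meta), text, path)
```

Calling `pdf_to_html("report.pdf")` (a hypothetical file name) would return a page whose title comes from the PDF's Title field when one is set, and from the file name otherwise.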

David Adams
Southampton University


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html