Re: [GENERAL] Indexing MS/Open Office and PDF documents

Richard Huxton Thu, 15 Mar 2012 14:18:43 -0700

On 15/03/12 21:12, Jeff Davis wrote:

On Fri, 2012-03-16 at 01:57 +0530, [email protected]

We have a hard time identifying MS/Open Office and PDF parsers to index
stored documents and make them available for text searching.

The first step is to find a library that can parse such documents, or
convert them to a format that can be parsed.

I've used docx2txt and pdf2txt and friends to produce text files that I
then index during the import process. An external script runs the whole
process. All I cared about was extracting raw text, though; this does
nothing to identify headings etc.
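
For illustration, the kind of external script I mean might look roughly
like the sketch below. It is not the script I actually ran. The converter
command lines (docx2txt.pl writing to stdout when given "-", pdf2txt.py
printing to stdout), the "documents" table and its columns, and the
'english' text search configuration are all assumptions; the cursor is
assumed to be a DB-API cursor using %s placeholders, as psycopg2 does.

```python
import subprocess
from pathlib import Path

# Assumed command lines; adjust to however the tools are installed locally.
CONVERTERS = {
    ".docx": ["docx2txt.pl", "{src}", "-"],  # assumption: "-" sends text to stdout
    ".pdf": ["pdf2txt.py", "{src}"],         # assumption: text is printed to stdout
}

# Hypothetical table: documents(path text, body text, body_tsv tsvector)
INSERT_SQL = (
    "INSERT INTO documents (path, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)


def converter_command(path):
    """Return the argv list that converts `path` to text, or None if unsupported."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [arg.format(src=path) for arg in template]


def extract_text(path, run=subprocess.run):
    """Run the external converter and return the raw text it prints."""
    cmd = converter_command(path)
    if cmd is None:
        raise ValueError(f"no converter configured for {path}")
    result = run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def index_document(cursor, path, run=subprocess.run):
    """Extract raw text from one file and store it with its tsvector."""
    body = extract_text(path, run=run)
    cursor.execute(INSERT_SQL, (str(path), body, body))
```

The import script would call index_document() once per file inside a
transaction. As with the original approach, only raw text is stored, so
headings and other structure are lost.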


--
  Richard Huxton
  Archonet Ltd
