AbiWord can convert any MS Word document into HTML, LaTeX, PostScript, or
plain text with very simple commands; I guess it also exposes some API that
can be integrated into document parsers/indexers.
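
For what it's worth, the command-line conversion could be wrapped roughly like
this. It is only a sketch: it assumes the abiword binary is on PATH and that
its --to and --to-name options behave as described in the AbiWord man page.
The helper names (abiword_command, convert_to_text) are made up for
illustration and are not part of AbiWord.

```python
import subprocess
from pathlib import Path


def abiword_command(src, fmt="txt", out_dir=None):
    """Build the argv list for a headless AbiWord conversion (hypothetical helper)."""
    src = Path(src)
    out_dir = Path(out_dir) if out_dir else src.parent
    target = out_dir / (src.stem + "." + fmt)
    # Assumption: --to selects the output format, --to-name the output file.
    argv = ["abiword", "--to=" + fmt, "--to-name=" + str(target), str(src)]
    return argv, target


def convert_to_text(src, out_dir=None):
    """Run AbiWord and return the extracted text; raises if the conversion fails."""
    argv, target = abiword_command(src, "txt", out_dir)
    subprocess.run(argv, check=True)
    return target.read_text(encoding="utf-8", errors="replace")
```

Something like convert_to_text("report.doc") would then give you a string that
can be handed to the indexer.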

Spreadsheets can be processed with the ExcelFormat library
http://www.codeproject.com/Articles/42504/ExcelFormat-Library

or the BasicExcel library
http://www.codeproject.com/Articles/13852/BasicExcel-A-Class-to-Read-and-Write-to-Microsoft

Or even the GNU Gnumeric project has some API to process spreadsheets, which
can be used to extract the text and index it.

Code to extract text from PDF
http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file


Overall, I guess there are bits and pieces available on the internet, and
some dedicated effort is needed to assemble them and develop them into a
finished product, namely a document indexer.
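
To give an idea of how the pieces might be glued together, here is a rough
sketch of the indexer side. Everything in it is an assumption for
illustration: the documents table, its column names, and the extractor
registry are hypothetical, and the real extractors (AbiWord, a spreadsheet
library, a PDF parser) would have to be plugged in. The only PostgreSQL parts
relied on are to_tsvector, plainto_tsquery, the @@ operator and a GIN index.
The SQL uses %s placeholders in the style of a driver such as psycopg2, which
is not imported here.

```python
from pathlib import Path

# Hypothetical table layout; body_tsv holds the searchable form of body.
CREATE_SQL = """
CREATE TABLE documents (
    id       serial PRIMARY KEY,
    filename text NOT NULL,
    body     text NOT NULL,
    body_tsv tsvector
);
CREATE INDEX documents_tsv_idx ON documents USING gin (body_tsv);
"""

INSERT_SQL = (
    "INSERT INTO documents (filename, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)

SEARCH_SQL = (
    "SELECT filename FROM documents "
    "WHERE body_tsv @@ plainto_tsquery('english', %s)"
)

# Maps a lower-case file suffix to a function that returns plain text.
EXTRACTORS = {}


def extractor(*suffixes):
    """Decorator that registers a text extractor for the given suffixes."""
    def register(func):
        for suffix in suffixes:
            EXTRACTORS[suffix.lower()] = func
        return func
    return register


@extractor(".txt")
def read_plain_text(path):
    return Path(path).read_text(encoding="utf-8", errors="replace")


def extract_text(path):
    """Pick an extractor by file suffix and return the document text."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError("no extractor registered for " + repr(suffix))
    return EXTRACTORS[suffix](path)


def insert_params(path):
    """Return the parameter tuple matching INSERT_SQL for one file."""
    body = extract_text(path)
    return (Path(path).name, body, body)
```

With a database cursor at hand, indexing a file would be something along the
lines of cursor.execute(INSERT_SQL, insert_params(path)), and searching would
be cursor.execute(SEARCH_SQL, ("some words",)).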

Wish you success!

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
On Fri, Mar 16, 2012 at 2:51 AM, dennis jenkins <dennis.jenkins...@gmail.com> wrote:

> On Thu, Mar 15, 2012 at 4:12 PM, Jeff Davis <pg...@j-davis.com> wrote:
> > On Fri, 2012-03-16 at 01:57 +0530, alexander.bager...@cognizant.com
> > wrote:
> >> Hi,
> >>
> >> We are looking to use Postgres 9 for the document storing and would
> >> like to take advantage of the full text search capabilities. We have
> >> hard time identifying MS/Open Office and PDF parsers to index stored
> >> documents and make them available for text searching. Any advice would
> >> be appreciated.
> >
> > The first step is to find a library that can parse such documents, or
> > convert them to a format that can be parsed.
>
> I don't know about MS-Office document parsing, but the "PoDoFo" (pdf
> parsing library) can strip text from PDFs.  Every now and then someone
> posts to the podofo mailing list with questions related to extracting
> text for the purposes of indexing it in FTS capable database.  Podofo
> has excellent developer support.  The maintainer is quick to accept
> patches, verify bugs, add features, etc...   Disclaimer: I'm not a pdf
> nor podofo expert.  I can't help you accomplish what you want.
