Word documents can be processed by AbiWord, which converts any MS Word document into HTML, LaTeX, PostScript, or plain text with very simple commands. I guess it also exposes some API that can be integrated into document parsers/indexers.
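As a rough sketch of those "very simple commands", here is how the AbiWord conversion could be driven from Python. This assumes the `abiword` binary is installed and on the PATH, and that its `--to` and `--to-name` command-line options behave as described in its man page. The function names `build_abiword_command` and `extract_text` are made up for this illustration.

```python
from __future__ import annotations

import shutil
import subprocess
import tempfile
from pathlib import Path


def build_abiword_command(source: Path, target: Path, fmt: str = "txt") -> list[str]:
    """Build the AbiWord command line that converts `source` into `target`.

    Assumption: `--to=<format>` selects the output format and
    `--to-name=<file>` names the output file.
    """
    return ["abiword", f"--to={fmt}", f"--to-name={target}", str(source)]


def extract_text(source: Path, timeout: float = 60.0) -> str:
    """Convert a Word document to plain text and return the text.

    Hypothetical helper: it shells out to AbiWord and reads the result back.
    """
    if shutil.which("abiword") is None:
        raise RuntimeError("abiword executable not found on PATH")

    # Convert into a temporary directory so nothing next to the source
    # document gets overwritten.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / (source.stem + ".txt")
        subprocess.run(
            build_abiword_command(source, target),
            check=True,       # raise if AbiWord exits with an error
            timeout=timeout,  # do not hang forever on a broken document
        )
        return target.read_text(encoding="utf-8", errors="replace")
```

The string returned by `extract_text(Path("report.doc"))` is what you would then hand over to the indexer.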
Spreadsheets can be processed with the ExcelFormat library
http://www.codeproject.com/Articles/42504/ExcelFormat-Library
or the BasicExcel library
http://www.codeproject.com/Articles/13852/BasicExcel-A-Class-to-Read-and-Write-to-Microsoft

The Gnumeric project also has some API for processing spreadsheets, which can be used to extract text for indexing.

Code to extract text from PDF:
http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

Overall, I guess there are bits and pieces available on the internet, and some dedicated effort is needed to assemble them and develop them into a finished product, namely a document indexer.

Wish you success!

------------------------------------------------------------

On Fri, Mar 16, 2012 at 2:51 AM, dennis jenkins <dennis.jenkins...@gmail.com> wrote:

> On Thu, Mar 15, 2012 at 4:12 PM, Jeff Davis <pg...@j-davis.com> wrote:
> > On Fri, 2012-03-16 at 01:57 +0530, alexander.bager...@cognizant.com
> > wrote:
> >> Hi,
> >>
> >> We are looking to use Postgres 9 for the document storing and would
> >> like to take advantage of the full text search capabilities. We have
> >> hard time identifying MS/Open Office and PDF parsers to index stored
> >> documents and make them available for text searching. Any advice would
> >> be appreciated.
> >
> > The first step is to find a library that can parse such documents, or
> > convert them to a format that can be parsed.
>
> I don't know about MS-Office document parsing, but the "PoDoFo" (pdf
> parsing library) can strip text from PDFs. Every now and then someone
> posts to the podofo mailing list with questions related to extracting
> text for the purposes of indexing it in FTS capable database. Podofo
> has excellent developer support. The maintainer is quick to accept
> patches, verify bugs, add features, etc...
> Disclaimer: I'm not a pdf nor podofo expert. I can't help you accomplish
> what you want.