[GENERAL] Indexing MS/Open Office and PDF documents

2012-03-15 Thread Alexander.Bagerman
Hi,

We are looking to use Postgres 9 for document storage and would like
to take advantage of its full text search capabilities. We are having a
hard time identifying MS/Open Office and PDF parsers to index stored
documents and make them available for text searching. Any advice would
be appreciated.

Regards,

-Alex




Re: [GENERAL] Indexing MS/Open Office and PDF documents

2012-03-15 Thread Jeff Davis
On Fri, 2012-03-16 at 01:57 +0530, alexander.bager...@cognizant.com
wrote:
> Hi,
>
> We are looking to use Postgres 9 for the document storing and would
> like to take advantage of the full text search capabilities. We have
> hard time identifying MS/Open Office and PDF parsers to index stored
> documents and make them available for text searching. Any advice would
> be appreciated.

The first step is to find a library that can parse such documents, or
convert them to a format that can be parsed.

After you do that, PostgreSQL allows you to load arbitrary code as
functions (in various languages), so that will allow you to make use of
the library. It's hard to give more specific advice until you've found
the library you'd like to work with.

Regards,
Jeff Davis



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Indexing MS/Open Office and PDF documents

2012-03-15 Thread Richard Huxton

On 15/03/12 21:12, Jeff Davis wrote:
> On Fri, 2012-03-16 at 01:57 +0530, alexander.bager...@cognizant.com
> wrote:
>> We have
>> hard time identifying MS/Open Office and PDF parsers to index stored
>> documents and make them available for text searching.
>
> The first step is to find a library that can parse such documents, or
> convert them to a format that can be parsed.


I've used docx2txt and pdf2txt and friends to produce text files that I
then index during the import process. An external script runs the whole
process. All I cared about was extracting raw text, though; this does
nothing to identify headings etc.
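
A minimal sketch of the kind of external import script described above,
assuming each converter is a command-line tool that writes plain text to
standard output. The command lines in CONVERTERS, the documents table and
its column names are illustrative assumptions, not something given in this
thread; check the invocation of whichever converters you actually install.

```python
import subprocess
from pathlib import Path

# Assumed command lines: each must write the extracted text to stdout.
# Verify the exact flags against the tools you actually install.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".docx": ["docx2txt", "{path}", "-"],
}

# Hypothetical table: documents(filename text, body text, body_tsv tsvector).
# Pass this to a PostgreSQL driver together with (filename, body, body).
INSERT_SQL = (
    "INSERT INTO documents (filename, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)


def build_command(path, converters=CONVERTERS):
    """Return the converter command for a file, chosen by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in converters:
        raise ValueError(f"no converter configured for {suffix!r}")
    # Substitute the real file path for the "{path}" placeholder.
    return [arg.replace("{path}", str(path)) for arg in converters[suffix]]


def extract_text(path, converters=CONVERTERS):
    """Run the converter and return the raw text it prints."""
    result = subprocess.run(
        build_command(path, converters),
        capture_output=True,
        text=True,
        check=True,  # raise if the converter exits with an error
    )
    return result.stdout
```

The text returned by extract_text would then be supplied as the parameters
of INSERT_SQL by the import script. Like the approach it illustrates, this
captures raw text only and does nothing to identify headings.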


--
  Richard Huxton
  Archonet Ltd



Re: [GENERAL] Indexing MS/Open Office and PDF documents

2012-03-15 Thread dennis jenkins
On Thu, Mar 15, 2012 at 4:12 PM, Jeff Davis pg...@j-davis.com wrote:
> On Fri, 2012-03-16 at 01:57 +0530, alexander.bager...@cognizant.com
> wrote:
>> Hi,
>>
>> We are looking to use Postgres 9 for the document storing and would
>> like to take advantage of the full text search capabilities. We have
>> hard time identifying MS/Open Office and PDF parsers to index stored
>> documents and make them available for text searching. Any advice would
>> be appreciated.
>
> The first step is to find a library that can parse such documents, or
> convert them to a format that can be parsed.

I don't know about MS Office document parsing, but PoDoFo (a PDF
parsing library) can strip text from PDFs.  Every now and then someone
posts to the PoDoFo mailing list with questions about extracting
text for the purpose of indexing it in an FTS-capable database.  PoDoFo
has excellent developer support.  The maintainer is quick to accept
patches, verify bugs, add features, etc.  Disclaimer: I'm neither a PDF
nor a PoDoFo expert.  I can't help you accomplish what you want.



Re: [GENERAL] Indexing MS/Open Office and PDF documents

2012-03-15 Thread Samba
Word documents can be processed by AbiWord, which converts any MS Word
document into HTML, LaTeX, PostScript, or plain-text formats with very simple
commands; I guess it also exposes some API that can be integrated into
document parsers/indexers.
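
As a rough illustration of the "very simple commands", AbiWord can be driven
from a script through its command-line conversion options. The sketch below
assumes an abiword binary on the PATH that supports --to and --to-name; the
helper names are made up for this example, and the options should be verified
against the installed version.

```python
import subprocess
from pathlib import Path


def abiword_command(src, fmt="txt"):
    """Build an AbiWord command converting src to a sibling file in fmt."""
    src = Path(src)
    target = src.with_suffix("." + fmt)
    cmd = ["abiword", f"--to={fmt}", f"--to-name={target}", str(src)]
    return cmd, target


def convert_with_abiword(src, fmt="txt"):
    """Run the conversion and return the converted file's text."""
    cmd, target = abiword_command(src, fmt)
    subprocess.run(cmd, check=True)  # raises if AbiWord reports failure
    return target.read_text(errors="replace")
```

Calling convert_with_abiword("letter.doc") would, under those assumptions,
write letter.txt next to the input and return its contents for indexing.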

Spreadsheets can be processed using the *ExcelFormat* library
http://www.codeproject.com/Articles/42504/ExcelFormat-Library

or the *BasicExcel* library
http://www.codeproject.com/Articles/13852/BasicExcel-A-Class-to-Read-and-Write-to-Microsoft

The Gnumeric project also has some API for processing spreadsheets, which
can be used to extract text for indexing.

Code to extract text from PDF
http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file


Overall, I guess there are bits and pieces available on the internet, and
some dedicated effort is needed to assemble them and develop them into a
finished product, namely a document indexer.

Wish you success!

