Re: [Dspace-tech] Searching of text from PDF files

Mark H. Wood Wed, 09 Sep 2009 13:46:51 -0700

On Tue, Sep 01, 2009 at 03:55:11PM +1000, Gary Browne wrote:
> When a user searches via the dspace web interface, is the search run
> across the content of text pdfs or just the metadata? If so, does the
> pdf submitted to the repository need to have been previously OCR'd, or
> does the repository attempt to extract & index text from all pdfs?


DSpace doesn't include OCR code.

The full-text extractor (which feeds the indexing) requires actual
coded-character text in the PDF to work with.  If all you have is a
bag of bitmaps (such as you often get from scanning paper documents
into PDF) then they contain nothing useful to extract; you'll need to
OCR or otherwise recover the character data before ingesting the file
into DSpace.
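
One way to check a file before ingest is sketched below. This is not part of DSpace; it is a rough pre-flight check that assumes the external `pdftotext` tool from Poppler is installed. The helper names (`extract_pages`, `pages_without_text`) and the 20-character threshold are made up for illustration. Pages that yield almost no characters are probably bitmaps and would need OCR first.

```python
import subprocess


def extract_pages(pdf_path):
    """Return per-page text from a PDF via the external `pdftotext` tool.

    Assumes Poppler's `pdftotext` is on the PATH; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    # pdftotext ends every page with a form feed, so drop the empty tail.
    pages = result.stdout.split("\f")
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def pages_without_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages with fewer than min_chars visible characters.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    empty = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # strip all whitespace
        if len(visible) < min_chars:
            empty.append(number)
    return empty
```

Calling `pages_without_text(extract_pages("thesis.pdf"))` (the file name is only an example) lists the pages that look image-only. An empty list suggests the extractor will have character data to index; it does not say anything about the quality of that text.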

-- 
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Friends don't let friends publish revisable-form documents.
