On Tue, Sep 01, 2009 at 03:55:11PM +1000, Gary Browne wrote:
> When a user searches via the dspace web interface, is the search run
> across the content of text pdfs or just the metadata? If so, does the
> pdf submitted to the repository need to have been previously OCR'd, or
> does the repository attempt to extract & index text from all pdfs?
DSpace doesn't include OCR code. The full-text extractor (which feeds the indexing) requires actual coded-character text in the PDF to work with. If all you have is a bag of bitmaps (such as you often get from scanning paper documents into PDF) then they contain nothing useful to extract; you'll need to OCR or otherwise recover the character data before ingesting the file into DSpace.

-- 
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Friends don't let friends publish revisable-form documents.
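One rough way to check a PDF before ingest is to see whether any coded-character text comes out of it at all. The sketch below is not part of DSpace; it assumes the `pdftotext` utility from Poppler is installed and on the PATH, and the function names and the `min_chars` threshold are purely illustrative.

```python
import subprocess


def extract_text(pdf_path):
    """Return whatever text pdftotext can find in the PDF.

    Assumption: Poppler's pdftotext is installed; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(text, min_chars=20):
    """Heuristic: a PDF that yields almost no visible characters is
    probably a bag of bitmaps and would give the indexer nothing to work with.

    min_chars is an arbitrary illustrative threshold, not a DSpace setting.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars
```

Calling `needs_ocr(extract_text("scan.pdf"))` on a hypothetical file would then flag documents that probably want OCR first. A file that mixes scanned and text pages can slip past a whole-document check like this, so treat it as a first filter only.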
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech