Not all PDF files are created equal. Their internals differ, and that
determines whether any text can be extracted from them.

To use Adobe's taxonomy of PDF files, you have essentially three types:

* "Formatted Text and Graphics": These are typically created from
applications like Word, are structurally equivalent to PostScript, and
are somewhat analogous to vector images. Text is represented internally as
text, and visual markup gives the document its look. You can use the
search tool in Acrobat Reader to search for text in these, and likewise,
Lucene can index them.

* "Image Only": These are typically created from basic scanning
applications, and are essentially bitmap images embedded in a PDF data
structure. These are not searchable or indexable by any means because
they contain no text identifiable as such.

* "Searchable Image": These are typically created from scanning
applications that have built-in OCR functionality, or by post-processing
"Image Only" PDFs. OCR is done on the image component and stored with
coordinate information within the PDF data structure. These are
searchable in Acrobat Reader and indexable by Lucene. Acrobat Reader,
because of the coordinate information, is even able to highlight search
hits in rectangles over the image where the OCRed word was found.

I suspect you have a mix of "Image Only" and "Searchable Image" type PDF
files, given your description of the project.
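
One rough way to sort the two types, assuming you have copied the
MediaFilter's extracted .txt files out to a directory: treat a blank or
nearly blank extract as a sign that the source PDF is "Image Only". This
is only a sketch; the directory layout, the min_chars threshold, and the
function names are hypothetical and not part of DSpace.

```python
from pathlib import Path
from typing import List


def is_blank_extract(text: str, min_chars: int = 20) -> bool:
    """Return True if the text has fewer than min_chars non-whitespace characters."""
    visible = "".join(text.split())  # drop spaces, newlines, form feeds
    return len(visible) < min_chars


def find_blank_extracts(directory: str, min_chars: int = 20) -> List[str]:
    """Return the names of .txt files in directory whose content looks blank."""
    blank = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if is_blank_extract(text, min_chars):
            blank.append(path.name)
    return blank
```

The files this reports are the candidates for OCR; the rest already carry
text that Lucene can index.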

We do a large amount of digitization and OCR here. In migrating page
images and OCR from another environment into DSpace, we were not able to
find a good tool to both build PDFs from our existing page images and
embed our existing OCR to create "Searchable Image" PDFs. To get
full-text searching out of the entire body of materials, you are
probably going to have to do what we did, and look at OCR tools that can
operate on "Image Only" PDF files.

Arguably the best tools would be from Adobe. Acrobat Pro can do batch
OCR, essentially conversion of "Image Only" to "Searchable Image".
Acrobat Capture can also do this with greater efficiency and at greater
cost.

You'll want to make sure that whatever process you end up with, you
don't compromise the existing image quality--some tools may try to be
helpful by downsampling the existing images, or applying different
compression levels to them.
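
A crude first check for this, assuming you keep each original next to
its OCRed output: adding a text layer should not make a file much
smaller, so a sharp drop in size hints that the images were downsampled
or recompressed. The 0.8 threshold and the function names are made up
for illustration, and a real check would compare the embedded images'
resolution directly.

```python
import os


def size_ratio(original_pdf: str, ocr_pdf: str) -> float:
    """Return the OCRed file's size as a fraction of the original's size."""
    original_size = os.path.getsize(original_pdf)
    if original_size == 0:
        raise ValueError("original file is empty: " + original_pdf)
    return os.path.getsize(ocr_pdf) / original_size


def looks_recompressed(original_pdf: str, ocr_pdf: str,
                       threshold: float = 0.8) -> bool:
    """Return True if the output shrank below threshold times the original size."""
    return size_ratio(original_pdf, ocr_pdf) < threshold
```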

Cory Snavely
University of Michigan Library IT Core Services

On Thu, 2007-03-29 at 09:02 -0600, Shawna Sadler wrote:
> A bunch of us in Canada have received theses from Library & Archives 
> Canada (national library) where they created PDFs from microfilmed theses.
> 
> We've loaded them into DSpace and we're noticing very inconsistent 
> behavior with MediaFilter. Some of the theses have extracted text and 
> some have blank .txt files.
> 
> Thesis with successfully extracted text
> https://dspace.ucalgary.ca/handle/1880/25057
> 
> Unsuccessful- blank .txt file
> https://dspace.ucalgary.ca/handle/1880/25028
> 
> Can anyone shed some light on this issue?
> 
> Thanks,
> Shawna
> 
> Shawna Sadler
> Coordinator, Digital Initiatives
> Libraries & Cultural Resources
> University of Calgary
> Phone: (403) 220-3739
> Email: [EMAIL PROTECTED]
> 
> 
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

