Not all PDF files are created equal. They contain different internals. To use Adobe's taxonomy of PDF files, you have essentially three types:
* "Formatted Text and Graphics": These are typically created from applications like Word, are structurally equivalent to PostScript, and are somewhat analogous to vector images. Text is represented internally as text, and visual markup gives the document its look. You can use the search tool in Acrobat Reader to search for text in these, and likewise, Lucene can index them.
* "Image Only": These are typically created from basic scanning applications, and are essentially bitmap images embedded in a PDF data structure. These are not searchable or indexable by any means because they contain no text identifiable as such.
* "Searchable Image": These are typically created from scanning applications that have built-in OCR functionality, or by post-processing "Image Only" PDFs. OCR is done on the image component and the resulting text is stored with coordinate information within the PDF data structure. These are searchable in Acrobat Reader and indexable by Lucene. Acrobat Reader, because of the coordinate information, is even able to highlight search hits in rectangles over the image where the OCRed word was found.

I suspect you have a mix of "Image Only" and "Searchable Image" type PDF files, given your description of the project.

We do a large amount of digitization and OCR here. In migrating page images and OCR from another environment into DSpace, we were not able to find a good tool to both build PDFs from our existing page images and embed our existing OCR to create "Searchable Image" PDFs.

To get full-text searching out of the entire body of materials, you are probably going to have to do what we did, and look at OCR tools that can operate on "Image Only" PDF files. Arguably the best tools would be from Adobe. Acrobat Pro can do batch OCR, essentially conversion of "Image Only" to "Searchable Image". Acrobat Capture can also do this with greater efficiency and at greater cost.

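As a rough way to find which files would need OCR, you could sort the extracted text files into blank and non-blank. This is only a sketch, not anything DSpace provides: it assumes you have copied the extracted text out to a directory of .txt files, and the names is_blank and triage_text_files are made up for illustration. A blank result suggests an "Image Only" PDF but does not prove it, since extraction can fail for other reasons.

```python
from pathlib import Path


def is_blank(extracted_text: str) -> bool:
    """True when the extracted text has no non-whitespace characters."""
    return extracted_text.strip() == ""


def triage_text_files(text_dir: str) -> dict:
    """Sort extracted .txt files into 'has_text' and 'needs_ocr' lists.

    Assumption: text_dir holds one .txt file per PDF, containing whatever
    text the extraction step produced for that PDF.
    """
    result = {"has_text": [], "needs_ocr": []}
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        key = "needs_ocr" if is_blank(text) else "has_text"
        result[key].append(path.name)
    return result
```

Calling triage_text_files on that directory returns the file names grouped under "has_text" and "needs_ocr"; the second group is the list of candidates to check by hand and then run through an OCR tool.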
You'll want to make sure that whatever process you end up with doesn't compromise the existing image quality: some tools may try to be helpful by downsampling the existing images, or applying different compression levels to them.

Cory Snavely
University of Michigan Library IT Core Services

On Thu, 2007-03-29 at 09:02 -0600, Shawna Sadler wrote:
> A bunch of us in Canada have received theses from Library & Archives
> Canada (national library) where they created PDFs from microfilmed theses.
>
> We've loaded them into DSpace and we're noticing very inconsistent
> behavior with MediaFilter. Some of the theses have extracted text and
> some have blank .txt files.
>
> Thesis with successfully extracted text
> https://dspace.ucalgary.ca/handle/1880/25057
>
> Unsuccessful- blank .txt file
> https://dspace.ucalgary.ca/handle/1880/25028
>
> Can anyone shed some light on this issue?
>
> Thanks,
> Shawna
>
> Shawna Sadler
> Coordinator, Digital Initiatives
> Libraries & Cultural Resources
> University of Calgary
> Phone: (403) 220-3739
> Email: [EMAIL PROTECTED]
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech