Alice, I did some digging around and it turns out that DSpace is using PDFBox to do the text extraction. Back in 2007, this bug was reported in PDFBox:
https://issues.apache.org/jira/browse/PDFBOX-234

It looks to have been fixed in PDFBox version 0.8.x. Our installed version of DSpace (1.6.1) is using PDFBox version 0.7.3. In digging through the DSpace Jira site, I found this, which indicates that the problem is fixed in DSpace 1.7.1:
https://jira.duraspace.org/browse/DS-704

DSpace now includes a much later version of PDFBox (1.2.1). I guess it's time to upgrade!

--Joel

Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | [email protected]

On Apr 12, 2011, at 3:53 PM, Platt, Alice wrote:

I have also run across this problem – it seems like even though my PDFs have readable text, DSpace chooses to OCR the text on its own, resulting in a lot of errors.

Alice Platt
Digital Initiatives Librarian
Shapiro Library
Southern New Hampshire University
2500 North River Rd
Manchester, NH 03106 USA

From: Hutchinson, Alvin [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 2:30 PM
To: '[email protected]'
Cc: Richard, Joel M
Subject: [Dspace-general] Filter Media Text Error

In recent weeks we have uploaded content (PDF) that produces some strange text when filter-media is run. The text in the PDF is selectable and readable but the corresponding *.txt file created by filter-media has removed all spaces between words. So we are unable to search for certain words (e.g. scientific plant or animal names) because the terms are all run together in one string.

I have attached both files, but if they are not transmitted due to listserv software, etc. an example is below.

My question: Has anyone else run across this or can anyone tell me what the problem is?
I once thought it was the manner in which these files were scanned, but I am able to select, copy and paste the text from the PDF and it maintains word and character spacing.

The PDF reads, for example:

larval stages of the Xanthidae are better known than those of any other family of the Brachyura. This doubtless is due to the fact that the adults habitually are found in shallow water near the shore and usually are very abundant. Ovigerous females may be taken without trouble, and thus the early zoeal stages may be known with certainty.

But the lines from the corresponding *.txt file show:

larvalstagesoftheXanthidaearebetterknownthanthoseofanyotherfamilyoftheBrachyura.Thisdoubtlessisduetothefactthattheadultshabituallyarefoundinshallowwaterneartheshoreandusuallyareveryabundant.Ovigerousfemalesmay betakenwithouttrouble,andthustheearlyzoealstagesmaybeknownwithcertainty

Thanks in advance for any help

Alvin Hutchinson
Smithsonian Institution Libraries
(202) 633-1031

_______________________________________________
Dspace-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-general
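
For anyone trying to find out which extracted *.txt files are affected, here is a rough sketch of one possible check, not anything DSpace or PDFBox provides. It flags text whose share of whitespace characters is suspiciously low, as in the example above. The function names and the 5% threshold are assumptions chosen for illustration; ordinary English prose is usually well above that, but the cutoff may need tuning for your content.

```python
def whitespace_ratio(text: str) -> float:
    """Fraction of characters in text that are whitespace (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_run_together(text: str, min_ratio: float = 0.05) -> bool:
    """Heuristic: True if non-empty text has almost no whitespace.

    min_ratio is an assumed cutoff, not a value taken from DSpace or PDFBox.
    """
    # Empty or whitespace-only text is not flagged: there is nothing to judge.
    if not text.strip():
        return False
    return whitespace_ratio(text) < min_ratio


# Samples taken from the message above.
pdf_text = ("larval stages of the Xanthidae are better known than those "
            "of any other family of the Brachyura.")
txt_text = ("larvalstagesoftheXanthidaearebetterknownthanthose"
            "ofanyotherfamilyoftheBrachyura.")

print(looks_run_together(pdf_text))
print(looks_run_together(txt_text))
```

The first sample should come back as fine and the second as flagged. Reading each *.txt file produced by filter-media and passing its contents to `looks_run_together` would give a list of items worth re-extracting after the upgrade.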
