There are a number of different versions of PDF and a number of applications that generate PDFs. Some combinations of version and application produce PDFs that are subtly misread by some of the applications that consume them.
I suggest that you try to narrow down which application was used to generate the PDFs you're having difficulty with. If you can isolate a set of versions and applications that give you trouble, you can then open and re-save the PDFs in a tool that doesn't have the problem. This can potentially be automated too, if you have many PDFs.

We have found, for example, that PDFCreator (the Windows-based PDF program that works like a print driver) strips out the full text when used to concatenate documents together. Once we discovered this, it was a relatively simple matter to adjust our workflow to compensate for the problem and catch the few bad PDFs that had already made it through into the collection.

cheers
stuart

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> I found out something very interesting this weekend. I took a .pdf file
> that was "unfilterable"; in other words filter-media displayed an error
> like this:
>
> "ERROR filtering, skipping bitstream #21220 java.io.IOException: Error:
> value is not an integer type actual='--20'"
>
> On a hunch, I looked at the document and found it had several pages of
> graphics/images in it. I deleted all pages in the document, which
> contained images and guess what? It filtered just fine.
>
> Hmmm…we have to be able to upload documents that contain images. NASA
> has a LOT of images in their documents. Now what??
>
> Sue Walker-Thornton
> NASA Langley Research Center
> (757) 224-4074
>
> -----Original Message-----
> From: Graham Triggs [mailto:[EMAIL PROTECTED]
> Sent: Friday, October 24, 2008 3:13 PM
> To: [email protected]
> Subject: Re: [Dspace-tech] filter-media problem - question on size limit
>
> If anyone has example PDFs that cause the text extraction to fail
> (smaller PDFs preferably!) that they are able to share, please send them
> - or a link to retrieve them - to me.
>
> Thanks,
> G
>
> Mark H. Wood wrote:
> >> I found this:
> >>
> >> http://java-source.net/open-source/pdf-libraries
> >>
> >> PJX and PDF Jester look, at first glance, as though they might be
> >> worth considering.
> >>
> >> OTOH it looks like PDFBox might be getting more attention in its new
> >> home, and if so, then it makes sense to stick with it and help to
> >> improve it.
> >>
> >> _______________________________________________
> >> DSpace-tech mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/dspace-tech

--
Stuart Yeates
Te Pātaka Kōrero o Te Whare Wānanga o te Ūpoko o te Ika a Māui
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository
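
As a rough illustration of how the "narrow down which version and application" step might be automated, here is a minimal Python sketch. It is not part of DSpace or filter-media; the function names describe_pdf and tally are made up for this example. It assumes the PDF version can be read from the "%PDF-x.y" header near the start of the file and that the generating application appears as a plain-text /Producer entry. That second assumption is only a heuristic: the entry will be missed when the metadata is compressed, encrypted, or stored as a hex or UTF-16 string, and such files are simply reported as "unknown".

```python
import re
from collections import Counter
from pathlib import Path
from typing import Tuple

# Heuristic patterns over the raw bytes of a PDF file.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")
# Literal-string form only: /Producer (some text), allowing escaped characters.
PRODUCER_RE = re.compile(rb"/Producer\s*\(((?:\\.|[^\\)])*)\)")


def describe_pdf(data: bytes) -> Tuple[str, str]:
    """Guess (pdf_version, producer) from raw PDF bytes; 'unknown' if not found."""
    header = HEADER_RE.search(data[:1024])  # the header sits near the start of the file
    version = header.group(1).decode("ascii") if header else "unknown"
    producer = PRODUCER_RE.search(data)
    name = producer.group(1).decode("latin-1") if producer else "unknown"
    return version, name


def tally(folder: str) -> Counter:
    """Count PDFs under a folder, grouped by (version, producer)."""
    counts: Counter = Counter()
    for path in Path(folder).rglob("*.pdf"):
        counts[describe_pdf(path.read_bytes())] += 1
    return counts
```

Running tally() once over a folder of PDFs that fail to filter and once over a folder of PDFs that filter cleanly, then comparing the two counts, would show whether the failures cluster around a particular version and producer. Only the files in a suspect group would then need to be opened and re-saved in another tool.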

