btw - which version of DSpace is in use? AFAIK 5.1 already uses pdfbox 1.8.x and 2.0
> On 01.06.2017 at 12:40, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>
> Hi,
>
>> On 01.06.2017 at 12:25, RENTON Scott <scott.ren...@ed.ac.uk> wrote:
>>
>> Hi Maruan, thanks for the swift response. It looks like it’s 1.6.0 (quite
>> old?) - that’s certainly the .jar that’s sitting in the DSpace lib directory.
>> I’ve copied in George as he’s investigating this too; George, I take it
>> we’re OK to send Maruan a link to the relevant records in the repository?
>>
>
> You should really upgrade either to the latest 1.8 release or to the 2.0
> release (the 1.8 API is more in line with 1.6, whereas 2.0 saw several
> changes; development now mainly goes into 2.0). Both releases contain many
> improvements for parsing malformed PDFs. In addition, with the tremendous
> help of the Tika colleagues, text extraction is now run against a much
> larger test corpus.
>
> You can download the pdfbox-app…jar
>
> http://www-us.apache.org/dist/pdfbox/2.0.6/pdfbox-app-2.0.6.jar
> http://www-us.apache.org/dist/pdfbox/1.8.13/pdfbox-app-1.8.13.jar
>
> and run the ExtractText command line tool to verify whether the issue you
> are facing still occurs with the newer versions.
>
> 1.6.0 was released in July 2011, so yes, quite old.
>
> BR
> Maruan
>
>> Cheers
>> Scott
>> --
>> Scott Renton
>>
>> Digital Development
>> Library and University Collections
>> Argyle House, Floor F
>> ext: 515219
>>
>> On 01/06/2017 11:18, "Maruan Sahyoun" <sahy...@fileaffairs.de> wrote:
>>
>>> Hi Scott,
>>>
>>> which version of PDFBox are you using? Is it possible to share one of the
>>> PDFs at a public location?
>>>
>>> BR
>>> Maruan
>>>
>>>> On 01.06.2017 at 12:11, RENTON Scott <scott.ren...@ed.ac.uk> wrote:
>>>>
>>>> Hi folks (apologies - hit send too soon)
>>>>
>>>> We run pdfbox for PDF text extraction under the DSpace application.
>>>>
>>>> Occasionally we get the odd failure, and we’re investigating some errors
>>>> just now. I’m just wondering what property of the PDF in question it’s
>>>> looking at here, and if there’s any way we can mitigate that. It’s
>>>> certainly not the title.
>>>>
>>>> One is:
>>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
>>>>     at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>>>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>>>>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>>>>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>>>>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>>>>     at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
>>>>
>>>> And here’s another:
>>>>
>>>> java.lang.NumberFormatException: For input string: "dup"
>>>> java.lang.NumberFormatException: For input string: "dup"
>>>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>     at java.lang.Integer.parseInt(Integer.java:492)
>>>>     at java.lang.Integer.parseInt(Integer.java:527)
>>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>>>>     at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>>>>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>>>>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>>>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>>>>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:
>>>> 5)
>>>>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
>>>>
>>>> Thanks
>>>> Scott
>>>>
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
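
For illustration only: the second trace quoted above ends in Integer.parseInt being handed the token "dup", and Integer.parseInt would likewise reject a lone "+" such as the one named in the first trace. The sketch below uses only the JDK and shows one way a parser might skip such tokens instead of aborting. It is not PDFBox or DSpace code and makes no claim about how any PDFBox release handles these files; the class name TolerantIntParser and the method tryParseInt are hypothetical.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration; not part of PDFBox or DSpace.
final class TolerantIntParser {

    private TolerantIntParser() {
    }

    /**
     * Parses a token as a base-10 int. Returns an empty OptionalInt
     * instead of throwing when the token is not a valid integer.
     */
    static OptionalInt tryParseInt(String token) {
        if (token == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(token.trim()));
        } catch (NumberFormatException e) {
            // Tokens such as "dup" or a lone "+" end up here.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Two valid integers and the two tokens seen in the stack traces.
        String[] tokens = {"65", "dup", "+", "-3"};
        for (String token : tokens) {
            OptionalInt value = tryParseInt(token);
            if (value.isPresent()) {
                System.out.println(token + " -> " + value.getAsInt());
            } else {
                System.out.println(token + " -> skipped (not an integer)");
            }
        }
    }
}
```

Returning an empty OptionalInt leaves the decision to the caller: a text extractor could skip the malformed entry and carry on with the rest of the page, while a stricter tool could still report it as an error.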