btw - which version of DSpace is in use? AFAIK 5.1 already uses pdfbox 1.8.x 
and 2.0

> On 01.06.2017 at 12:40, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> 
> Hi,
> 
>> On 01.06.2017 at 12:25, RENTON Scott <scott.ren...@ed.ac.uk> wrote:
>> 
>> Hi Maruan, thanks for the swift response. It looks like it’s 1.6.0 (quite 
>> old?); that’s certainly the .jar that’s sitting in the DSpace lib directory. 
>> I’ve copied in George as he’s investigating this too. George, I take it 
>> we’re OK to send Maruan a link to the relevant records in the repository?
>> 
> 
> You should really upgrade, either to the latest 1.8 release or to the 2.0 
> release. The 1.8 API is closer to 1.6, whereas 2.0 changed the API in several 
> places; development now mainly goes into 2.0. Both lines have gained many 
> improvements for parsing malformed PDFs. In addition, with the tremendous 
> help of the Tika colleagues, text extraction is now run against a much larger 
> test corpus.
> 
> You can download the pdfbox-app…jar
> 
> http://www-us.apache.org/dist/pdfbox/2.0.6/pdfbox-app-2.0.6.jar
> http://www-us.apache.org/dist/pdfbox/1.8.13/pdfbox-app-1.8.13.jar
> 
> and run the ExtractText command-line tool to check whether the issue you are 
> facing still occurs with the newer versions.
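
The check suggested above can be scripted. What follows is a minimal sketch, not part of PDFBox or DSpace: the class `ExtractTextCheck` and its methods are hypothetical names, and it assumes one of the pdfbox-app jars linked above has been downloaded and that `java` is on the PATH. It only builds and runs the `ExtractText` command line against a single PDF, so a failing file can be retried against 1.8.13 and 2.0.6 in turn.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of PDFBox): wraps the ExtractText command line.
class ExtractTextCheck {

    // Builds: java -jar <pdfbox-app jar> ExtractText <input.pdf> <output.txt>
    static List<String> command(Path appJar, Path pdf, Path out) {
        return Arrays.asList("java", "-jar", appJar.toString(),
                "ExtractText", pdf.toString(), out.toString());
    }

    // Runs the command, forwarding its console output, and returns the exit code.
    static int run(Path appJar, Path pdf, Path out) throws Exception {
        Process process = new ProcessBuilder(command(appJar, pdf, out))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Expected arguments: <pdfbox-app jar> <input.pdf> <output.txt>
        int exit = run(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println("ExtractText exit code: " + exit);
    }
}
```

Running it once per jar against the same PDF and comparing the console output should show whether the exceptions reported in this thread still appear with the newer releases.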
> 
> 1.6.0 was released in July 2011, so yes, quite old.
> 
> BR
> Maruan
> 
> 
>> Cheers
>> Scott
>> -- 
>> Scott Renton
>> 
>> Digital Development
>> Library and University Collections
>> Argyle House, Floor F
>> ext: 515219
>> 
>> On 01/06/2017 11:18, "Maruan Sahyoun" <sahy...@fileaffairs.de> wrote:
>> 
>>> Hi Scott,
>>> 
>>> which version of PDFBox are you using? Is it possible to share one of the 
>>> PDFs at a public location?
>>> 
>>> BR
>>> Maruan
>>> 
>>>> On 01.06.2017 at 12:11, RENTON Scott <scott.ren...@ed.ac.uk> wrote:
>>>> 
>>>> 
>>>> Hi folks (apologies, hit send too soon)
>>>> 
>>>> We run PDFBox for PDF text extraction under the DSpace application.
>>>> 
>>>> Occasionally we get the odd failure, and we’re investigating some errors 
>>>> just now. I’m just wondering what property of the PDF in question it’s 
>>>> looking at here, and if there’s any way we can mitigate that. It’s 
>>>> certainly not the title.
>>>> 
>>>> 
>>>> One is:
>>>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>>>> at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
>>>> at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
>>>> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
>>>> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>>>> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>>>> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>>>> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>>>> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>>>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
>>>> 
>>>> 
>>>> And here’s another:
>>>> 
>>>> java.lang.NumberFormatException: For input string: "dup"
>>>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>> at java.lang.Integer.parseInt(Integer.java:492)
>>>> at java.lang.Integer.parseInt(Integer.java:527)
>>>> at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>>>> at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>>>> at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>>>> at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>>>> at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>>>> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>>>> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:
>>>> 5)
>>>> at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
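
The second trace shows `Integer.parseInt` being handed the token `dup` while the encoding of a Type 1 font is read. The sketch below is a stdlib-only illustration of that failure mode and of a tolerant alternative; it is not PDFBox's code, and the class `TolerantInt`, its methods, and the sample line `dup 32 /space put` are assumptions made for the example.

```java
import java.util.OptionalInt;

// Hypothetical illustration (not PDFBox code): parsing integers without
// letting a stray token such as "dup" or "+" abort the whole extraction.
class TolerantInt {

    // Returns the parsed value, or empty if the token is not a valid integer.
    static OptionalInt tryParse(String token) {
        try {
            return OptionalInt.of(Integer.parseInt(token));
        } catch (NumberFormatException e) {
            // Integer.parseInt("dup") fails with: For input string: "dup"
            return OptionalInt.empty();
        }
    }

    // Returns the first whitespace-separated token of the line that parses
    // as an integer, skipping tokens that do not.
    static OptionalInt firstInt(String line) {
        for (String token : line.trim().split("\\s+")) {
            OptionalInt value = tryParse(token);
            if (value.isPresent()) {
                return value;
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstInt("dup 32 /space put"));
        System.out.println(tryParse("dup"));
    }
}
```

Whether the newer PDFBox releases handle these particular files gracefully is something only a test run against the PDFs can confirm.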
>>>> 
>>>> Thanks
>>>> Scott
>>>> -- 
>>>> Scott Renton
>>>> Digital Development
>>>> Library and University Collections
>>>> Argyle House, Floor F
>>>> ext: 515219
>>>> 
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org