Hi Jose,
please look at the last comment in the Jira issue:
https://jira.duraspace.org/browse/DS-704?focusedCommentId=17807&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_17807
 
Andrea

On 18/10/2010 18:04, Blanco, Jose wrote:
> I tried using this patch and I get the following message:
>
>
> 1) com.ibm.icu:icu4j:jar:3.8.1
>
>    Try downloading the file manually from the project website.
>
>    Then, install it using the command:
>        mvn install:install-file -DgroupId=com.ibm.icu -DartifactId=icu4j -Dversion=3.8.1 -Dpackaging=jar -Dfile=/path/to/file
>
> Where do I get icu4j:jar:3.8.1?
>
> Thank you!
> Jose
> -----Original Message-----
> From: Tim Donohue (DuraSpace JIRA) [mailto:no-re...@duraspace.org]
> Sent: Monday, October 18, 2010 10:13 AM
> To: dspace-devel@lists.sourceforge.net
> Subject: [Dspace-devel] [DuraSpace JIRA] Updated: (DS-704) Update pdfbox library to improve performance and out-of-box support for pdf extraction
>
>
>       [ https://jira.duraspace.org/browse/DS-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Tim Donohue updated DS-704:
> ---------------------------
>
>      Status: Open  (was: Received)
>
> +1  I'd vote to go ahead and upgrade the version of PDFBox we are using in 
> DSpace 1.7.0.  I know there were several issues with the older version.
>
>
>
>> Update pdfbox library to improve performance and out-of-box support for pdf extraction
>> --------------------------------------------------------------------------------------
>>
>>                  Key: DS-704
>>                  URL: https://jira.duraspace.org/browse/DS-704
>>              Project: DSpace
>>           Issue Type: Improvement
>>           Components: DSpace API
>>             Reporter: Andrea Bollini
>>             Assignee: Andrea Bollini
>>              Fix For: 1.7.0
>>
>>          Attachments: dspace-pdfbox.patch
>>
>>
>> We have found that updating the pdfbox library to the latest stable version 
>> (1.2.1) solves all our current issues with PDF text extraction and improves 
>> performance.
>> This could help people who want to rely on the DSpace "out-of-box" PDF 
>> extractor without using XPDF.
>> Below are some samples of exceptions that go away after updating the pdfbox 
>> version. Patch attached against trunk r5439.
>> ==
>> java.io.IOException: Error: Could not find font(COSName{F1.0}) in map={}
>> at org.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:83)
>> at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>> at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>> at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>> at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>> at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
>> ===
>> java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
>> at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:70)
>> at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
>> at org.pdfbox.cos.COSStream.doDecode(COSStream.java:243)
>> at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
>> at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
>> at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
>> at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
>> at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>> at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>> at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
>> ====
>> java.io.IOException: Unknown colorspace array type:COSName{DeviceRGB}
>> at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:116)
>> at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
>> at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:193)
>> at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>> at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>> at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
>> ===
>> java.lang.NullPointerException
>> at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>> at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>> at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
>> ===
>> java.util.zip.ZipException: unknown compression method
>> at java.util.zip.InflaterInputStream.read(Unknown Source)
>> at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
>> at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
>> at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
>> at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
>> at org.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:66)
>> at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:450)
>> at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
>> at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
>> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
>> ===
>> java.lang.ArrayIndexOutOfBoundsException
>> at java.lang.System.arraycopy(Native Method)
>> at java.io.PushbackInputStream.unread(Unknown Source)
>> at org.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:524)
>> at org.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:873)
>> at org.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:94)
>> at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:451)
>> at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
>> at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
>> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
>> ===
>> java.io.EOFException: Unexpected end of ZLIB input stream
>> at java.util.zip.InflaterInputStream.fill(Unknown Source)
>> at java.util.zip.InflaterInputStream.read(Unknown Source)
>> at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
>> at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
>> at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
>> at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
>> at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
>> at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
>> at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
>> at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>> at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>> at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)


-- 
Dott. Andrea Bollini
Project Manager, IT Architect & Systems Integrator
Sezione Servizi per le Biblioteche e l'Editoria Elettronica
CILEA, http://www.cilea.it
tel. +39 06-59292853
cel. +39 348-8277525



_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
