[ 
https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524060#comment-15524060
 ] 

Hudson commented on TIKA-2098:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #57 (See [https://builds.apache.org/job/tika-2.x-windows/57/])
TIKA-2098 small clean up.  Test for writelimitreached for each catchable (tallison: rev cde4c0aa8b668e0964f2b83fab67588292ffc993)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java


> Tika.parseToString() with maxLength doesn't work correctly for PDF files
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2098
>                 URL: https://issues.apache.org/jira/browse/TIKA-2098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Alexander Kazakov
>            Assignee: Tim Allison
>              Labels: java, parser, pdf
>             Fix For: 2.0, 1.14
>
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata 
> metadata, int maxLength) and maxLength is smaller than the content size, the 
> call throws an exception instead of returning the truncated text.
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
>       at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
>       at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>       at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>       at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>       at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>       at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>       at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>       ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>       at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>       at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>       at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>       at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>       at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
>       ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>       at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       ... 51 more
> Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>       at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       ... 52 more
> {code}
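
In the trace above, the write-limit signal sits several "Caused by" levels below the TikaException that reaches the caller, so checking only the outermost exception misses it. The following is a minimal sketch of walking a cause chain to find such a signal. It is not the code from the commit: WriteLimitReached is a hypothetical stand-in for Tika's WriteOutContentHandler$WriteLimitReachedException, and the class name, method name, and depth cap are assumptions made for illustration.

```java
import java.io.IOException;

class WriteLimitCauseCheck {

    /** Hypothetical stand-in for Tika's write-limit exception (not the real class). */
    static class WriteLimitReached extends Exception {
        WriteLimitReached(String message) {
            super(message);
        }
    }

    /** Returns true if the throwable, or any of its causes, is the write-limit signal. */
    static boolean isCausedByWriteLimit(Throwable t) {
        int depth = 0;
        // The depth cap (an arbitrary assumption) guards against cyclic cause chains.
        while (t != null && depth < 100) {
            if (t instanceof WriteLimitReached) {
                return true;
            }
            t = t.getCause();
            depth++;
        }
        return false;
    }

    public static void main(String[] args) {
        // Rebuild a chain shaped like the reported trace: limit -> IOException -> outer exception.
        Throwable limit = new WriteLimitReached("requested limit has been reached");
        Throwable io = new IOException("Unable to write a string", limit);
        Throwable outer = new RuntimeException("Unable to extract all PDF content", io);

        System.out.println(isCausedByWriteLimit(outer));
        System.out.println(isCausedByWriteLimit(new IOException("unrelated failure")));
    }
}
```

A caller that finds the signal anywhere in the chain could treat the parse as a normal truncation and return the text collected so far, while rethrowing any other failure.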



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
